al., we use several from their top-10 list of data-mining algorithms in [10] to perform the benchmark analysis, such as k-means, naive Bayes, etc.

FCA-based algorithms also belong to the second (interpretable) group of algorithms, since they use concepts in order to classify objects. The intent of a concept can be interpreted as a set of rules that is supported by the extent of the concept. However, for non-binary data the computation of the concepts and their relations can be very time-consuming. In the case of credit scoring we deal with numerical data, since categorical variables can be transformed into a set of dummy variables.
Lazy classification [82] seems appropriate to use in this case, since it provides the decision maker with a set of rules for the loan application and can be easily parallelized. Also, early attempts to relate FCA-based methods to statistically derived solutions of the classification problem can be found in [79]. The similarities and differences of the proposed approaches will be discussed together with the Query-Based Classification Algorithm in the following sections.

3.1 Formal Concept Analysis

Let G be a set (of objects), let (D, ⊓) be a meet-semilattice (of all possible object descriptions) and let δ : G → D be a mapping. Then (G, D, δ), where D = (D, ⊓), is called a pattern structure [51], provided that the set δ(G) := {δ(g) | g ∈ G} generates a complete subsemilattice (Dδ, ⊓) of (D, ⊓).
Elements of D are called patterns or descriptions and are naturally ordered by the subsumption relation ⊑: given c, d ∈ D, one has c ⊑ d ↔ c ⊓ d = c. The operation ⊓ is also called a similarity operation. A pattern structure (G, D, δ) gives rise to the following derivation operators (·)^□:

A^□ = ⊓_{g ∈ A} δ(g)   for A ⊆ G,
d^□ = {g ∈ G | d ⊑ δ(g)}   for d ∈ (D, ⊓).

These operators form a Galois connection between the powerset of G and (D, ⊓). The pairs (A, d) satisfying A ⊆ G, d ∈ D, A^□ = d, and d^□ = A are called pattern concepts of (G, D, δ), with pattern extent A and pattern intent d. The operator (·)^□□ is an algebraic closure operator on patterns, since it is idempotent, extensive, and monotone [51].
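To make the two derivation operators concrete, the following is a minimal Python sketch for a finite pattern structure; the function names intent and extent, and the idea of passing the meet and subsumption operations as callables, are illustrative assumptions rather than part of the formalism. A pair (A, d) is then a pattern concept exactly when intent(A, ...) returns d and extent(d, ...) returns A.

```python
from functools import reduce
from typing import Callable, Dict, Hashable, Set, TypeVar

D = TypeVar("D")  # type of pattern descriptions

def intent(A: Set[Hashable],
           delta: Dict[Hashable, D],
           meet: Callable[[D, D], D]) -> D:
    """A^□: the meet of the descriptions of all objects in the (non-empty) set A."""
    return reduce(meet, (delta[g] for g in A))

def extent(d: D,
           delta: Dict[Hashable, D],
           subsumes: Callable[[D, D], bool]) -> Set[Hashable]:
    """d^□: all objects g whose descriptions subsume d, i.e. d ⊑ δ(g)."""
    return {g for g in delta if subsumes(d, delta[g])}
```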
The concept-based learning model for the standard object-attribute representation (i.e., formal contexts) is naturally extended to pattern structures. Suppose we have a set of positive examples G+ and a set of negative examples G− w.r.t. a target attribute, with G+ ∩ G− = ∅; objects from Gτ = G \ (G+ ∪ G−) are called undetermined examples. The description set, denoted by D, consists of tuples of intervals, i.e. D = {([a1; b1], . . . , [aK; bK]) | ai, bi ∈ R, i = 1, . . . , K}, where K is the dimensionality of the attribute space.
For example, for K = 3 one element of D is d = ([1; 2], [−0.5; 0.3], [150; 340]). For two descriptions d1, d2 ∈ D, d1 = ([a1; b1], . . . , [aK; bK]) and d2 = ([m1; n1], . . . , [mK; nK]), the meet operation ⊓ is defined as

d1 ⊓ d2 = ([min(a1, m1); max(b1, n1)], . . . , [min(aK, mK); max(bK, nK)]).

If d1 ⊓ d2 = d1, then we write d1 ⊑ d2.
An interval pattern structure is a triple (G, D, δ), where D = (D, ⊓), i.e. a set of objects together with the set of possible descriptions, the meet operation ⊓ and a mapping δ.

A pattern c ∈ D is an α-weak positive premise (classifier) iff

|c^□ ∩ G−| / |G−| ≤ α   and   ∃A ⊆ G+ : c ⊑ A^□.

A pattern h ∈ D is an α-weak negative hypothesis iff

|h^□ ∩ G+| / |G+| ≤ α   and   ∃A ⊆ G− : h ⊑ A^□.
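The following is a minimal sketch of these definitions for interval data, assuming descriptions are stored as tuples of (low, high) pairs; the names interval_meet, subsumes and is_alpha_weak_positive_premise are illustrative. Note that the condition ∃A ⊆ G+ : c ⊑ A^□ reduces to c covering at least one positive example, since A may be taken to be a single object.

```python
from typing import List, Tuple

Interval = Tuple[float, float]          # [low; high]
Description = Tuple[Interval, ...]      # one interval per attribute

def interval_meet(d1: Description, d2: Description) -> Description:
    """d1 ⊓ d2: component-wise convex hull of the intervals."""
    return tuple((min(a, m), max(b, n)) for (a, b), (m, n) in zip(d1, d2))

def subsumes(c: Description, d: Description) -> bool:
    """c ⊑ d  ⟺  c ⊓ d = c, i.e. every interval of d lies inside the corresponding interval of c."""
    return interval_meet(c, d) == c

def is_alpha_weak_positive_premise(c: Description,
                                   pos: List[Description],
                                   neg: List[Description],
                                   alpha: float) -> bool:
    """c covers at most an alpha share of G− and at least one object of G+."""
    neg_share = sum(subsumes(c, d) for d in neg) / len(neg)
    return neg_share <= alpha and any(subsumes(c, d) for d in pos)

# Worked example: the description from the text and a second, purely hypothetical one.
d1 = ((1.0, 2.0), (-0.5, 0.3), (150.0, 340.0))
d2 = ((1.5, 3.0), (0.0, 0.1), (200.0, 250.0))   # hypothetical
print(interval_meet(d1, d2))  # ((1.0, 3.0), (-0.5, 0.3), (150.0, 340.0))
```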
In the case of credit scoring we work with pattern structures on intervals, since a typical object-attribute data table is not binary but has many-valued attributes. Instead of binarizing (scaling) the data, one can work directly with the many-valued attributes by applying interval pattern structures.

3.2 Lazy Classification with Pattern Structures

Suppose we have a training set (negative and positive examples with some target attribute) and unclassified test objects. Also, let the object descriptions form a pattern structure. Then the target attribute of a test object gn can be computed as the closure (w.r.t. (G, (D, ⊓), δ)) of the intersection of the description of gn with the description of every object g ∈ G.
If for some object g the closure contains the target attribute, gn is classified as positive; otherwise the test object is classified as negative. Computationally, this can be described as the following simple two-stage procedure:

• For every g ∈ G compute (δ(gn) ⊓ δ(g))^□, i.e. select all objects from G whose descriptions contain (δ(gn) ⊓ δ(g)). This takes O(|G| · (p(⊓) + |G| · p(⊑))) time, where p(⊓) and p(⊑) denote the costs of computing the meet and of testing subsumption.

• If for some g ∈ G all objects from (δ(gn) ⊓ δ(g))^□ have the target attribute, classify gn positively, otherwise negatively.
This takes O(|G|^2) time for looking up the target attribute in the object descriptions: at most |G| families of object subsets, each subset consisting of at most |G| objects.

Lazy classification is thus reduced to computing (δ(gn) ⊓ δ(g))^□ and testing the target attribute in all objects of this set. This computation is easily parallelizable: one partitions the dataset G into G = G1 ∪ . . . ∪ Gk, where k is the number of processors, computes in each Gi the sets of objects (δ(gn) ⊓ δ(g))^□, and tests the target attribute for all objects in the union of these sets over i.
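A minimal sketch of this two-stage procedure for interval data is given below; the labelled-training-set representation, the interval_meet/subsumes helpers and the name classify_lazy are assumptions made for illustration. Parallelization is not shown: each iteration of the outer loop only reads the training data, so the loop can be split across the partitions G1, . . . , Gk described above.

```python
from typing import List, Tuple

Interval = Tuple[float, float]
Description = Tuple[Interval, ...]

def interval_meet(d1: Description, d2: Description) -> Description:
    return tuple((min(a, m), max(b, n)) for (a, b), (m, n) in zip(d1, d2))

def subsumes(c: Description, d: Description) -> bool:
    return interval_meet(c, d) == c   # c ⊑ d

def classify_lazy(test: Description,
                  train: List[Tuple[Description, bool]]) -> bool:
    """Return True iff for some training object g every object in the extent of
    δ(gn) ⊓ δ(g) (within the training set) has the target attribute."""
    for desc_g, _ in train:
        candidate = interval_meet(test, desc_g)        # δ(gn) ⊓ δ(g)
        covered = [label for d, label in train if subsumes(candidate, d)]
        if all(covered):                               # g itself is always covered
            return True
    return False
```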
3.3 Query-Based Classification Algorithm

In credit scoring the object-attribute data is typically numerical. Factors can have arbitrary distributions and take a wide range of values. At the same time, categorical variables and dummies can be present. With a relatively large number of attributes (over 30-40) this produces a high-dimensional space of continuous variables. In such a space the result of the meet operator tends to be very specific, i.e. for almost every g ∈ G only g and gn have the description δ(gn) ⊓ δ(g). This happens because numerical variables, especially ratios, can take unique values for every object. As a result, for a test object gn the number of positive and negative premises is close to the number of observations in the sets of positive and negative examples, correspondingly.

In other words, overly specific descriptions are usually not falsified (i.e. there are no objects of the opposite class that satisfy the description) and almost always form either positive or negative premises. Therefore, the voting scheme for lazy classification may become uninformative in the case of high-dimensional numerical data. Thus, it seems reasonable to seek concepts with a larger extent and a less specific intent. At the same time, we would like to preserve the advantages of lazy classification, e.g. no need to compute the full concept lattice, easy parallelization, etc. The way to increase the extent of the generated concepts is to consider the intersection of the test object with more than one element from the set of positive (negative) examples.
What is a suitable number of objects to take for the intersection? In our modification we treat it as a hyperparameter, the subsample size, and tune it via grid search. This hyperparameter is expressed as a percentage of the observations in the set of positive (negative) examples. As the subsample size grows, the resulting intersection δ(g1) ⊓ . . . ⊓ δ(gk) ⊓ δ(gn) becomes more generic and is more frequently falsified by the objects from the set of examples with the opposite class label. Strictly speaking, in order to replicate the lazy classification approach with the subsample size hyperparameter, one would have to consider all possible combinations of subsample-size objects from the sets of positive (negative) examples. Clearly, this is not feasible for large datasets: for example, with 10 000 objects in the set of positive examples and a subsample size of only two objects, there are almost 50 mln combinations for intersection with the test object. Therefore, we randomly take the chosen number of objects from the set of positive (negative) examples as candidates for intersection with the test object. The number of times (i.e. the number of iterations) we randomly extract a subsample from the set of examples is the second hyperparameter of the algorithm, which is also tuned through grid search; a sketch of this sampling scheme is given below.
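The following is a minimal sketch of the sampling scheme under the assumptions used above (interval descriptions as tuples of (low, high) pairs); the names mine_premises, subsample_share and n_iterations are illustrative. It intersects the test description with random subsamples of one class and keeps only the candidates that are falsified by at most an α share of the other class; premises mined in this way from the positive and from the negative side can then be confronted in a voting scheme to label the test object.

```python
import random
from typing import List, Tuple

Interval = Tuple[float, float]
Description = Tuple[Interval, ...]

def interval_meet(d1: Description, d2: Description) -> Description:
    return tuple((min(a, m), max(b, n)) for (a, b), (m, n) in zip(d1, d2))

def subsumes(c: Description, d: Description) -> bool:
    return interval_meet(c, d) == c   # c ⊑ d

def mine_premises(test: Description,
                  same_class: List[Description],   # e.g. descriptions of G+
                  other_class: List[Description],  # e.g. descriptions of G−
                  subsample_share: float,          # hyperparameter 1: subsample size (share of examples)
                  n_iterations: int,               # hyperparameter 2: number of random subsamples
                  alpha: float = 0.0,
                  seed: int = 0) -> List[Description]:
    """Mine candidate premises δ(g1) ⊓ ... ⊓ δ(gk) ⊓ δ(gn) from random subsamples."""
    rng = random.Random(seed)
    k = max(1, round(subsample_share * len(same_class)))
    premises = []
    for _ in range(n_iterations):
        candidate = test
        for d in rng.sample(same_class, k):          # intersect with a random subsample
            candidate = interval_meet(candidate, d)
        falsified = sum(subsumes(candidate, d) for d in other_class) / len(other_class)
        if falsified <= alpha:                        # keep only the α-weak candidates
            premises.append(candidate)
    return premises
```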
Intuitively, the higher the number of iterations, the more premises should be mined from the data. However, the obvious penalty for increasing the value of this hyperparameter is the time and the resources required for computing the intersections. As we mentioned, the greater the subsample size, the more likely it is that (δ(g1) ⊓ . .