the target variable is distributed continuously). In order to make FCA techniques applicable to this case, a new definition of an augmented interval pattern structure is given.

An augmented interval pattern structure is a quadruple $(G, (D, \sqcap), \delta, h)$, where $G$ is a set of objects, $D$ is a set of possible object descriptions $d \in D$, and $\sqcap$ is a meet operator. A description $d$ in the credit scoring domain is a tuple consisting of two elements, $d_x$ and $d_y$: $d_y$ is an interval for the target attribute $y$, and $d_x$ is a tuple of intervals for the explanatory attributes $x$, which are supposed to predict the target attribute $y$.

Let there be a mapping $\delta: G \to D$ and, additionally, an empirical distribution function $h \in H$, where $H$ is a family of density functions for the target attribute. We will also use the notation $\delta_x$ and $\delta_y$ to distinguish between descriptions containing the explanatory attributes and the target attribute, respectively.
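As a rough illustration of these definitions (a hypothetical representation, not the author's appendix code), an object description could be stored in R as a list of interval-valued explanatory attributes plus a target interval; the function name `make_description` is an assumption, and the values are taken from the toy example given further below.

```r
# Sketch: one object description delta(g) with degenerate intervals [v; v]
# for the explanatory attributes (d_x) and the target attribute (d_y).
make_description <- function(x_values, y_value) {
  list(x = lapply(x_values, function(v) c(v, v)),   # explanatory intervals
       y = c(y_value, y_value))                     # target interval
}

g1 <- make_description(c(x1 = 30,   x2 = 10),   0.5)
g2 <- make_description(c(x1 = 35,   x2 = 12),   0.7)
g3 <- make_description(c(x1 = 31.5, x2 = 11.5), 0.8)
```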
The definition of the meet operator $\sqcap$ is left unchanged. Suppose we have an arbitrary set of objects $A_0 \subseteq G$, i.e.:

$$A_0 = \{g_1, g_2, \dots, g_k\},$$
$$\delta(g_i) = (\delta_x(g_i), \delta_y(g_i)) = ([a_1^i; b_1^i], \dots, [a_J^i; b_J^i], [c^i; e^i]) \quad \text{for } i = 1, \dots, k,$$

where $J$ is the number of explanatory attributes. Then we define the derivation operator $\diamond$ in the following way:

$$A_0^{\diamond} = (d_0, h_0),$$

where $d_0 = (d_{x0}, d_{y0})$, $d_{x0} = \delta_x(g_1) \sqcap \dots \sqcap \delta_x(g_k)$, and the target attribute description $d_{y0} = \delta_y(g_1) \sqcap \dots \sqcap \delta_y(g_k)$, which is in fact a single interval $[y_{min}; y_{max}]$, while $h_0$ is a mapping $d_{y0} \to [0; 1]$.
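A minimal R sketch of the interval meet and of the derivation operator under the representation above (illustrative only; function names are assumptions, not the appendix implementation):

```r
# Sketch: the meet of two intervals is their convex hull [min lower; max upper].
interval_meet <- function(a, b) c(min(a[1], b[1]), max(a[2], b[2]))

# Meet of two descriptions: componentwise meet of the explanatory intervals
# and meet of the target intervals.
description_meet <- function(d1, d2) {
  list(x = Map(interval_meet, d1$x, d2$x),
       y = interval_meet(d1$y, d2$y))
}

# Derivation operator on a set of objects A0: the common description
# (d_x0, d_y0) plus the observed target values that define h0.
derive <- function(objects) {
  d <- Reduce(description_meet, objects)
  list(d = d, y_values = sapply(objects, function(o) o$y[1]))
}

A0 <- list(g1, g2)
A0_diamond <- derive(A0)
# A0_diamond$d$x: [30; 35], [10; 12];  A0_diamond$d$y: [0.5; 0.7];  h0 from {0.5, 0.7}
```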
Here $h_0$ is the empirical density distribution function of the target attribute values in $A_0$:

$$h([t_{j-1}; t_j)) = \frac{\sum_{g \in A_0} \mathbb{1}_{[t_{j-1}; t_j) \sqsubseteq \delta_y(g)}}{|A_0|}, \quad \forall j = 1, \dots, N,$$

where $t_0 = y_{min}$, $t_N = y_{max}$, $\Delta = t_j - t_{j-1} = \frac{y_{max} - y_{min}}{N}$, and $\mathbb{1}$ is the indicator function.

We will use the composition of the derivation operator $\diamond$ in a similar way to how it is used with interval pattern structures; however, it returns the image of the description $d_{x0}$ whatever the target description $d_{y0}$ and the density function $h$ are:

$$A_0^{\diamond\diamond} = (d_0, h_0)^{\diamond} \stackrel{\text{def}}{=} d_{x0}^{\diamond} = A_1.$$

In order to approach the target attribute prediction problem it is useful to define an α-weak premise with ω-allowed dropout. A description $d = (d_x, d_y) \in D$ of an augmented interval pattern structure is called an α-weak premise with ω-allowed dropout iff

$$1 - \frac{\left|\{g \in A \mid m - \omega(m - y_{min}) \le y(g) \le m + \omega(y_{max} - m)\}\right|}{|A|} \le \alpha,$$

where $y(g)$ is the value of the target attribute for object $g$, $A = d_x^{\diamond}$, $d_y = [y_{min}; y_{max}]$ is the interval for the target attribute, and $m$ is the median of the empirical density distribution function $h$ that describes the target attribute values within the interval $d_y$ for objects from $A$.

Below we provide an example to show how the new definitions work. Let the object set be $G = \{g_1, g_2, g_3\}$ and let the description space consist of two explanatory attributes $x_1, x_2$ and one target attribute $y$:

Objects\Attributes    x1      x2      y
g1                    30      10      0.5
g2                    35      12      0.7
g3                    31.5    11.5    0.8

Let $A_0 = \{g_1, g_2\}$. Then:

$\delta_x(g_1) = ([30; 30], [10; 10])$, $\delta_y(g_1) = [0.5; 0.5]$,
$\delta_x(g_2) = ([35; 35], [12; 12])$, $\delta_y(g_2) = [0.7; 0.7]$,
$d_0 = (d_{x0}, d_{y0})$,
$d_{x0} = \delta_x(g_1) \sqcap \delta_x(g_2) = ([30; 35], [10; 12])$,
$d_{y0} = \delta_y(g_1) \sqcap \delta_y(g_2) = [0.5; 0.7]$,
$h_0 = \{0.5, 0.7\}$,
$A_0^{\diamond} = (d_0, h_0)$,
$A_0^{\diamond\diamond} = (d_0, h_0)^{\diamond} = d_{x0}^{\diamond} = A_1 = \{g_1, g_2, g_3\}$,
$d_1 = ([30; 35], [10; 12], [0.5; 0.8])$,
$h_1 = \{0.5, 0.7, 0.8\}$,
$A_0^{\diamond\diamond\diamond} = A_1^{\diamond} = (d_1, h_1)$.

The description $d_0 = ([30; 35], [10; 12], [0.5; 0.7])$ is a 1/3-weak description with 1-allowed dropout, since the median of 0.5 and 0.7 equals 0.6, so the window $[0.6 - (0.6 - 0.5);\ 0.6 + (0.7 - 0.6)] = [0.5; 0.7]$ covers two of the three objects in $A_1$.

The first stage of the Query Based Regression Algorithm (QBRA) is mining α-weak premises with ω-allowed dropout; the second stage is to perform a prediction for a test object $g_t$ based on the mined premises.
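The α-weak check used in the first stage can be sketched as follows, reproducing the toy example above (an illustrative sketch under the earlier assumptions, not the appendix code; the sample median of the premise's target values stands in for the median of $h$):

```r
# Sketch: test the alpha-weak premise condition with omega-allowed dropout.
#   premise_y : target values of the objects that generated the premise
#               (they define d_y = [y_min; y_max] and the density h)
#   extent_y  : target values y(g) of all objects g in the image A = d_x^diamond
is_weak_premise <- function(premise_y, extent_y, alpha, omega) {
  y_min <- min(premise_y)
  y_max <- max(premise_y)
  m     <- median(premise_y)                 # median of the empirical density h
  lower <- m - omega * (m - y_min)
  upper <- m + omega * (y_max - m)
  outside <- sum(extent_y < lower | extent_y > upper)
  outside / length(extent_y) <= alpha        # share of dropped-out objects
}

# Toy example from the text: d_y0 = [0.5; 0.7], image A1 = {g1, g2, g3}.
is_weak_premise(premise_y = c(0.5, 0.7),
                extent_y  = c(0.5, 0.7, 0.8),
                alpha = 1/3, omega = 1)      # TRUE: 1 - 2/3 = 1/3 <= alpha
```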
The subsample size is a hyperparameter that sets the number of objects randomly extracted from $G$. Then the α and ω hyperparameters are specified; they control the anti-support in terms of both frequency and magnitude. After objects $A_0 = \{g_1, \dots, g_k\}$ are randomly extracted, one calculates the pattern $d_0 = \delta_x(g_1) \sqcap \dots \sqcap \delta_x(g_k) \sqcap \delta_x(g_t)$ and the density distribution function $h_0$ for the target attribute values. If $d_0$ is an α-weak premise with ω-allowed dropout, it is added to the collection of premises that will be used for prediction. After premise mining, the next stage is to build up a prediction for the target attribute based on the mined premises.
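The mining stage just described could be sketched as the loop below (an illustrative reconstruction reusing the helper functions from the previous sketches; names, the representation of the test object, and other details are assumptions, not the appendix code):

```r
# Sketch: mine alpha-weak premises with omega-allowed dropout for a test
# object g_t by meeting its description with random subsamples of G.
mine_premises <- function(train, g_t, n_iter, subsample_size, alpha, omega) {
  premises <- list()
  for (it in seq_len(n_iter)) {
    A0 <- train[sample(seq_along(train), subsample_size)]
    # meet of the explanatory parts of the subsample and of the test object
    d_x <- Reduce(function(a, b) Map(interval_meet, a, b),
                  lapply(c(A0, list(g_t)), `[[`, "x"))
    premise_y <- sapply(A0, function(o) o$y[1])  # targets defining d_y0 and h0
    # image of d_x: training objects whose explanatory intervals lie inside d_x
    in_image <- sapply(train, function(o)
      all(mapply(function(box, v) v[1] >= box[1] && v[2] <= box[2], d_x, o$x)))
    extent_y <- sapply(train[in_image], function(o) o$y[1])
    if (is_weak_premise(premise_y, extent_y, alpha, omega))
      premises[[length(premises) + 1]] <- list(d_x = d_x, y = premise_y)
  }
  premises
}

# Hypothetical usage on the toy data, with g_t described by x-attributes only:
# g_t <- list(x = list(x1 = c(32, 32), x2 = c(11, 11)))
# ps  <- mine_premises(list(g1, g2, g3), g_t, n_iter = 100,
#                      subsample_size = 2, alpha = 1/3, omega = 1)
```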
The resulting prediction was defined as the median of the mixture of distributions from all premises.

To test the algorithm, we used financial data from the balance sheets and profit and loss statements of 612 corporate clients of a top-10 Russian bank. Among other factors we used the assets-to-liabilities ratio, the debt-to-equity ratio, earnings before taxes and interest payments, return on assets, etc. These clients were assessed at the time of early insolvency signals, and the resulting recovery rate was collected. The accuracy of predictions was evaluated in terms of the mean absolute deviation (MAD):

$$MAD = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,$$

where $y_i$ is the target attribute (recovery rate) for the $i$-th client in the test set and $\hat{y}_i$ is the predicted value.
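Under the same assumptions, the prediction step and the MAD metric could be sketched as follows; pooling the premises' target values and taking one median is a simplification of the mixture described above, valid for equal subsample sizes:

```r
# Sketch: predict y for the test object as the median of the mixture of the
# empirical target distributions carried by all mined premises (pooled here),
# and evaluate test-set accuracy with the mean absolute deviation.
predict_target <- function(premises) median(unlist(lapply(premises, `[[`, "y")))

mad_error <- function(y_true, y_pred) mean(abs(y_true - y_pred))
```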
The algorithm was benchmarked against a random forest model. The MAD distribution shows that the lazy algorithm allows one to obtain a prediction error lower than that of the tuned random forest. The distributions represent the accuracy achieved over a large number of algorithm runs, each with a unique combination of hyperparameter values. Other benchmarks are provided below.

The conclusion emphasizes that the key feature of risk management practice is that, regardless of the model accuracy, the model must remain interpretable.
Formal concept analysis offers attractive instruments for extracting knowledge from data, since the intents of concepts can be treated as association rules. FCA-based algorithms are suitable for predictive modeling in areas where clarity of model interpretation is a high priority.
Also, the results show that these randomized modifications for classification and regression tasks outperform classical methods used in banks, such as scorecards and decision trees, in terms of Gini and mean absolute deviation. Therefore, it is argued that the proposed FCA-based classification and regression algorithms can compete with the ordinary statistical instruments adopted in banks and still provide sets of rules relevant for loan applicants.

In the Appendix, programming code for both QBCA and MLRA is provided. Some key functions for the meet operator, intent and extent calculation, premise mining, and final predictions are given. The language used is R (https://www.r-project.org/), since its intuitive syntax and vectorized operations let the reader grasp the idea behind the algorithm implementations. However, for production implementations other languages are recommended, such as Java or Spark (for distributed systems).

Results Summary

1. A randomized FCA-based algorithm for classification rules mining is developed.
2. The concept of the α-weak premise and other parameters of the algorithm are introduced. Prediction accuracy analysis is performed with regard to the algorithm hyperparameter values tuned on credit scoring data.
3. An algorithm is developed that makes it possible to use the apparatus of interval pattern structures in the regression problem, with several hyperparameters: the number of iterations, the alpha-threshold, the subsample size, the ω-dropout, and the penalty for a high variance of the target variable on the right-hand side of the expanded pattern (penalty for high deviation).
4. The concept of an augmented interval pattern structure is introduced, as well as the concept of ω-dropout for α-weak descriptions. The new definitions help to solve the regression problem via formal concept analysis methods.
5. The interpretability of the proposed algorithms is analyzed from the standpoint of a credit decision maker. The accuracy of the algorithms is compared with credit scoring models and other benchmarks (both "white-box" and "black-box").
6. A query-based classification algorithm was developed with three hyperparameters: the number of iterations, the alpha-threshold, and the subsample size. Accuracy analysis of the algorithm's predictions depending on the hyperparameter values was performed, and an intuitive explanation is given for the results obtained.
7. The developed algorithms were implemented as program code in the R language.