Summary (Randomized Algorithms Based on Interval Pattern Structures), page 2

The file "Summary" is part of the archive "Randomized Algorithms Based on Interval Pattern Structures", a dissertation submitted for the degree of Candidate of Technical Sciences, hosted in the НИУ ВШЭ file archive under "технические науки" (postgraduate and doctoral studies).
It covers the key role that modeling plays in risk management and reviews widespread statistical algorithms used for classification and regression tasks. In the context of credit risk assessment, two parameters are emphasized: probability of default (PD) and loss given default (LGD). From a data science standpoint, PD estimation is a binary classification problem, while LGD estimation is a regression problem.
The tradeoff between prediction accuracy and model interpretability is emphasized, since some regulators require banks to be able to provide reject reasons for borrowers, and since central banks examining bank models want to understand the economic intuition behind them in order to verify that the models will show expected and stable performance. Loan default prediction with the use of scorecards is discussed, since this method is widely adopted in the banking industry and is used thereafter as a benchmark for "white-box" models.
The weight-of-evidence (WOE) transformation of raw factors is designed to adequately account for outliers and non-monotonic dependencies before feeding the data into a logistic classifier. "Black-box" models are discussed using the example of neural networks, which are contrasted with transparent models that let the user understand why the algorithm predicts a particular probability of default for a client. The third section contains the first novelty: application of formal concept analysis (FCA) to classification problems on datasets with a large number of observations.
Basic FCA definitions are provided (pattern structure, meet operator, derivation operator, pattern intent and extent), and new definitions of α-weak premises³ are given. Suppose we have a set of positive examples G+ (objects of the positive class) and a set of negative examples G− (objects of the negative class), with G+ ∩ G− = ∅ and G+ ∪ G− = G. Let the description set be denoted by D; it consists of tuples with intervals as elements, i.e. D = {([a_1; b_1], …, [a_K; b_K]) | ∀i: a_i, b_i ∈ ℝ}, where K is the dimensionality of the attribute space. For example, for K = 3 one element of D is d = ([1; 2], [−0.5; 0.3], [150; 340]).

³ Also known as classifiers or hypotheses.

Let us define a mapping δ: G → D such that for g ∈ G: δ(g) = ([x_1; x_1], …, [x_K; x_K]), i.e. each object has as its description a point in K-dimensional real space. For two descriptions d_1, d_2 ∈ D, d_1 = ([a_1; b_1], …, [a_K; b_K]) and d_2 = ([c_1; e_1], …, [c_K; e_K]), the meet operation ⊓ is defined as:

d_1 ⊓ d_2 = ([min(a_1, c_1); max(b_1, e_1)], …, [min(a_K, c_K); max(b_K, e_K)])

If d_1 ⊓ d_2 = d_1, this is denoted d_1 ⊑ d_2. An interval pattern structure is the triple (G, (D, ⊓), δ): a set of objects G, a set of possible descriptions D with the meet operation ⊓, and the mapping δ. We also define derivation operators, denoted ⋄, between the set of objects G and the description set D:

A⋄ = ⊓_{g∈A} δ(g) for A ⊆ G,  and  d⋄ = {g ∈ G | d ⊑ δ(g)} for d ∈ D.
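The definitions above can be sketched in code. This is a minimal illustration, not the dissertation's implementation: a description is modeled as a tuple of (low, high) pairs, and all names (meet, subsumed, delta, A_diamond, d_diamond) are illustrative.

```python
from functools import reduce

def meet(d1, d2):
    """Meet ⊓: component-wise convex hull of two interval tuples."""
    return tuple((min(a, c), max(b, e)) for (a, b), (c, e) in zip(d1, d2))

def subsumed(d1, d2):
    """d1 ⊑ d2  ⟺  d1 ⊓ d2 = d1."""
    return meet(d1, d2) == d1

def delta(point):
    """Mapping δ: a K-dimensional point becomes a tuple of degenerate intervals."""
    return tuple((v, v) for v in point)

def A_diamond(A, delta_map):
    """A⋄ = ⊓_{g∈A} δ(g): the tightest interval tuple covering all objects in A."""
    return reduce(meet, (delta_map[g] for g in A))

def d_diamond(d, G, delta_map):
    """d⋄ = {g ∈ G | d ⊑ δ(g)}: all objects whose points fall inside d."""
    return {g for g in G if subsumed(d, delta_map[g])}
```

Note that ⊑ runs "the wider box is the smaller description": d ⊑ δ(g) holds exactly when the point of g lies inside the box d, so d⋄ collects the objects covered by d.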
A description d+ ∈ D is called an α-weak positive premise if:

|d+⋄ ∩ G−| / |G−| ≤ α, and ∃A ⊆ G+: d+ ⊑ A⋄.

A description d− ∈ D is called an α-weak negative premise if:

|d−⋄ ∩ G+| / |G+| ≤ α, and ∃B ⊆ G−: d− ⊑ B⋄.

A query-based classification algorithm ("lazy classification") is introduced. The algorithm takes as input the sets of positive and negative examples (G+ and G−), a set of test objects G_test with their corresponding descriptions, and the mapping δ. The output of the algorithm is a real number Δ ∈ ℝ assigned to each test object g ∈ G_test. This number Δ serves as a credit score and allows one to build cutoff decision rules such as "if Δ > θ then g belongs to the positive class". The idea behind the algorithm is to check whether the test object is more similar to the set of positive or to the set of negative examples.
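A sketch of the α-weak positive premise test under these definitions (the negative case is symmetric, with G+ and G− swapped). One assumption is made here: the existential condition ∃A ⊆ G+: d ⊑ A⋄ is checked by taking A to be the positive extent of d, which is a nonempty witness exactly when one exists; all names are illustrative.

```python
def meet(d1, d2):
    return tuple((min(a, c), max(b, e)) for (a, b), (c, e) in zip(d1, d2))

def subsumed(d1, d2):                 # d1 ⊑ d2  ⟺  d1 ⊓ d2 = d1
    return meet(d1, d2) == d1

def extent(d, objects, delta_map):    # d⋄ restricted to a given object set
    return {g for g in objects if subsumed(d, delta_map[g])}

def is_alpha_weak_positive(d, G_pos, G_neg, delta_map, alpha):
    """|d⋄ ∩ G−|/|G−| ≤ α  and  ∃A ⊆ G+: d ⊑ A⋄."""
    false_cover = len(extent(d, G_neg, delta_map)) / len(G_neg)
    if false_cover > alpha:
        return False
    # Every g with d ⊑ δ(g) satisfies d ⊑ A⋄ for A = {such g}, and conversely
    # any witness A must consist of such objects, so a nonempty extent suffices.
    return bool(extent(d, G_pos, delta_map))
```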
The similarity is defined as the total support of the α-weak positive (negative) premises that contain the description of the test object. The support of an α-weak positive premise d+ is |d+⋄ ∩ G+|, i.e. the number of objects from the set of positive examples G+ satisfying the description d+. The support of an α-weak negative premise d− is |d−⋄ ∩ G−|, i.e. the number of objects from the set of negative examples G− satisfying the description d−. Let there be p α-weak positive premises and n α-weak negative premises, all of which contain the description of the test object g_test, i.e. ∀i = 1, …, p: d_i+ ⊑ δ(g_test) and ∀j = 1, …, n: d_j− ⊑ δ(g_test). The total support of the α-weak positive premises is S+ = Σ_{i=1}^{p} |d_i+⋄ ∩ G+|, and the total support of the α-weak negative premises is S− = Σ_{j=1}^{n} |d_j−⋄ ∩ G−|. Based on the value Δ = S+ − S−, it is estimated whether the test object is more similar to the objects from the set of positive or negative examples; this value serves as a credit score for assessing the borrower's creditworthiness. The dissertation also considers other similarity measures and voting schemes based on α-weak premises (see Section 3.4 of the dissertation).

The algorithm is an iterative procedure and uses three hyperparameters: the subsample size, the number of iterations, and the α-threshold.
The first hyperparameter is the percentage of objects in the set of positive (negative) examples that are randomly extracted at each iteration. At each iteration a subsample is extracted from G− and from G+, and the descriptions of the objects in the subsample are intersected (⊓) with the description of the test object g_test:

d = δ(g_1) ⊓ … ⊓ δ(g_s) ⊓ δ(g_test), where s/|G+| (respectively s/|G−|) equals the subsample size.

The number of times we randomly extract a subsample from the set of examples (the number of iterations) is the second hyperparameter of the algorithm; it is also tuned through grid search.
If d is not an α-weak premise, it is ignored; if d is an α-weak premise, it is saved to be used later in the classification of the test object. These steps are performed for each test object, for the positive and negative sets of examples separately, producing a set of positive and a set of negative α-weak premises. The final output of the algorithm is the difference between the total support of the α-weak positive premises and the total support of the α-weak negative premises for the test object. Based on this output, the Gini coefficient, a model quality metric widely used in credit scoring, is calculated.

The algorithm is tested both on internal data of a top-10 bank and on open Kaggle data. The positive set of examples is a set of loans where the target attribute is present. The target attribute in credit scoring is defined as more than 90 days of delinquency within the first 12 months after loan origination.
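The iterative procedure described above (random subsampling, intersection with the test description, the α-weak filter, and scoring by total support) could be sketched as follows. This is an illustrative reconstruction, not the dissertation's code; descriptions are tuples of (low, high) intervals, and the ∃A condition holds by construction because each d is built as a meet of examples from its own class.

```python
import random
from functools import reduce

def meet(d1, d2):
    return tuple((min(a, c), max(b, e)) for (a, b), (c, e) in zip(d1, d2))

def subsumed(d1, d2):                 # d1 ⊑ d2  ⟺  d1 ⊓ d2 = d1
    return meet(d1, d2) == d1

def support(d, objects, delta_map):   # |d⋄ ∩ objects|
    return sum(subsumed(d, delta_map[g]) for g in objects)

def total_support(g_test, own, other, delta_map, frac, n_iter, alpha, rng):
    """Total support of the α-weak premises generated from one class ('own')."""
    size = max(1, int(frac * len(own)))
    total = 0
    for _ in range(n_iter):
        sample = rng.sample(own, size)
        # d = δ(g_1) ⊓ … ⊓ δ(g_s) ⊓ δ(g_test)
        d = reduce(meet, (delta_map[g] for g in sample), delta_map[g_test])
        # α-weak check: d may cover at most a fraction α of the opposite class
        if support(d, other, delta_map) / len(other) <= alpha:
            total += support(d, own, delta_map)
    return total

def score(g_test, G_pos, G_neg, delta_map, frac, n_iter, alpha, seed=0):
    """Δ = S+ − S−; a cutoff rule such as Δ > θ assigns the positive class."""
    rng = random.Random(seed)
    s_pos = total_support(g_test, G_pos, G_neg, delta_map, frac, n_iter, alpha, rng)
    s_neg = total_support(g_test, G_neg, G_pos, delta_map, frac, n_iter, alpha, rng)
    return s_pos - s_neg
```

With a fixed seed the procedure is reproducible; in practice frac, n_iter and alpha are the three hyperparameters tuned by grid search, as described above.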
Each set of examples consists of 1000 objects so that the voting scheme considered in the second section is applicable. The test dataset consists of 300 objects and is extracted from the same population as the sets of positive and negative examples. The attributes represent various metrics such as loan amount, term, rate, payment-to-income ratio, age of the borrower, undocumented-to-documented income ratio, credit history metrics, etc. The set of attributes used for the lazy classification trials contained 28 numerical attributes. To evaluate the accuracy of the classification, the Gini coefficient is calculated for every combination of hyperparameters based on 300 predictions on the test set. The Gini coefficient is calculated from the margin between the number of objects within positive premises and within negative ones.
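For a score-based classifier the Gini coefficient equals 2·AUC − 1, where AUC is the probability that a randomly chosen positive outranks a randomly chosen negative. A minimal pairwise sketch (illustrative name, ties counted as half a win):

```python
def gini(scores, labels):
    """Gini = 2*AUC - 1 for a real-valued score against binary labels (1/0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return 2.0 * auc - 1.0
```

A perfect ranking gives Gini = 1, a random one gives Gini ≈ 0; here the scores would be the margins Δ computed on the 300 test objects.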
The margin is treated as a measure similar to the score value in credit scorecards. A hyperparameter grid search is performed:

[Table: Gini coefficients for the hyperparameter grid search (QBCA)]
[Table: Gini coefficients for the hyperparameter grid search on a specified area]
[Table: Query-based classification algorithm versus classical models adopted in banks and other benchmarks, top-10 bank data]

As far as open data is concerned, the algorithm was tested on the Kaggle data of the "Give Me Some Credit" contest held in 2012⁴. The data has a binary target variable (class label) indicating whether the borrower defaulted or not.

[Table: Query-based classification algorithm versus benchmarks, Kaggle credit scoring open dataset]

⁴ https://www.kaggle.com/c/GiveMeSomeCredit

Apart from accuracy measures, sensitivity analysis is performed and the properties of the algorithm are analyzed. Visualizations of collections of α-weak premises are also presented, which allow one to interpret the model outcome for the client.
In effect, when deciding the target class label (good or bad) for a test object, the algorithm builds portraits of good and bad clients from historical data in the multi-dimensional feature space. Below, several examples of such two-feature areas are shown for different numbers of iterations:

[Figures: two-feature premise areas for different numbers of iterations]

Positive premises are depicted in red and negative ones in blue. To construct each positive premise, two objects from the set of positive examples were randomly extracted; then the meet operator was applied and a set of intervals was obtained; after that, only the intervals for two features were kept. The same procedure was performed for negative premises. It is argued that this set of areas in the feature space shows why a particular borrower was considered high or low credit risk by the model.

The fourth section contains the second novelty: adaptation of FCA to the regression problem (i.e.