Dissertation (1137066), page 8

Randomized algorithms based on interval pattern structures


By increasing the number of premises used for classification according to the voting scheme, we are likely to capture the structure of the data in more detail.

However, the number of iterations is not the only driver of classification accuracy in our case. We find a region with relatively high Gini in the area of a mild alpha-threshold and a relatively high subsample size.

This also seems natural, since the support of a good predictive rule (i.e. a premise) is expected to be higher than its support within the set of opposite examples. We elaborate further and run an additional grid search over the range of hyperparameters that provides a high Gini coefficient:

Table 2. Gini coefficients for the hyperparameter grid search on the specified area (alpha-threshold = 0.3%)

Number of      Subsample size
iterations     1.0%    1.1%    1.2%    1.3%    1.4%    1.5%
500            51%     49%     48%     43%     41%     38%
1000           52%     51%     48%     45%     43%     39%
2000           54%     53%     49%     47%     46%     38%
5000           55%     52%     50%     47%     46%     40%
10000          56%     53%     50%     47%     47%     40%
20000          55%     53%     51%     46%     48%     41%

According to the grid search, the highest Gini on the test sample (55%-56%) is attained at the following hyperparameter values: alpha-threshold = 0.3%, number of iterations = 10000, subsample size = 1.0%.

The result was compared to three benchmarks that are traditionally used for credit scoring within the banking system: logistic regression, a scorecard and a decision tree.

It should be clarified what is meant by the scorecard classifier.

The mathematical architecture of the scorecard is based on logistic regression, which takes the transformed variables as input. The transformation typically applied to the initial variables is the WOE (weight of evidence) transformation [63]. It is widespread in credit scoring to apply such a transformation to the input variables, since it accounts for non-linear dependencies and also provides a certain robustness against potential outliers.

The aim of the transformation is to divide each variable into no more than k categories. The thresholds are derived so as to maximize the information value of the variable [63]. Once each variable is binned into categories, the log-odds ratio is calculated for each category. Finally, instead of the initial variables, the discrete-valued variables are used as input to the logistic regression.

The properties of the decision tree were as follows: we ran CART with two possible child nodes for each parent node. The criterion for choosing the optimal threshold was the greatest entropy reduction. The number of terminal nodes was not explicitly restricted; however, the minimum size of a terminal node was set to 50.

As far as logistic regression is concerned, variable selection was performed using a stepwise approach [64].
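The WOE-transformation described above can be illustrated with a minimal sketch. It assumes that the bin thresholds are already given (the search that maximizes information value is not shown), that WOE is taken as the logarithm of the share of goods over the share of bads in a bin, and that a small additive smoothing term `eps` guards against empty bins. The function names `woe_table` and `apply_woe`, the label coding (1 = bad, 0 = good) and the smoothing are illustrative assumptions, not details taken from the thesis.

```python
import math

def bin_index(value, thresholds):
    """Index of the bin a value falls into, given sorted cut points."""
    return sum(value > t for t in thresholds)

def woe_table(values, labels, thresholds, eps=0.5):
    """Per-bin WOE and total information value (IV) of one numeric variable.

    labels: 1 = bad loan, 0 = good loan (assumed coding).
    eps: additive smoothing for empty bins (an assumption of this sketch).
    """
    n_bins = len(thresholds) + 1
    good = [0] * n_bins
    bad = [0] * n_bins
    for v, y in zip(values, labels):
        if y == 1:
            bad[bin_index(v, thresholds)] += 1
        else:
            good[bin_index(v, thresholds)] += 1
    total_good, total_bad = sum(good), sum(bad)
    woe, iv = [], 0.0
    for g, b in zip(good, bad):
        share_good = (g + eps) / (total_good + eps * n_bins)
        share_bad = (b + eps) / (total_bad + eps * n_bins)
        w = math.log(share_good / share_bad)
        woe.append(w)
        iv += (share_good - share_bad) * w
    return woe, iv

def apply_woe(values, thresholds, woe):
    """Replace each raw value by the WOE of its bin."""
    return [woe[bin_index(v, thresholds)] for v in values]

ages = [20, 25, 30, 40, 50, 60]
is_bad = [1, 1, 1, 0, 0, 0]
woe, iv = woe_table(ages, is_bad, thresholds=[35])
transformed = apply_woe(ages, [35], woe)
```

The transformed values, rather than the raw ones, would then be fed to the logistic regression; a variable with a low IV would be a candidate for exclusion at the initial selection step.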

As for the scorecard, the variables were initially selected based on their information value after the WOE-transformation. The comparison of classifier performance on the test sample of 300 objects is given in Table 3.

Table 3. Query-based classification algorithm versus models adopted in the bank

Model                                               Gini on test sample
Logistic regression                                 47.38%
Scorecard (logistic based on WOE-transformation)    51.89%
CART (minsize = 50)                                 54.75%
QBCA (s = 1%, a = 0.3%, n = 10000)                  56.30%
AdaBoostClassifier                                  54.72%
KNeighborsClassifier                                44.00%
NaiveBayes                                          48.91%
RandomForestClassifier                              53.42%

When dealing with large numerical datasets, lazy classification may be preferable to classification based on explicitly generated classifiers, since it requires less time and memory [53].

However, the original lazy classification with pattern structures meets a certain limitation in the case of a high-dimensional numerical feature space. When intersecting the description of a test object with that of every object from the set of examples, one is likely to obtain premises whose image consists only of those two objects. In other words, the premises tend to be very specific and, therefore, the number of positive and negative premises is likely to be equal to the number of objects.

Weighting cannot be considered helpful in this case, since the premises will have very low support.

We modified the original lazy classification setting by making it, in fact, a randomized procedure with three hyperparameters: subsample size, number of iterations and alpha-threshold. To this end, we defined an α-weak premise as a premise that is falsified by less than an α share of the examples of the opposite class.

In effect, the modified algorithm mines α-weak premises with relatively high support, which are then used to classify the test object. The classification is carried out according to a predefined voting scheme.

We applied the introduced procedure to the retail loan classification problem. The data were provided during a pilot project with one of the top-10 banks in Russia; the details are not disclosed due to a non-disclosure agreement.
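The randomized procedure with α-weak premises described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis implementation: a premise is taken to be the per-attribute interval hull of a random subsample of one class together with the test object, an example of the opposite class "falsifies" the premise when it falls inside that hull, the subsample size is given as an absolute count rather than a percentage, and the voting scheme is assumed to be a plain count of accepted premises per class. All function names are hypothetical.

```python
import random

def interval_hull(objects):
    """Per-attribute (min, max) intervals covering all given objects."""
    return [(min(col), max(col)) for col in zip(*objects)]

def covers(hull, obj):
    """True if every attribute of obj lies inside the matching interval."""
    return all(lo <= x <= hi for (lo, hi), x in zip(hull, obj))

def count_weak_premises(test_obj, own, opposite,
                        subsample_size, n_iter, alpha, rng):
    """Count alpha-weak premises generated from the `own` class."""
    accepted = 0
    for _ in range(n_iter):
        sample = rng.sample(own, subsample_size)
        hull = interval_hull(sample + [test_obj])
        # Share of opposite-class examples that fall inside the premise.
        falsified = sum(covers(hull, o) for o in opposite) / len(opposite)
        if falsified < alpha:
            accepted += 1
    return accepted

def vote(test_obj, positives, negatives,
         subsample_size, n_iter, alpha, seed=0):
    """Return (positive votes, negative votes) for the test object."""
    rng = random.Random(seed)
    pos = count_weak_premises(test_obj, positives, negatives,
                              subsample_size, n_iter, alpha, rng)
    neg = count_weak_premises(test_obj, negatives, positives,
                              subsample_size, n_iter, alpha, rng)
    return pos, neg

positives = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]
negatives = [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3)]
votes = vote((1.05, 1.0), positives, negatives,
             subsample_size=2, n_iter=50, alpha=0.25)
```

In this toy data the test object lies inside the positive cluster, so every positive premise is accepted, while every negative premise is stretched far enough to cover a positive example and is rejected. A score for the Gini calculation could be derived from the two vote counts, for example as their difference; that choice is also an assumption.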

The sets of positive and negative examples each contained 1000 objects with 28 numerical attributes. The accuracy of the algorithm was evaluated on a test dataset consisting of 300 objects. The Gini coefficient was chosen as the accuracy metric. We performed a basic grid search by running the query-based classification algorithm with different hyperparameter values. The algorithm's Gini coefficient was compared to that of the models conventionally used in the bank. The benchmark models were logistic regression, a scorecard and a decision tree.

The proposed algorithm outperforms the logistic regression and the scorecard with the subsample size around 1%, the alpha-threshold equal to 0.3% and the number of iterations over 5000. The performance of the decision tree is comparable to that of the proposed algorithm; however, the query-based classification is slightly better in terms of the Gini coefficient.

3.6 Experiments with open data

We decided to use an open dataset devoted to credit scoring. We considered the “Give Me Some Credit” contest held in 2012.¹

The data has a binary target variable (class label) indicating whether the borrower defaulted or not. However, it is not specified whether the default event was ordinary or fraudulent. We develop a scorecard and examine its accuracy via out-of-sample validation with the provided target variable. The validation process requires calculating performance metrics (ROC AUC and Gini coefficient) of the model on a data sample that was drawn from the same distribution but was not used to develop the model itself.

This approach allows the user to check the accuracy and stability of the model. In order to train the models, we extracted 1000 good loans and 1000 bad loans. The size of the validation set was 300 observations. All these observations were randomly extracted from the contest dataset.

¹ https://www.kaggle.com/c/GiveMeSomeCredit
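The two validation metrics mentioned above are linked by the standard relation Gini = 2 · AUC − 1. The sketch below computes both on a held-out sample using the pairwise definition of AUC (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half). The function names and the toy scores are illustrative assumptions; the quadratic loop is adequate for a validation set of a few hundred observations.

```python
def roc_auc(scores, labels):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def gini(scores, labels):
    """Gini coefficient derived from AUC."""
    return 2.0 * roc_auc(scores, labels) - 1.0

# Toy validation sample: model scores and true default flags.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
auc_value = roc_auc(scores, labels)
gini_value = gini(scores, labels)
```

Here three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75 and the Gini coefficient is 0.5.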

Our aim was to compare a classical scorecard with black-box models such as boosting and with the query-based classification approach based on interval patterns. The features for loan default prediction are presented in Table 4:

Table 4. Kaggle Data Description

Variable Name                           Type        Description
SeriousDlqin2yrs                        Y/N         Person experienced 90 days past due delinquency or worse
RevolvingUtilizationOfUnsecuredLines    percentage  Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits
age                                     integer     Age of borrower in years
NumberOfTime30-59DaysPastDueNotWorse    integer     Number of times borrower has been 30-59 days past due but no worse in the last 2 years
DebtRatio                               percentage  Monthly debt payments, alimony, living costs divided by monthly gross income
MonthlyIncome                           real        Monthly income
NumberOfOpenCreditLinesAndLoans         integer     Number of open loans (installment like car loan or mortgage) and lines of credit (e.g. credit cards)
NumberOfTimes90DaysLate                 integer     Number of times borrower has been 90 days or more past due
NumberRealEstateLoansOrLines            integer     Number of mortgage and real estate loans including home equity lines of credit
NumberOfTime60-89DaysPastDueNotWorse    integer     Number of times borrower has been 60-89 days past due but no worse in the last 2 years
NumberOfDependents                      integer     Number of dependents in family excluding themselves (spouse, children etc.)

First, we concluded that the variable distributions might not be well suited to tree-like transformations.

The values of the features are evenly distributed across wide ranges for both good and bad loans; therefore, applying a cutpoint does not distinguish well among loan applicants. Examples of such distributions are presented below:

Figure 5. Age distribution by goods and bads (left), number of open credit lines and loans by goods and bads (middle), and monthly applicant income by goods and bads (right).

In order to build the scorecard, we applied the WOE-transformation to the variables on the training sample (using the rpart and smbinning packages in R). The WOE-transformation was controlled for the maximum number of observations in the final nodes of the one-factor trees in order to avoid overfitting at the starting point.

File details

File type: PDF
Size: 3.66 MB
Material: Randomized algorithms based on interval pattern structures
Material type: Candidate of Sciences dissertation
Subject: Technical sciences
Institution: НИУ ВШЭ (HSE University)
