Dissertation (1137066), page 9

Randomized Algorithms Based on Interval Pattern Structures

File No. 1137066, 2019-05-20, StudIzba


Therefore, variables were binned into two to four categories. Examples of variable binning are provided in Fig. 6.

Figure 6. One-Factor Trees for WOE-transformation of Revolving Utilization of Unsecured Lines (left) and Monthly Income (right).

Once the factors had been transformed, the individual Gini coefficients were calculated to assess the predictive power of each factor. We excluded variables that showed a dramatic drop in Gini on the validation sample.
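As an illustration of the WOE transformation mentioned above, the following minimal Python sketch computes a weight of evidence per bin. It assumes the common convention WOE = ln(share of non-defaulters / share of defaulters); the sign convention, the function name `woe_by_bin`, the `smoothing` parameter and the toy data are assumptions made for this example and are not taken from the dissertation.

```python
import math
from collections import defaultdict


def woe_by_bin(bins, targets, smoothing=0.5):
    """Return {bin_label: WOE} with WOE = ln(share of goods / share of bads).

    `targets` uses 1 for default ("bad") and 0 for non-default ("good").
    `smoothing` is an assumed additive correction that avoids log(0)
    when a bin contains only one class.
    """
    good = defaultdict(float)
    bad = defaultdict(float)
    for label, y in zip(bins, targets):
        if y == 1:
            bad[label] += 1
        else:
            good[label] += 1
    labels = set(good) | set(bad)
    total_good = sum(good[b] for b in labels) + smoothing * len(labels)
    total_bad = sum(bad[b] for b in labels) + smoothing * len(labels)
    return {
        b: math.log(((good[b] + smoothing) / total_good)
                    / ((bad[b] + smoothing) / total_bad))
        for b in labels
    }


# Hypothetical toy data: one binned factor and the default flag.
bins = ["low", "low", "low", "high", "high", "high"]
targets = [0, 0, 1, 1, 1, 0]
woe = woe_by_bin(bins, targets)
```

Under this convention a bin with relatively more non-defaulters gets a positive WOE, and each raw value is replaced by the WOE of the bin it falls into.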

The rest were fed to logistic regression, and the final model included the features presented in Table 5.

Table 5. Logistic Regression Output

Feature                                     Estimate  Std. Error  t-stat   P-value
(Intercept)                                 -0.56881  0.05002     -11.371  <2e-16 ***
trscr_RevolvingUtilizationOfUnsecuredLines   0.73361  0.04317      16.992  <2e-16 ***
trscr_age                                    0.39750  0.08257       4.814  1.59e-06 ***
trscr_NumberOfTime3059DaysPastDueNotWorse    0.55770  0.05593       9.971  <2e-16 ***
trscr_NumberOfTime6089DaysPastDueNotWorse    0.44882  0.06373       7.043  2.58e-12 ***

After training the scorecard we applied query-based classification to the validation set.
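To show how the estimates in Table 5 turn into a score, here is a minimal Python sketch of the logistic link applied to the fitted coefficients. The coefficient values are copied from Table 5; the function name `scorecard_probability` and the input values are hypothetical, and the sketch assumes that the `trscr_` features are the already transformed factors and that the output is the probability of whichever class the model was fitted to predict.

```python
import math

# Estimates copied from Table 5.
COEFFICIENTS = {
    "(Intercept)": -0.56881,
    "trscr_RevolvingUtilizationOfUnsecuredLines": 0.73361,
    "trscr_age": 0.39750,
    "trscr_NumberOfTime3059DaysPastDueNotWorse": 0.55770,
    "trscr_NumberOfTime6089DaysPastDueNotWorse": 0.44882,
}


def scorecard_probability(features):
    """Logistic regression output for a dict of transformed feature values.

    Features missing from `features` are treated as 0 (an assumption
    made only to keep the example short).
    """
    z = COEFFICIENTS["(Intercept)"]
    for name, value in features.items():
        z += COEFFICIENTS[name] * value
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs: all factors at 0, then one factor shifted by 1.
baseline = scorecard_probability({})
shifted = scorecard_probability({"trscr_age": 1.0})
```

Because every slope in Table 5 is positive, increasing any transformed factor increases the modelled probability.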

The Query-Based Classification Algorithm has three hyperparameters that are set during tuning: the number of iterations, the alpha level and the subsample size. Finally, the algorithms were compared on a validation set by plotting ROC curves and calculating the Gini coefficients achieved.

Below we present the Gini coefficients obtained for different combinations of hyperparameters (grid search).

Table 6. Gini coefficients for the hyperparameters grid search

Number of   Alpha-     Subsample size
iterations  threshold  1     2     3     4     5     6     7     8     9
100         0.1%       0.64  0.65  0.57  -     -     -     -     -     -
100         0.2%       0.63  0.65  0.62  -     -     -     -     -     -
100         0.3%       0.60  0.65  0.63  -     -     -     -     -     -
100         0.4%       0.60  0.64  0.62  -     -     -     -     -     -
100         0.5%       0.58  0.64  0.63  0.58  0.34  -     -     -     -
100         1%         0.55  0.63  0.63  0.61  0.55  -     -     -     -
100         2%         0.45  0.60  0.63  0.61  0.61  -     -     -     -
100         5%         0.49  0.52  0.60  0.63  0.62  0.61  0.60  0.57  0.54
100         10%        0.34  0.52  0.58  0.60  0.63  0.64  0.61  0.62  0.61
200         0.1%       0.65  0.65  0.63  -     -     -     -     -     -
200         0.2%       0.62  0.66  0.64  -     -     -     -     -     -
200         0.3%       0.61  0.65  0.64  -     -     -     -     -     -
200         0.4%       0.60  0.65  0.64  -     -     -     -     -     -
200         0.5%       0.60  0.64  0.65  0.61  0.47  -     -     -     -
200         1%         0.56  0.64  0.64  0.62  0.60  -     -     -     -
200         2%         0.49  0.61  0.64  0.63  0.61  -     -     -     -
200         5%         0.55  0.60  0.63  0.63  0.63  0.61  0.60  0.58  0.56
200         10%        0.39  0.54  0.59  0.62  0.64  0.64  0.62  0.62  0.62
500         0.1%       0.66  0.66  0.64  -     -     -     -     -     -
500         0.2%       0.64  0.66  0.65  -     -     -     -     -     -
500         0.3%       0.63  0.66  0.65  -     -     -     -     -     -
500         0.4%       0.61  0.66  0.65  -     -     -     -     -     -
500         0.5%       0.65  0.64  0.58  -     -     -     -     -     -
500         10%        0.41  0.55  0.60  0.63  0.65  0.65  0.65  0.63  0.64

Table 7. Experimental results: cross-validation and validation Gini coefficients for 3 models. “Scorecard” stands for logistic regression with WOE-transformed features, and “QBCA” designates the query-based classification algorithm.

metric \ algo  Scorecard  QBCA    Xgboost
Valid. Gini    0.580643   0.6624  0.708

Figure 7. ROC curves for QBCA (left), Scorecard (middle) and Xgboost (right).

Finally, we applied the Xgboost (footnote 2) gradient boosting algorithm to the same data to estimate the classification accuracy achievable with the “black-box” model. The hyperparameters were tuned via 5-fold stratified cross-validation.
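The Gini coefficients used to compare the models are, under the definition usual in credit scoring, a rescaling of the area under the ROC curve: Gini = 2·AUC − 1. This passage does not spell the formula out, so that definition is an assumption here. The minimal Python sketch below computes it from scores and labels; the function names and the toy data are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (pairwise comparison)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini(scores, labels):
    """Gini coefficient under the assumed definition 2 * AUC - 1."""
    return 2.0 * roc_auc(scores, labels) - 1.0


# Hypothetical toy scores: two positives, three negatives.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
toy_gini = gini(scores, labels)
```

The pairwise loop is quadratic in the sample size, which is acceptable for a sketch; a rank-based computation would be preferable on a full validation set.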

The results (cross-validation and validation Gini) for the 3 tested algorithms are given in Table 7. The ROC curves for the validation set are presented in Fig. 7. As we can see, Xgboost performs best in terms of Gini. However, its results are not interpretable, and the best explanation for classification that one can extract from the trained Xgboost model is the estimated feature importance, based on the number of times splits in trees were done with each feature.

In addition to this, we investigated how WOE-transformation of the features affects the quality of the QBCA model and some benchmark models.

In the table below, the prefix pat_ means that the model was trained on three features only: NumberRealEstateLoansOrLines, NumberOfTime60-89DaysPastDueNotWorse, NumberOfDependents.

Footnote 2: https://github.com/dmlc/xgboost

Table 8. Experimental results: cross-validation Gini coefficients of the models that were trained on the features with and without WOE.

             classic_kneigh  classic_logr  classic_ada  classic_gb  classic_rf
with_WoE     0.756867        0.863111      0.861937     0.865125    0.772045
without_WoE  0.534996        0.674328      0.819333     0.85083     0.778043

             pat_kneigh  pat_logr  pat_ada   pat_gb    pat_rf   pat_struct
without_WoE  0.512928    0.793117  0.812721  0.80576   0.78068  0.78392
with_WoE     0.694197    0.814352  0.79913   0.796813  0.70093  0.803396

In contrast to Xgboost, it is interesting to note that certain patterns can be extracted from the QBCA model.

We can observe rules such as the following: if a loan applicant’s age is greater than 50, there was no delinquency in the past, and the overall revolving utilization of unsecured lines was less than 11%, then the probability of default is almost 4 times lower than average.

On the other hand, applicants younger than 30 with revolving utilization of unsecured lines greater than 72% default 1.5 times more frequently than average. This is where we enjoy the advantage of interval pattern structures: they represent rules that can be easily interpreted, and at the same time they make a prediction for each new object in the validation dataset individually, which allows us to improve classification accuracy over the default scorecard model.

3.7 QBCA. Alternative approaches

QBCA with pre-training step

As mentioned above, the steps of the query-based algorithm are performed for each test object for positive and negative examples separately, producing a set of positive and negative α-weak premises.
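The building block behind these premises is the similarity operation on interval descriptions (usually written ⊓): for interval pattern structures it is the component-wise smallest interval that covers both arguments. The minimal Python sketch below shows this operation and a premise check. The criterion in `is_alpha_weak_premise` (the pattern may cover at most a fraction alpha of the opposite class) is an assumption, since the exact definition is not restated in this passage; all function names and data are hypothetical.

```python
def describe(obj):
    """delta(g): a numeric object becomes a tuple of degenerate intervals [x, x]."""
    return tuple((x, x) for x in obj)


def meet(d1, d2):
    """Similarity of two interval descriptions: the component-wise
    smallest interval that contains both."""
    return tuple(
        (min(a_lo, b_lo), max(a_hi, b_hi))
        for (a_lo, a_hi), (b_lo, b_hi) in zip(d1, d2)
    )


def covers(pattern, obj):
    """True if every attribute value of obj lies inside the pattern's interval."""
    return all(lo <= x <= hi for (lo, hi), x in zip(pattern, obj))


def is_alpha_weak_premise(pattern, opposite, alpha):
    """Assumed criterion: the pattern covers at most alpha * |opposite|
    objects of the opposite class."""
    hits = sum(covers(pattern, g) for g in opposite)
    return hits <= alpha * len(opposite)


# Hypothetical objects described by (age, revolving utilization).
g1, g2 = (55, 0.05), (62, 0.10)   # sampled positive examples
g_test = (58, 0.08)               # object to classify
negatives = [(25, 0.80), (57, 0.09), (33, 0.75), (41, 0.60)]

pattern = meet(meet(describe(g1), describe(g2)), describe(g_test))
```

Here `pattern` is ((55, 62), (0.05, 0.10)), which reads directly as a rule on age and utilization; it covers one of the four hypothetical negative examples, so whether it survives depends on the chosen alpha.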

Let’s suppose one positive α-weak premise is extracted. One of the pitfalls is that the description $\delta(g_1) \sqcap \dots \sqcap \delta(g_k)$ can be too generic for objects $g_i$ from the set of positive examples (for example, if one of the objects $g_i$ is an outlier that is very similar to an object from the set of negative examples).

In this case the pattern $\delta(g_1) \sqcap \dots \sqcap \delta(g_k)$ will be falsified as long as the set of objects described by $(\delta(g_1) \sqcap \dots \sqcap \delta(g_k))$ contains a lot of objects of the opposite class. Moreover, when adding a test object $g_\tau$, the description $\delta(g_1) \sqcap \dots \sqcap \delta(g_k) \sqcap \delta(g_\tau)$ would also be falsified.

For this reason, we propose to generate a given number of positive and negative α-weak premises $\delta(g_1) \sqcap \dots \sqcap \delta(g_k)$ beforehand, and only afterwards to add a test object to each of them and check the α-weak premise for stability. That is what we call a pre-training step.

The question is: how many generated α-weak premises do we need? To answer it, we slightly change the meaning of the hyperparameter number of iterations and add a new hyperparameter epsilon.

First of all, we generate α-weak premises number of iterations times. Let’s suppose we got $N_p$ stable α-weak premises.

Then the estimate of the probability "to get a good α-weak premise" is equal to $N_p$ / number of iterations.

Repeating this procedure, we obtain a new probability estimate. We repeat it until the value of the probability estimate becomes stable, in other words until the difference between the estimates on two consecutive iterations becomes less than epsilon. Once the sets of positive and negative α-weak premises have been generated, we add a test object to each of them, calculate the description $\delta(g_1) \sqcap \dots \sqcap \delta(g_k) \sqcap \delta(g_\tau)$, and check whether it is still a premise. In the end we use voting schemes to classify the object.

The key feature of this approach is that, on the one hand, we need much more time to generate enough stable positive and negative α-weak premises. On the other hand, we spend slightly less time classifying objects: once the description $\delta(g_1) \sqcap \dots \sqcap \delta(g_k)$ has been calculated, the description $\delta(g_1) \sqcap \dots \sqcap \delta(g_k) \sqcap \delta(g_\tau)$ is calculated $O(k)$ times faster.

Below we present the Gini coefficients obtained for the two approaches (tested on Kaggle data).

Table 9. Comparison of two QBCA approaches

number_of_iterations  alpha  subsample_size  Gini_QBCA_classic  Gini_QBCA_alter
500                   0.001  1               0.660394           0.659123
500                   0.001  2               0.6704             0.670033
500                   0.001  3               0.645417           0.674244
500                   0.002  1               0.640481           0.639438
500                   0.002  2               0.665127           0.667139
500                   0.002  3               0.65543            0.668916
500                   0.003  1               0.633213           0.632586
500                   0.003  2               0.663457           0.665897
500                   0.003  3               0.654224           0.664087
500                   0.004  1               0.618763           0.624069
500                   0.004  2               0.662123           0.663919
500                   0.004  3               0.652418           0.663178
500                   0.005  3               0.652698           0.661363
500                   0.005  4               0.647174           0.663276
500                   0.005  5               0.583273           0.673237
500                   0.006  3               0.652542           0.659055
500                   0.006  4               0.652701           0.657168
500                   0.006  5               0.608366           0.669681

According to the results, the alternative approach works better in most cases (14 of the 18 configurations in Table 9).

QBCA with target ratio based premises

Another approach to mining α-weak premises is to account not only for a threshold on the number of objects of the opposite class, together with the requirement that at least some objects of the target class satisfy the premise, but also for the target ratio within the premise.

This idea requires mining premises whose ratio of positive to negative examples differs considerably from the ratio of positive to negative objects in the initial data. To implement the approach, the α-weak premise has to be redefined. An alternative positive α-weak premise is defined as a description $d$ that satisfies:

$$\frac{|d \cap G_+|}{|d \cap G_-|} > \alpha \cdot \frac{|G_+|}{|G_-|}, \quad \alpha > 1$$

Correspondingly, a negative α-weak premise satisfies:

$$\frac{|d \cap G_-|}{|d \cap G_+|} > \alpha \cdot \frac{|G_-|}{|G_+|}, \quad \alpha > 1$$

To test the proposed variation of the initial approach, we also examined several voting schemes, as discussed before.
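The two inequalities above can be checked directly from four counts. The minimal Python sketch below does this by cross-multiplication, so that a description covering no objects of the opposite class does not cause a division by zero; treating that case as satisfying the inequality is a convention assumed here, not something stated in the text. The function name, its parameters and the counts are hypothetical.

```python
def is_ratio_premise(pos_in_d, neg_in_d, n_pos, n_neg, alpha, positive=True):
    """Check the target-ratio condition for a description d.

    pos_in_d, neg_in_d: numbers of positive / negative examples covered by d.
    n_pos, n_neg: sizes of the positive / negative example sets.
    The ratio inequality is evaluated in cross-multiplied form.
    """
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    if positive:
        return pos_in_d * n_neg > alpha * n_pos * neg_in_d
    return neg_in_d * n_pos > alpha * n_neg * pos_in_d


# Hypothetical counts: 100 positive and 900 negative examples overall,
# and a description covering 30 positives and 90 negatives.
passes_alpha_2 = is_ratio_premise(30, 90, 100, 900, alpha=2)
passes_alpha_4 = is_ratio_premise(30, 90, 100, 900, alpha=4)
```

With these counts the ratio inside the description is 30/90 ≈ 0.33 against 100/900 ≈ 0.11 overall, so the description qualifies as a positive premise for alpha = 2 but not for alpha = 4.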

The algorithm was assessed with the Gini coefficient on the test dataset for every voting scheme:

• Gini1 - standard scheme. The numbers of positive and negative α-weak premises and their difference are considered.
• Gini2 - support based scheme. For positive α-weak rules, the number of positive objects is summed; for negative rules, the number of negative objects. The difference between the two sums produces the margin or "probability" of belonging to the positive class.
• Gini3 - full-support based scheme. The number of positive objects in both positive and negative α-weak premises is summed.
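The three voting schemes above can be sketched as scoring functions over the premises that remain valid for a test object. This is a minimal Python illustration under stated assumptions: the `Premise` record and its field names are hypothetical, and the third function implements only what the text specifies for the full-support scheme (a sum of positive-object counts), without any normalisation the original may apply.

```python
from dataclasses import dataclass


@dataclass
class Premise:
    pos_support: int  # positive training objects covered by the premise
    neg_support: int  # negative training objects covered by the premise


def score_standard(pos_premises, neg_premises):
    """Gini1: difference between the numbers of positive and negative premises."""
    return len(pos_premises) - len(neg_premises)


def score_support(pos_premises, neg_premises):
    """Gini2: positive objects under positive premises minus
    negative objects under negative premises."""
    return (sum(p.pos_support for p in pos_premises)
            - sum(p.neg_support for p in neg_premises))


def score_full_support(pos_premises, neg_premises):
    """Gini3, as far as the text describes it: positive objects summed
    over both positive and negative premises."""
    return sum(p.pos_support for p in list(pos_premises) + list(neg_premises))


# Hypothetical premises that remained valid after adding a test object.
pos = [Premise(pos_support=40, neg_support=1), Premise(pos_support=25, neg_support=0)]
neg = [Premise(pos_support=2, neg_support=60)]
votes = (score_standard(pos, neg), score_support(pos, neg), score_full_support(pos, neg))
```

Each function returns a raw score per test object; ranking the test objects by such a score is what allows a Gini coefficient to be computed for each scheme.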

Document details

File type: PDF
Size: 3.66 MB
Title: Randomized Algorithms Based on Interval Pattern Structures
Material type: Candidate of Sciences dissertation
Subject: Technical sciences
Institution: HSE University (НИУ ВШЭ)
