Randomized Algorithms Based on Interval Pattern Structures

Also, the bank would prefer to maintain relations with the client if the financial distress is temporary. So the decision whether or not to launch a default strategy is based to a greater extent on recovery expectations. This makes the problem of recovery prediction crucial for banking decision making. The recovery rate is a number between zero and one that reflects the share of the current exposure which the client is going to pay back over some time horizon. If the expected recovery rate is high, the bank would prefer restructuring; otherwise it would go to court.

In this paper we use financial data from the balance sheets and profit and loss statements of 612 corporate clients of a top-10 Russian bank.

Among other factors, we used the assets-to-liabilities ratio, the debt-to-equity ratio, earnings before taxes and interest payments, return on assets, etc. These clients were assessed at the time of early insolvency signals, and the resulting recovery rate was collected.

The data was randomly divided into two parts, with 70% of the observations in one part and 30% in the other. The larger part was used as the set of examples for the algorithm, and the 30% part was used as a test set to evaluate the predictions and their accuracy. The same partition was used to run random forests with different tunings, with the 70% part as the training set and the other part as the test set. For random forests, three parameters were tuned by grid search: minimum node size, number of trees, and number of feasible variables.

The accuracy of predictions was evaluated in terms of mean absolute deviation (MAD):

$$\mathrm{MAD} = \frac{\sum_{i=1}^{N} |y_i - \hat{y}_i|}{N}$$

where $y_i$ is the target attribute (recovery rate) of the $i$-th client in the test set, $\hat{y}_i$ is its prediction, and $N$ is the number of clients in the test set.

The random forests were run with the following parameter grid: minimum node size ranging from 30 to 100 with an increment of 10; number of trees taking the values 10, 30, 50 and 100; and number of feasible variables ranging from 5 to 45 with an increment of 5.

As far as query-based regression is concerned, we tuned seven parameters, four of them continuous and three Boolean. The continuous parameters took the following values:

Subsample size: 0.01, 0.02, 0.03, 0.04, 0.05, 0.1.
Number of iterations: 100, 500, 1000, 2000.
Alpha threshold: 0, 0.05, 0.01, 0.015, 0.02.
Allowed dropout: 0, 0.1, 0.5, 1, 1.5.

For each combination of parameters we calculated MAD on the test set, which in effect produced metadata for the analysis. We thus obtained MAD distributions, which at the first step led us to prefer the weighted median forecast to the weighted average, since the MAD distributions for the latter took dramatically higher values, which is of course undesirable.

When building a new algorithm one has some intuition about its mechanism, and we performed a regression analysis of algorithm accuracy versus parameter values to check that intuition.
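To make the tuning procedure concrete, the following is a minimal Python sketch of the MAD metric, the two candidate aggregation rules (weighted average and weighted median), and the enumeration of the parameter grid. The function names and dictionary keys are illustrative assumptions, the names of the three Boolean options are taken from Table 10, and the alpha-threshold values are copied exactly as printed above. It is not the implementation used in the experiments.

```python
from itertools import product


def mad(y_true, y_pred):
    """Mean absolute deviation between targets and predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def weighted_average(values, weights):
    """Weighted mean of the values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    if not values:
        raise ValueError("values must be non-empty")
    half = sum(weights) / 2.0
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value
    return max(values)  # only reached through floating-point round-off


# Hypothetical key names; value lists follow the text above.
GRID = {
    "subsample_size": [0.01, 0.02, 0.03, 0.04, 0.05, 0.1],
    "n_iterations": [100, 500, 1000, 2000],
    "alpha_threshold": [0, 0.05, 0.01, 0.015, 0.02],  # as printed in the text
    "allowed_dropout": [0, 0.1, 0.5, 1, 1.5],
    "capped": [False, True],
    "account_for_anti_support": [False, True],
    "penalty_for_high_deviation": [False, True],
}


def iter_grid(grid):
    """Yield one dict per combination of parameter values."""
    names = list(grid)
    for combo in product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


if __name__ == "__main__":
    print(len(list(iter_grid(GRID))), "parameter combinations")
    # One outlying target pulls the weighted average, not the weighted median.
    print(weighted_average([0.0, 0.1, 1.0], [1, 1, 1]))
    print(weighted_median([0.0, 0.1, 1.0], [1, 1, 1]))
```

Under these assumptions, each dictionary yielded by `iter_grid` would be passed to one run of the algorithm, and `mad` evaluated on the test set would give one point of the MAD distribution.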

The analysis was also important for determining a better parameter tuning and for explaining the variation in prediction accuracy. The results of the regression are presented below.

Table 10. Regression analysis of the dependency between MAD and algorithm parameters

Coefficient                   Estimate   Std. Error   t         p-value
(Intercept)                    0.3288    0.0006       519.4     0.0000
Subsample size                 0.0155    0.0031         4.940   0.0000
Number of iterations          -0.0004    0.0000       -18.05    0.0000
Alpha-threshold               -0.0457    0.0270        -1.695   0.0903
Allowed dropout               -0.0011    0.0004        -2.975   0.0030
Capped                        -0.0022    0.0004        -5.401   0.0000
Account for anti-support       0.0002    0.0004         0.624   0.5329
Penalty for high deviation     0.0010    0.0004         2.433   0.0150

We see that increasing the number of iterations, allowing dropouts and enabling the Capped option improve the algorithm's performance, since the corresponding coefficients are negative and significant: the overall prediction error decreases as those factors increase.
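For reference, the specification behind Table 10 can be written out explicitly. The form below is an assumption: the text does not state the estimation method, so this sketch supposes an ordinary least-squares model in which each parameter combination $j$ is one observation and the three Boolean options are coded as 0/1 indicators. The symbols $s, n, \alpha, d, c, a, p$ are introduced here only for illustration.

```latex
\mathrm{MAD}_j = \beta_0 + \beta_1 s_j + \beta_2 n_j + \beta_3 \alpha_j + \beta_4 d_j
               + \beta_5 c_j + \beta_6 a_j + \beta_7 p_j + \varepsilon_j ,
\qquad
t_k = \frac{\hat{\beta}_k}{\mathrm{SE}(\hat{\beta}_k)}
```

Here $s_j$ is the subsample size, $n_j$ the number of iterations, $\alpha_j$ the alpha-threshold, $d_j$ the allowed dropout, and $c_j$, $a_j$, $p_j$ the indicators for Capped, Account for anti-support and Penalty for high deviation. As a check on the $t$ column, the alpha-threshold row gives $-0.0457 / 0.0270 \approx -1.69$, which agrees with the reported value up to the rounding of the displayed estimates.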

Surprisingly, the account for anti-support and penalty for high deviation parameters do not show a significant improvement in accuracy. We also expected some non-linear dependencies between MAD and the parameter values since, intuitively, there has to be an optimal subsample size of randomly extracted objects. Therefore, we support the regression output with one-factor scatter plots of the average MAD across all other iterations versus each parameter.

Figure 25. Single-factor analysis of average MAD versus parameter value: continuous and Boolean parameters

As expected, there is a local minimum for the size of the subsample extracted from the knowledge base G. This is quite natural: as the subsample size grows, the intersection of the subsample with a test object results in a generic description, which is very likely to be falsified by objects whose target attribute value lies outside the target range of the premise description.

Figure 26. MAD distribution shows that query-based regression allows one to obtain a prediction error relatively lower than that of the random forest tunings

According to the grid search performed, the range with the lowest MAD (0.247 to 0.290) on the test sample is achieved in the following parameter area: alpha-threshold = 1.5%, number of iterations = 10, subsample size = 1%, allowed dropout = 0.1. The result was compared to benchmarks represented by random forest tunings.

Figure 27. MAD distribution of query-based regression versus the best tuning for random forest and the naive model MAD

Some benchmark algorithms for the regression problem:

Table 11. Models adopted in the bank

Model                                             MAD value
Naive model (median value for all test objects)   0.35
LinearRegression                                  0.30
DecisionTreeRegressor                             0.28
Random Forest                                     0.24
AdaBoostRegressor                                 0.28
GradientBoostingRegressor                         0.25

We applied the algorithm to delinquent corporate client loans in order to predict the recovery rate for each loan. The data we used comes from a pilot project with one of the top-10 banks in Russia. Mean absolute deviation was chosen as the accuracy metric of the algorithm.
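The evaluation protocol behind the naive benchmark in Table 11 can be illustrated with a short Python sketch: a random 70/30 split followed by a constant prediction scored with MAD. Several details are assumptions not fixed by the text: the rounding rule for the split, the seed, and the choice to take the median over the training targets. The toy recovery rates are invented for illustration and are not the bank data.

```python
import random
from statistics import median


def split_70_30(items, seed=0):
    """Randomly split items into a 70% part and a 30% part (assumed rounding)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def mad(y_true, y_pred):
    """Mean absolute deviation between targets and predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def naive_median_predictions(y_train, n_test):
    """Predict the same value, the median training target, for every test object."""
    return [median(y_train)] * n_test


if __name__ == "__main__":
    # With 612 clients and this rounding rule the parts have 428 and 184 objects.
    train_ids, test_ids = split_70_30(range(612))
    print(len(train_ids), len(test_ids))

    # Toy recovery rates, invented for illustration only.
    y_train = [0.0, 0.1, 0.4, 0.9, 1.0]
    y_test = [0.2, 0.6, 1.0]
    y_pred = naive_median_predictions(y_train, len(y_test))
    print(mad(y_test, y_pred))
```

Any model whose test-set MAD is not below that of such a constant prediction adds no information beyond the central tendency of the target, which is why the naive model serves as the floor for the comparison.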

We performed a simple grid search by running the algorithm with different parameter values and chose the tuning with the lowest value of the metric.

The prediction accuracy of the algorithm was compared to benchmarks represented by random forests, since their predictions are also based on a combination of simple rules. The proposed query-based regression algorithm showed comparable quality in the majority of runs, and in a certain parameter area it outperformed random forests. However, it has to be mentioned that the number of parameters is greater in our algorithm, which in effect results in greater algorithm complexity and more degrees of freedom.

As an area for further research, one can consider keeping the density function h not only for the target attribute in premises, but also for the explanatory attributes.

It can be expected that if the premises are mined not only on the basis of the allowed dropout and alpha-threshold parameters, but also on the basis of some properties of the attribute distributions, then the premises will be more relevant for the test objects and will produce more accurate predictions of the target attribute.

5 Conclusion

The key feature of risk management practice is that, regardless of model accuracy, it must keep interpretability.

In this work we compared three basic approaches to modeling the probability of default in the problem of credit scoring. The first was the classical scorecard method, which is easily interpretable but provides limited predictive accuracy. The second was a query-based classification algorithm on interval pattern structures, which provides higher predictive performance while keeping interpretability clear. The third was a black-box algorithm represented by XGBoost, which showed the best predictive ability but did not allow one to extract interesting client insights from the data. Therefore, we argue that FCA-based classification algorithms can compete with the ordinary statistical instruments adopted in banks and still provide the sets of rules that were relevant for a particular loan applicant.

Formal concept analysis offers attractive instruments for extracting knowledge from data, since the intents of concepts can be considered as association rules.

FCA-based algorithms are suitable for predictive modeling in areas where clarity of model interpretation is a high priority.

In the fourth section, we adjusted the query-based classification algorithm so that it can perform continuous predictions. The adjustment required a new definition of an augmented interval pattern structure.
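As a rough illustration of how a prediction can be read off an interval description, the Python sketch below intersects a test object with a subsample using the standard meet of interval pattern structures (the component-wise smallest interval covering both arguments) and then takes the median target of the training objects that fall inside the result. It is a deliberately simplified sketch, not the adjusted algorithm itself: random subsampling over many iterations, the alpha-threshold, the allowed dropout and any weighting of premises are omitted, and all function names and toy values are assumptions.

```python
from statistics import median


def describe(obj):
    """Point description of a numeric object: each value x becomes [x, x]."""
    return [(x, x) for x in obj]


def meet(pattern_a, pattern_b):
    """Component-wise smallest interval covering both interval descriptions."""
    return [
        (min(lo_a, lo_b), max(hi_a, hi_b))
        for (lo_a, hi_a), (lo_b, hi_b) in zip(pattern_a, pattern_b)
    ]


def covers(pattern, obj):
    """True if every attribute value of obj lies inside the matching interval."""
    return all(lo <= x <= hi for (lo, hi), x in zip(pattern, obj))


def predict_from_subsample(test_obj, subsample, train_x, train_y):
    """Intersect the test object with a subsample and return the pattern
    together with the median target of the training objects it covers."""
    pattern = describe(test_obj)
    for obj in subsample:
        pattern = meet(pattern, describe(obj))
    targets = [y for x, y in zip(train_x, train_y) if covers(pattern, x)]
    return pattern, median(targets)


if __name__ == "__main__":
    # Toy data invented for illustration: two attributes, recovery rate target.
    train_x = [(1.0, 10.0), (2.0, 12.0), (3.0, 30.0), (8.0, 11.0)]
    train_y = [0.2, 0.4, 0.9, 0.1]
    pattern, prediction = predict_from_subsample(
        (1.5, 11.0), [train_x[0], train_x[1]], train_x, train_y
    )
    print(pattern, prediction)
```

The sketch also shows why large subsamples hurt: every additional object can only widen the intervals, so the description becomes more generic and covers training objects with increasingly dissimilar target values.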

In effect, the adjusted algorithm mines the premises (with the expected distribution of the target attribute) that are relevant to the test object, and the prediction is then based on the target attribute distribution, e.g. on the median of the distribution.

6 Acknowledgements

• First and foremost I want to thank my scientific advisor, Doctor of Science, Prof. Sergei O. Kuznetsov. It has been an honor to be his Ph.D. student. He has taught me, both consciously and unconsciously, absolutely new ways to look at common data analysis problems. I appreciate all his contributions of time and ideas to make my Ph.D. experience productive;
• This work would not have been possible without Yury Kashnitsky, who gave me a sound introduction to the formal concept analysis tool set and who has become one of my co-authors and contributors to my work;
• Also, I thank Alexander Ageev, currently a master's student at HSE, who has helped me a lot with additional data experiments and with his own Python implementation of the algorithms;
• I thank Ivan Medvedev and Evgeny Zinchenko, who became my first boss and mentor in the area of risk management at RCI Banque and who sparked my interest in risk modeling;
• I give special thanks to my current boss Roman Tikhonov, Head of the Validation department at Sberbank, who has provided me with the opportunity to devote considerable time to the Ph.D. thesis, including academic work and conference attendance;
• Last but not least, I would like to thank my family: my parents, my brother and my sweetheart, for their unconditional support and for their understanding of the lack of attention from my side.

Material type: Candidate of Sciences dissertation. Subject: technical sciences. University: HSE University (НИУ ВШЭ).
