Dissertation (1137066), page 10

Randomized Algorithms Based on Interval Pattern Structures (file no. 1137066, dated 2019-05-20, hosted on СтудИзба)

Then the number of negative objects is summed. The difference between the two sums gives the margin, or "probability", of belonging to the positive class.

• Gini4: ratio-based scheme. This voting scheme is the same as scheme 3, except that the final margin, or "probability", is the ratio of positive objects to the total number of objects within the α-weak premises.

Figure 8 and the figures that follow show heatmaps produced by thousands of launches of the algorithm with the different voting schemes.
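The two ways of turning the positive and negative sums into a margin can be sketched as follows. This is a minimal illustration that assumes the two sums have already been computed; the function names and the 0.5 fallback for an empty count are assumptions made here, not part of the original text.

```python
def difference_margin(n_positive: int, n_negative: int) -> int:
    """Absolute-difference scheme: positive sum minus negative sum."""
    return n_positive - n_negative


def ratio_margin(n_positive: int, n_negative: int) -> float:
    """Ratio scheme: share of positive objects among all counted objects."""
    total = n_positive + n_negative
    if total == 0:
        # Assumption: with no supporting objects at all, return a neutral score.
        return 0.5
    return n_positive / total
```

The difference is unbounded and grows with the number of counted objects, while the ratio always lies between 0 and 1, which may be one reason the two schemes respond differently to the subsample size.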

Each cell of a heatmap represents the Gini value on the test dataset for a given pair of hyperparameter values: subsample size (columns) and α-threshold (rows). There is a certain region of hyperparameter values that yields the optimal Gini on the test sample. The maximum Gini levels vary from 61% to 65% across the voting schemes. Note that for ratio-based voting it is optimal to use a small subsample size at the mining step, whereas for voting based on absolute differences the subsample size has to be considerably larger.

It is also found that an appropriate subsample size maximizes the Gini estimates, as the average estimates in Figure 12 demonstrate.

QBCA with uniform sampling

In the query-based classification algorithm we previously fixed the subsample size hyperparameter within each launch.

One can argue that this actually limits our randomization procedure, since we restrict ourselves to mining only premises that are produced by exactly this number of objects, i.e. the subsample size value.

It may be useful to relax this restriction and allow the subsample size to vary within each launch. However, since we already vary this hyperparameter across launches, the effect is not expected to be dramatic.

Below, two results are compared: one with the hyperparameter fixed within each launch, and the other with so-called uniform sampling.

When mining α-weak premises under uniform sampling, one performs an equal number of iterations with 1, 2, ..., s objects extracted from the set of positive (negative) examples, where s is the subsample size value.

Figure 15 below shows that the adjusted procedure does not give a systematic improvement in classification accuracy, while it requires more computational time.
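The uniform sampling schedule described above can be sketched as follows. This is an illustrative sketch only: the function names, the seed parameter, and the ordering of sizes within a launch are assumptions, and the subsequent mining of premises from each subsample is omitted.

```python
import random


def uniform_size_schedule(max_subsample_size: int, iterations_per_size: int) -> list[int]:
    """Subsample size for each iteration: sizes 1..s, each used equally often."""
    return [
        size
        for size in range(1, max_subsample_size + 1)
        for _ in range(iterations_per_size)
    ]


def draw_subsamples(examples: list, max_subsample_size: int,
                    iterations_per_size: int, seed: int = 0) -> list[list]:
    """Draw one random subsample (without replacement) per scheduled iteration."""
    rng = random.Random(seed)
    schedule = uniform_size_schedule(max_subsample_size, iterations_per_size)
    return [rng.sample(examples, size) for size in schedule]
```

With s = 3 and two iterations per size, the schedule is 1, 1, 2, 2, 3, 3, so a launch costs as many iterations as the fixed-size variant while spending only a fraction of them on each size.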

Also, the Gini heatmap shows that uniform sampling averages the performance over the subsample size hyperparameter.

Figure 8. Gini Heatmap (1st voting scheme)
Figure 9. Gini Heatmap (2nd voting scheme)
Figure 10. Gini Heatmap (3rd voting scheme)
Figure 11. Gini Heatmap (4th voting scheme)
Figure 12. Average Gini by subsample size
Figure 13. Maximum Gini by subsample size
Figure 14. Gini Heatmap based on number of premises (uniform samplings)
Figure 15. Maximum Gini by subsample size (different samplings)

3.8 Interpretability: visualization of premises

When considering whether the algorithm is interpretable, we have to outline two important properties of interpretability:

1. Prediction is performed based on rules derived from the initial factors, preserving the initial feature space.
2. The algorithm processes the initially defined target attribute.

For example, an SVM with kernels does not have the first property, since classification is performed in an artificially constructed feature space.

Also, XGBoost lacks the second property, since each subsequent tree fits the errors of the previous one, which is not the initially defined target attribute of the object. Neural networks do not provide rules for a decision maker. So none of these example algorithms can be interpreted in this sense.

The situation is different with the query-based classification algorithm, since it works with premises.

Premises generated on numerical features are sets of intervals. In fact, they define an area in the initial feature space and serve as rules for a decision maker.

Therefore, a premise can be visualized as a hypercube in a space of dimension d, where d is the number of intervals (and features). To visualize the premise, one can project this hypercube onto a plane, that is, consider only two features for each premise. To visualize the premises, Kaggle data was used, and the two features RevolvingUtilizationOfUnsecuredLines and Age were taken.

Figure 16. Random positive and negative premises

Consider the undetermined object gτ, for which the values of all features fall within the intervals of a positive α-weak premise.

In this case, none of the intervals expands. Thus, if the values of the features lie inside the intervals of a positive α-weak premise, then with some confidence this object can be considered positive. The same holds for negative premises.

Figure 16 shows two positive premises and two negative premises on the plane of the two features. Positive premises are depicted in red and negative ones in blue.
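The check that an object falls inside a premise can be sketched as follows. This is a minimal illustration under the assumption that a premise is stored as a list of closed intervals, one per feature, in the same order as the object's feature values; the names are hypothetical.

```python
Interval = tuple[float, float]


def falls_within(premise: list[Interval], obj: list[float]) -> bool:
    """True if every feature value lies inside the matching closed interval."""
    return all(low <= value <= high for (low, high), value in zip(premise, obj))


# Hypothetical two-feature premise: utilization in [0, 1], age in [25, 40].
example_premise = [(0.0, 1.0), (25.0, 40.0)]
inside = falls_within(example_premise, [0.3, 30.0])
outside = falls_within(example_premise, [1.2, 30.0])
```

An object for which this check succeeds is exactly one that would not expand any interval if it were added to the premise.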

To construct each positive premise, two objects were randomly extracted from the set of positive examples. Then the meet operator was applied and a set of intervals was obtained. After that, only the intervals for the two features were kept. The same procedure was performed for negative premises on the set of negative examples.

Figure 17 shows five positive premises and five negative premises built by the same procedure. One can see that the positive and negative premises are localized in different areas. As more random premises are extracted, the boundary between good and bad regions becomes more and more obvious.

For 1000 random premises, the boundary is almost clear (see Figure 20). As the number of extracted negative and positive premises increases, the number of multiple intersections between them grows.

One can observe an expansion of the area with sparse positive premises: while negative premises are fixed at the interval from 0 to 1, positive intervals lie in the range from 0.4 to 2.

Figure 17. Random positive and negative premises
Figure 18. Random positive and negative premises
Figure 19. Random positive and negative premises
Figure 20. Random positive and negative premises

In addition, one can see disputable areas (depicted in purple), that is, areas of feature values that fall within both positive and negative intervals. Furthermore, for some premises the right border of the interval of the RevolvingUtilizationOfUnsecuredLines feature is 1.

But some premises have a right-hand boundary greater than 1. From this one can draw a conclusion about data errors or heterogeneity in the values of this feature (the premises were constructed on data without preprocessing). Thus, such visualization has additional practical value for a decision-making expert.

3.9 Computational time analysis

Decision time might be crucial when there is a large amount of incoming data. Therefore, we performed an additional "anytime decision" analysis to explore the incremental change in classification accuracy due to a gradual increase in computational time, i.e. the number of iterations.

For this analysis we used the hyperparameters that showed the best result during the grid search: alpha = 0.001, subsample_size = 2. We run the algorithm with different values of the parameter number_of_iterations and look at the variation of the Gini coefficient. In the background, computational time is measured. The procedure was also repeated for another value of the α threshold (alpha = 0.1) in order to have a benchmark. The results of the calculations are shown in Figure 21.

Figure 21. Gini coefficient dynamics
Figure 22. QBCA. Time required for computation

As the number of iterations grows, the Gini coefficient rapidly increases. After a certain point, the growth rate decreases and the quality of the classification improves only slightly. The calculation time depends linearly on the number of iterations (Figure 22).

We also performed a decision time analysis per test object.

In the case of a large inflow of credit applicants, it might be more important from a business standpoint to make a quick prediction rather than the most accurate one.

From Figure 24 we can see that on average the time required for classification varies from 0.5 to 1 second per object. The machine parameters are 755 GB RAM, Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz.

Figure 23. Average time required to produce classification per 1 object (mins)

This means that with an inflow of 10 000 applications the system will finalize predictions within approximately 2 hours.
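The estimate of roughly 2 hours can be checked with a short calculation. The sketch below assumes strictly sequential processing at the per-object times quoted above; the function name is hypothetical.

```python
def total_hours(n_objects: int, seconds_per_object: float) -> float:
    """Sequential processing time in hours."""
    return n_objects * seconds_per_object / 3600.0


fastest = total_hours(10_000, 0.5)  # lower bound, about 1.4 hours
slowest = total_hours(10_000, 1.0)  # upper bound, about 2.8 hours
```

At 0.5 to 1 second per object, 10 000 applications take between about 1.4 and 2.8 hours, so "approximately 2 hours" corresponds to the middle of that range.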

It has to be admitted that scorecards, decision trees and simple logistic regressions are a lot quicker when applied to incoming objects. For example, a data set of 10 000 applicants can be scored in 0.25-0.75 seconds, depending on the number of factors included in the model. This is understandable, since the model, once fitted, is applied in the same manner to all new incoming objects. In contrast, the proposed query-based classification algorithm produces predictions individually for each new object, e.g. a client, mining relevant rules for that object.

Nevertheless, classification via the query-based approach can be easily parallelized, which dramatically increases computational capacity. Also, as an area for further research, one can benefit from the pre-training step discussed in the previous subsection, which allows the decision maker to save time by applying pre-trained premises to incoming objects online.

The worst-case complexity estimate for the randomized interval pattern structure algorithm is $O((p(\sqcap) \cdot (s - 1) + p(\square)) \cdot n)$, where $p(\sqcap)$ is the time needed to compute the intersection of two descriptions, s is the number of objects in the subsample, $p(\square)$ is the time needed to compute the image of a description, and n is the number of iterations.
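The structure behind this estimate can be illustrated by counting operations in a simplified mining loop. This is a sketch under stated assumptions, not the dissertation's implementation: it assumes the standard componentwise meet for interval patterns, computes the image as the set of objects covered by the pattern, and omits the α-threshold filtering and the test object entirely; all names are hypothetical.

```python
import random

Interval = tuple[float, float]


def meet(a: list[Interval], b: list[Interval]) -> list[Interval]:
    """Intersection of two descriptions: componentwise covering intervals."""
    return [(min(lo1, lo2), max(hi1, hi2)) for (lo1, hi1), (lo2, hi2) in zip(a, b)]


def image(pattern: list[Interval], objects: list[list[float]]) -> list[int]:
    """Indices of all objects whose values lie inside every interval."""
    return [
        i for i, obj in enumerate(objects)
        if all(lo <= x <= hi for (lo, hi), x in zip(pattern, obj))
    ]


def mine_premises(examples: list[list[float]], all_objects: list[list[float]],
                  subsample_size: int, n_iterations: int, seed: int = 0):
    """Run n iterations; return premises plus counts of meet and image calls."""
    rng = random.Random(seed)
    meets = images = 0
    premises = []
    for _ in range(n_iterations):
        sample = rng.sample(examples, subsample_size)
        pattern = [(x, x) for x in sample[0]]
        for obj in sample[1:]:  # s - 1 meet computations per iteration
            pattern = meet(pattern, [(x, x) for x in obj])
            meets += 1
        extent = image(pattern, all_objects)  # one image computation per iteration
        images += 1
        premises.append((pattern, extent))
    return premises, meets, images
```

Each iteration performs s - 1 meets and one image computation, and the iterations do not depend on one another, which is consistent with the remark that the procedure parallelizes easily.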

Document details

Material: Randomized Algorithms Based on Interval Pattern Structures
Type: Candidate of Sciences dissertation
Subject: Technical sciences
Institution: НИУ ВШЭ (HSE University)
