Dissertation (file no. 1137066), page 7 · 2019-05-20 · СтудИзба

Randomized Algorithms Based on Interval Pattern Structures


However, the geometrical volume scheme can be tricky to use for weighting. Here is a small example. Consider a 4-dimensional numeric feature space and two premises:

\[ h_1 = \langle [0.5, 1.5],\ [250, 255],\ [-6, -5],\ [0.69, 0.69] \rangle \]
\[ h_2 = \langle [0.0, 2.0],\ [100, 500],\ [-40, -1],\ [0.30, 0.30] \rangle \]

The geometrical volume of both premises equals zero because the fourth interval is a point. However, the first premise is obviously more specific than the second one: each of its first three intervals lies inside the corresponding interval of the second. This situation arises especially often when dummy variables are added to the numerical features.
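The example above can be checked numerically. The following is a minimal sketch; the helper names `volume` and `total_width` are illustrative and do not come from the dissertation.

```python
from math import prod

# The two premises from the example, as lists of (left, right) interval borders.
h1 = [(0.5, 1.5), (250, 255), (-6, -5), (0.69, 0.69)]
h2 = [(0.0, 2.0), (100, 500), (-40, -1), (0.30, 0.30)]


def volume(premise):
    """Geometrical volume: the product of the interval widths."""
    return prod(b - a for a, b in premise)


def total_width(premise):
    """Sum of the interval widths (raw, not normalized)."""
    return sum(b - a for a, b in premise)


# A single point interval forces the whole product to zero.
print(volume(h1), volume(h2))
# The summed widths still tell the two premises apart.
print(total_width(h1), total_width(h2))
```

Both volumes evaluate to zero, while the summed widths (7 versus 441 in raw units) still distinguish the two premises.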

Moreover, and this seems critical, in a one-versus-many situation the standalone premise is likely to win, since its total geometrical volume is bound to be lower. To tackle this problem, one can modify the aggregating operator and consider the average volume of a premise:

PA(h) = h dim h

However, the issue with single-point intervals remains. Therefore, we introduce another weighting scheme based on the length of the edges of the parallelepiped rather than its geometrical volume. The intuition stays the same: the wider the intervals of a premise, the less weight the premise should receive.

We have to mention that the attributes usually have different scales and, as with k-nearest neighbors, it is reasonable to normalize the feature space. After that, changes in interval width have a comparable impact on the weight of a premise. We consider linear normalization based on the maximum and minimum values within the sample. So, for each attribute X we have:

\[ X_{norm} = \frac{X - \min(X)}{\max(X) - \min(X)} \]

Having $X_{norm} \in [0, 1]$, we can define the weight of a premise based on its total length of edges:

\[ \omega(h) = \frac{M - \sum_{k=1}^{M} \left( b_k^h - a_k^h \right)}{M} \]

where M is the number of attributes, and $a_k^h$, $b_k^h$ are the left and right borders of the normalized intervals of the premise, respectively.

This form of the weighting function has several advantages. First, it is easy to calculate. Second, in contrast to the volume measure, there is no issue with single-point intervals. Third, it is monotonic with intuitive bounds: zero weight corresponds to a premise with the min-max intervals of the whole dataset, and unit weight corresponds to M-dimensional point-like premises.
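The normalization and the edge-length weight described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and mapping a constant attribute to a zero-width interval is an assumption that the text does not address.

```python
def min_max_normalize(column):
    """Linear normalization of one attribute to [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant attribute is mapped to 0 for every object.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def normalize_premise(premise, mins, maxs):
    """Rescale the interval borders of a premise with per-attribute min and max."""
    result = []
    for (a, b), lo, hi in zip(premise, mins, maxs):
        if hi == lo:
            result.append((0.0, 0.0))  # same assumption as above
        else:
            result.append(((a - lo) / (hi - lo), (b - lo) / (hi - lo)))
    return result


def edge_weight(premise):
    """omega(h) = (M - sum of normalized interval widths) / M."""
    m = len(premise)
    return (m - sum(b - a for a, b in premise)) / m


# The bounds mentioned in the text:
print(edge_weight([(0.0, 1.0), (0.0, 1.0)]))  # whole-dataset intervals
print(edge_weight([(0.3, 0.3), (0.7, 0.7)]))  # point-like premise
```

The two printed cases correspond to the bounds stated above: the first premise spans the full min-max range and gets zero weight, while the point-like premise gets unit weight.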

The voting rule with the summation aggregation operator takes the following form:

\[ F(g_{test}, h_1^+, \dots, h_p^+, h_1^-, \dots, h_n^-) = \sum_{i=1}^{p} \frac{M - \sum_{k=1}^{M} \left( b_k^{h_i^+} - a_k^{h_i^+} \right)}{M} \otimes \sum_{j=1}^{n} \frac{M - \sum_{k=1}^{M} \left( b_k^{h_j^-} - a_k^{h_j^-} \right)}{M} \]

The comparing operator is preserved in its usual form:

\[ a \otimes b = \begin{cases} \operatorname{sign}(b - a), & \text{if } a \neq b \\ \emptyset, & \text{if } a = b \end{cases} \]

A possible shortcoming is that, for the same total length of edges, premises with a "more cubic" shape (with almost equal intervals) generally tend to have more objects in their image than "parallelepiped-like" premises. This is a sensible assumption since their volume is higher. So, under the proposed measure, two premises with the same weight may have dramatically different support. That is why the proposed measure can be supplemented with the number of objects in the image of the premise.
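The voting rule and the comparing operator can be sketched as below. This is a hedged illustration: the function names are hypothetical, premises are assumed to be given on the normalized scale, and the mapping from the returned sign to a class label is not specified in this section, so the sketch returns the raw sign.

```python
def edge_weight(premise):
    """(M - sum of normalized interval widths) / M for one premise."""
    m = len(premise)
    return (m - sum(b - a for a, b in premise)) / m


def compare(a, b):
    """The comparing operator: sign(b - a) if a != b, otherwise None (abstain)."""
    if a == b:
        return None
    return 1 if b > a else -1


def vote(positive_premises, negative_premises, weight=edge_weight):
    """Sum the weights on each side and compare the two totals."""
    pos_total = sum(weight(h) for h in positive_premises)
    neg_total = sum(weight(h) for h in negative_premises)
    return compare(pos_total, neg_total)


# One narrow positive premise against one maximally wide negative premise.
print(vote([[(0.2, 0.2), (0.5, 0.5)]], [[(0.0, 1.0), (0.0, 1.0)]]))
```

Passing a different `weight` function is one way to plug in a support-aware variant. Exact float equality in `compare` mirrors the formula; a tolerance may be preferable in practice.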

So, the weight function is redefined as follows, where $|h|$ is the number of objects in the image of the premise:

\[ \omega(h) = |h| \cdot \frac{M - \sum_{k=1}^{M} \left( b_k^h - a_k^h \right)}{M} \]

\[ F(g_{test}, h_1^+, \dots, h_p^+, h_1^-, \dots, h_n^-) = \sum_{i=1}^{p} |h_i^+| \cdot \frac{M - \sum_{k=1}^{M} \left( b_k^{h_i^+} - a_k^{h_i^+} \right)}{M} \otimes \sum_{j=1}^{n} |h_j^-| \cdot \frac{M - \sum_{k=1}^{M} \left( b_k^{h_j^-} - a_k^{h_j^-} \right)}{M} \]

We will focus on the voting scheme based on the total number of unique objects within the premises.

3.5 Experiments with Top-10 Bank Data

In this section, we test the query-based classification algorithm on credit scoring data of a top-10 Russian bank. The data used for the computation represent the customers and their metrics assessed on the date of loan application.

The applications were approved under the bank credit policy and the clients were granted the loans. After that, the loans were observed for delinquency. The dataset is divided into two sets of positive and negative examples. The set of positive examples is the set of loans where the target attribute is 1. The target attribute in credit scoring is typically defined as more than 90 days of delinquency within the first 12 months after loan origination. So, the set of positive examples is the set of bad borrowers, and the set of negative examples consists of good ones.

Each set of examples consists of 1000 objects, so that the voting scheme discussed in the second section is applicable. The test dataset consists of 300 objects and is drawn from the same population as the positive and negative examples. Attributes represent various metrics such as loan amount, term, rate, payment-to-income ratio, age of the borrower, undocumented-to-documented income, credit history metrics, etc. The set of attributes used for the query-based classification contains 28 numerical attributes.

In order to evaluate the accuracy of the classification, we calculate the Gini coefficient for every combination of hyperparameters based on 300 predictions on the test set.
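The exact Gini formula is not given in this section. A common convention in credit scoring, assumed here, is Gini = 2 · AUC − 1, where AUC is the probability that a randomly chosen bad borrower receives a higher score than a randomly chosen good one. The sketch below uses that convention; the function names and the toy scores are illustrative, and the treatment of test objects rejected from classification is left open.

```python
def auc(scores_bad, scores_good):
    """Pairwise AUC: share of (bad, good) pairs ranked correctly, ties count half."""
    wins = 0.0
    for sb in scores_bad:
        for sg in scores_good:
            if sb > sg:
                wins += 1.0
            elif sb == sg:
                wins += 0.5
    return wins / (len(scores_bad) * len(scores_good))


def gini(scores_bad, scores_good):
    """Gini coefficient under the assumed convention Gini = 2 * AUC - 1."""
    return 2.0 * auc(scores_bad, scores_good) - 1.0


# Toy scores: a perfectly separating score and an uninformative one.
print(gini([3, 4], [1, 2]))
print(gini([1, 2], [1, 2]))
```

A perfectly separating score yields a Gini of 1, and a score that carries no ranking information yields 0.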

The Gini coefficient is calculated based on the margin between the number of objects within positive premises and negative ones. In fact, the margin is the analog of the score value in credit scorecards. We also report the percentage of rejects from classification. Abstention from classification can arise when no premise is found. This can be typical for a low number of iterations and a low alpha-threshold. In order to visualize the amount of rejections (Figure 1), we fix the number of iterations (at 2000) and analyze how the reject rate depends on the two other hyperparameters.

When the subsample size is low, the intersections of the test object description and positive (negative) examples tend to be more specific.

That is why a relatively high number of premises are mined and used for the classification. As the subsample size increases, the candidates for premises become generic, and it is likely that there exist a certain number of objects from the set of opposite examples which also satisfy the description. If the alpha-threshold is low, the frequency of rejects from classification is high.

Figure 1. Rejection rate as a function of subsample size and alpha-threshold (with fixed number of iterations)

The dynamics of premise mining is demonstrated in the following graphs:

Figure 2. The dynamics of α-weak positive premises mining

Figure 3. The dynamics of negative α-weak premises mining

The average number of premises mined for a test object drops, as expected, as the subsample size increases, and the drop is quicker for higher alpha-thresholds.

This supports the idea that if lazy classification is run in its original setting on numerical data (i.e. when the subsample consists of only one object), the number of premises generated is close to the number of objects, so the premises can be considered too specific. The descriptive graphs above suggest that the proposed hyperparameters of the algorithm can be tuned (grid searched) so as to handle the trade-off between the high number of premises used for classification and the size of their support.
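The role of the three hyperparameters can be illustrated with a rough sketch of the premise-mining loop. Everything here is an assumption made for illustration, since the formal definitions are given elsewhere in the dissertation: the intersection of the test object with a subsample is taken to be the componentwise interval hull, and a candidate is accepted when the share of opposite-class objects it covers does not exceed the alpha-threshold. The function names are hypothetical.

```python
import random


def interval_hull(objects):
    """Componentwise [min, max] intervals spanning all given objects."""
    return [(min(col), max(col)) for col in zip(*objects)]


def covers(premise, obj):
    """True if every attribute value of obj lies inside the matching interval."""
    return all(a <= x <= b for (a, b), x in zip(premise, obj))


def mine_premises(g_test, same_class, opposite_class,
                  subsample_size, alpha, n_iter, rng):
    """Draw random subsamples and keep candidates that pass the alpha check."""
    premises = []
    for _ in range(n_iter):
        sample = rng.sample(same_class, subsample_size)
        candidate = interval_hull(sample + [g_test])
        falsifiers = sum(covers(candidate, obj) for obj in opposite_class)
        # Assumed acceptance rule: tolerated share of opposite-class objects.
        if falsifiers <= alpha * len(opposite_class):
            premises.append(candidate)
    return premises


rng = random.Random(0)
positives = [(0.0, 0.0), (1.0, 1.0)]
mined = mine_premises((0.5, 0.5), positives, [(5.0, 5.0)],
                      subsample_size=1, alpha=0.0, n_iter=3, rng=rng)
print(len(mined))
```

With `subsample_size=1` the sketch corresponds to the original lazy setting mentioned above. Larger subsamples produce wider hulls, which are more likely to cover opposite-class objects and so to be discarded at a low `alpha`; if no candidate survives on either side, the test object is rejected from classification.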

The average number of positive premises tends to fall slightly faster compared to negative premises.

Below we present the classification accuracy obtained for different combinations of hyperparameters (grid search).

Table 1. Gini coefficients for the hyperparameters grid search (columns 0.1% to 0.9%: subsample size)

Alpha-threshold  Iterations  0.1%  0.2%  0.3%  0.4%  0.5%  0.6%  0.7%  0.8%  0.9%
0.0%             100         40%   44%   39%   18%   1%    0%    0%    0%    0%
0.0%             150         35%   46%   35%   5%    0%    0%    0%    0%    0%
0.0%             200         42%   37%   36%   12%   5%    1%    0%    0%    0%
0.0%             500         39%   44%   44%   25%   6%    1%    0%    0%    0%
0.0%             1000        44%   47%   44%   41%   11%   3%    0%    0%    0%
0.0%             2000        44%   48%   46%   36%   17%   4%    0%    0%    0%
0.1%             100         33%   37%   40%   40%   44%   43%   34%   32%   34%
0.1%             150         41%   34%   33%   43%   41%   47%   41%   37%   37%
0.1%             200         40%   40%   34%   42%   51%   43%   44%   41%   36%
0.1%             500         37%   42%   47%   49%   51%   49%   43%   41%   34%
0.1%             1000        37%   42%   46%   48%   49%   48%   43%   43%   37%
0.1%             2000        39%   43%   45%   49%   51%   49%   46%   41%   38%
0.1%             5000        43%   40%   44%   49%   46%   50%   48%   38%   36%
0.2%             100         29%   38%   42%   32%   43%   37%   46%   43%   37%
0.2%             150         27%   42%   41%   41%   36%   47%   48%   45%   41%
0.2%             200         32%   40%   43%   42%   42%   49%   46%   47%   48%
0.2%             500         39%   46%   46%   48%   47%   48%   51%   48%   51%
0.2%             1000        41%   50%   48%   47%   49%   53%   52%   52%   47%
0.2%             2000        38%   48%   50%   48%   47%   53%   52%   53%   50%
0.3%             100         35%   38%   39%   42%   39%   45%   34%   45%   39%
0.3%             150         27%   43%   44%   42%   42%   39%   37%   40%   46%
0.3%             200         34%   46%   47%   45%   49%   47%   45%   45%   52%
0.3%             500         31%   45%   49%   50%   49%   46%   50%   51%   47%
0.3%             1000        37%   48%   49%   49%   49%   47%   52%   51%   51%
0.3%             2000        38%   46%   48%   51%   51%   50%   50%   52%   52%
0.3%             5000        40%   47%   46%   51%   52%   51%   49%   51%   53%
0.3%             10000       40%   44%   43%   46%   46%   48%   50%   52%   54%
0.3%             20000       40%   43%   42%   46%   47%   49%   50%   52%   53%
0.4%             100         28%   39%   44%   48%   43%   50%   53%   42%   49%
0.4%             150         34%   42%   43%   42%   43%   52%   50%   45%   47%
0.4%             200         33%   46%   43%   47%   51%   49%   49%   42%   45%
0.4%             500         37%   50%   50%   49%   49%   49%   51%   47%   48%
0.4%             1000        40%   48%   50%   50%   51%   52%   50%   48%   50%
0.4%             2000        37%   48%   49%   49%   49%   47%   52%   49%   51%
0.4%             5000        39%   42%   42%   43%   45%   47%   49%   52%   49%

We observe an area with zero Gini coefficients where the alpha-threshold is zero and the subsample size is relatively high.

This is because almost no premises were mined during the query-based classification run. It is quite intuitive: as the subsample size grows, the intersection of the subsample with a test object results in a generic description, which is very likely to be falsified by at least one object from the set of opposite examples.

In this case, rejection from classification takes place for almost all test objects. Another quite intuitive observation is that the more iterations are performed, the higher the Gini is on average:

Figure 4. Average Gini grouped by the number of iterations (over all other hyperparameter values)

The more times the subsamples are randomly extracted, the more knowledge (in terms of premises) is generated.

Characteristics

File type: PDF
Size: 3.66 MB
Material: Randomized Algorithms Based on Interval Pattern Structures
Material type: Candidate of Sciences dissertation
Subject: Technical sciences
Institution: НИУ ВШЭ
