Building Machine Learning Systems with Python, page 8


That is, we can see which feature values will result in one decision versus the other and exactly where the boundary is. In the following screenshot, we see two regions: one is white and the other is shaded in grey. Any datapoint that falls on the white region will be classified as Iris Virginica, while any point that falls on the shaded side will be classified as Iris Versicolor.

In a threshold model, the decision boundary will always be a line that is parallel to one of the axes. The plot in the preceding screenshot shows the decision boundary and the two regions where points are classified as either white or grey.
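The boundary is axis-parallel because a threshold model consults only a single feature when it decides. As a rough sketch of such a decision rule (the model layout used here, a threshold value, a feature index, and a direction flag, is an assumption for illustration and not necessarily the book's exact code):

import numpy as np

def predict(model, features):
    # Hypothetical model layout: (threshold, feature index, direction flag).
    t, fi, reverse = model
    if reverse:
        # True means the example is predicted to be Iris Virginica.
        return features[:, fi] <= t
    return features[:, fi] > t

Because the rule compares one feature column against one number, the boundary it draws in any two-dimensional plot of the data is necessarily a vertical or horizontal line.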

The plot also shows (as a dashed line) an alternative threshold, which will achieve exactly the same accuracy. Our method chose the first threshold it saw, but that was an arbitrary choice.

Evaluation – holding out data and cross-validation

The model discussed in the previous section is a simple model; it achieves 94 percent accuracy on the whole data. However, this evaluation may be overly optimistic. We used the data to define what the threshold would be, and then we used the same data to evaluate the model. Of course, the model will perform better than anything else we tried on this dataset. The reasoning is circular.

What we really want to do is estimate the ability of the model to generalize to new instances. We should measure its performance on instances that the algorithm has not seen during training.

Therefore, we are going to do a more rigorous evaluation and use held-out data. For this, we are going to break up the data into two groups: on one group, we'll train the model, and on the other, we'll test the one we held out of training. The full code, which is an adaptation of the code presented earlier, is available on the online support repository. Its output is as follows:

Training accuracy was 96.0%.
Testing accuracy was 90.0% (N = 50).
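The full adaptation lives on the support repository; as a rough sketch of what that held-out evaluation could look like (assuming the fit_model and predict helpers and the features and is_virginica arrays used elsewhere in this chapter, and an illustrative deterministic 50/50 split rather than the book's exact code):

import numpy as np

# Hypothetical split: even-numbered examples train, odd-numbered examples test.
# A random shuffle of the indices would work just as well for this sketch.
n = len(features)
train = np.zeros(n, bool)
train[::2] = True
test = ~train

model = fit_model(features[train], is_virginica[train])

train_acc = np.mean(predict(model, features[train]) == is_virginica[train])
test_acc = np.mean(predict(model, features[test]) == is_virginica[test])
print('Training accuracy was {0:.1%}.'.format(train_acc))
print('Testing accuracy was {0:.1%} (N = {1}).'.format(test_acc, test.sum()))

The exact accuracies depend on which examples land in each half; the numbers quoted above come from the book's own version of this split.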

The result on the training data (which is a subset of the whole data) is apparently even better than before. However, what is important to note is that the result on the testing data is lower than the training accuracy. While this may surprise an inexperienced machine learner, it is expected that testing accuracy will be lower than the training accuracy.

To see why, look back at the plot that showed the decision boundary. Consider what would have happened if some of the examples close to the boundary were not there, or if one of the examples between the two lines was missing. It is easy to imagine that the boundary would then move a little bit to the right or to the left so as to put them on the wrong side of the border.

The accuracy on the training data, the training accuracy, is almost always an overly optimistic estimate of how well your algorithm is doing. We should always measure and report the testing accuracy, which is the accuracy on a collection of examples that were not used for training.

These concepts will become more and more important as the models become more complex.

In this example, the difference between the accuracy measured on training data and on testing data is not very large. When using a complex model, it is possible to get 100 percent accuracy in training and do no better than random guessing on testing!

One possible problem with what we did previously, which was to hold out data from training, is that we only used half the data for training.

Perhaps it would have been better to use more training data. On the other hand, if we then leave too little data for testing, the error estimation is performed on a very small number of examples. Ideally, we would like to use all of the data for training and all of the data for testing as well, which is impossible.

We can achieve a good approximation of this impossible ideal by a method called cross-validation. One simple form of cross-validation is leave-one-out cross-validation. We will take an example out of the training data, learn a model without this example, and then test whether the model classifies this example correctly. This process is then repeated for all the elements in the dataset.

The following code implements exactly this type of cross-validation:

>>> correct = 0.0
>>> for ei in range(len(features)):
        # select all but the one at position `ei`:
        training = np.ones(len(features), bool)
        training[ei] = False
        testing = ~training
        model = fit_model(features[training], is_virginica[training])
        predictions = predict(model, features[testing])
        correct += np.sum(predictions == is_virginica[testing])
>>> acc = correct/float(len(features))
>>> print('Accuracy: {0:.1%}'.format(acc))
Accuracy: 87.0%

At the end of this loop, we will have tested a series of models on all the examples and have obtained a final average result.

When using cross-validation, there is no circularity problem because each example was tested on a model which was built without taking that datapoint into account. Therefore, the cross-validated estimate is a reliable estimate of how well the models would generalize to new data.

The major problem with leave-one-out cross-validation is that we are now forced to perform many times more work. In fact, you must learn a whole new model for each and every example, and this cost will increase as our dataset grows.

We can get most of the benefits of leave-one-out at a fraction of the cost by using x-fold cross-validation, where x stands for a small number.

For example, to perform five-fold cross-validation, we break up the data into five groups, the so-called folds. Then you learn five models: each time you will leave one fold out of the training data. The resulting code will be similar to the code given earlier in this section, but we leave 20 percent of the data out instead of just one element.

We test each of these models on the left-out fold and average the results.

The preceding figure illustrates this process for five blocks: the dataset is split into five pieces. For each fold, you hold out one of the blocks for testing and train on the other four. You can use any number of folds you wish. There is a trade-off between computational efficiency (the more folds, the more computation is necessary) and accurate results (the more folds, the closer you are to using the whole of the data for training). Five folds is often a good compromise. This corresponds to training with 80 percent of your data, which should already be close to what you will get from using all the data.
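A rough sketch of this five-fold variant, reusing the hypothetical fit_model and predict helpers from above and assigning examples to folds by simply cycling through the indices (one simple choice among many), might look like this:

import numpy as np

k = 5
n = len(features)
# Assign each example to one of k folds by cycling through the indices.
fold = np.arange(n) % k

accuracies = []
for f in range(k):
    testing = (fold == f)
    training = ~testing
    model = fit_model(features[training], is_virginica[training])
    predictions = predict(model, features[testing])
    accuracies.append(np.mean(predictions == is_virginica[testing]))

# The cross-validated estimate is the average over the five held-out folds.
print('Mean accuracy: {0:.1%}'.format(np.mean(accuracies)))

Cycling through the indices keeps the folds roughly balanced even when the dataset is ordered by class, which anticipates the point about balance made below.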

If you have little data, you can even consider using 10 or 20 folds. In the extreme case, if you have as many folds as datapoints, you are simply performing leave-one-out cross-validation. On the other hand, if computation time is an issue and you have more data, 2 or 3 folds may be the more appropriate choice.

When generating the folds, you need to be careful to keep them balanced. For example, if all of the examples in one fold come from the same class, then the results will not be representative. We will not go into the details of how to do this, because the machine learning package scikit-learn will handle them for you.
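For instance, a sketch of letting scikit-learn generate balanced folds (using StratifiedKFold from sklearn.model_selection; this import path reflects current scikit-learn versions and may differ from the version the book targets) could look like this:

import numpy as np
from sklearn.model_selection import StratifiedKFold

# StratifiedKFold keeps the class proportions roughly equal in every fold.
skf = StratifiedKFold(n_splits=5)
accuracies = []
for training, testing in skf.split(features, is_virginica):
    model = fit_model(features[training], is_virginica[training])
    predictions = predict(model, features[testing])
    accuracies.append(np.mean(predictions == is_virginica[testing]))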

So, "What final modeldo we return and use for new data?" The simplest solution is now to train a singleoverall model on all your training data. The cross-validation loop gives you anestimate of how well this model should generalize.A cross-validation schedule allows you to use all your datato estimate whether your methods are doing well. At the endof the cross-validation loop, you can then use all your data totrain a final model.Although it was not properly recognized when machine learning was starting out asa field, nowadays, it is seen as a very bad sign to even discuss the training accuracyof a classification system.

Although it was not properly recognized when machine learning was starting out as a field, nowadays it is seen as a very bad sign to even discuss the training accuracy of a classification system. This is because the results can be very misleading and even just presenting them marks you as a newbie in machine learning. We always want to measure and compare either the error on a held-out dataset or the error estimated using a cross-validation scheme.

Building more complex classifiers

In the previous section, we used a very simple model: a threshold on a single feature. Are there other types of systems? Yes, of course! Many others. Throughout this book, you will see many other types of models and we're not even going to cover everything that is out there.

To think of the problem at a higher abstraction level, "What makes up a classification model?" We can break it up into three parts:

• The structure of the model: How exactly will a model make decisions? In this case, the decision depended solely on whether a given feature was above or below a certain threshold value. This is too simplistic for all but the simplest problems.

• The search procedure: How do we find the model we need to use? In our case, we tried every possible combination of feature and threshold. You can easily imagine that as models get more complex and datasets get larger, it rapidly becomes impossible to attempt all combinations and we are forced to use approximate solutions. In other cases, we need to use advanced optimization methods to find a good solution (fortunately, scikit-learn already implements these for you, so using them is easy even if the code behind them is very advanced). A sketch of the exhaustive search used in our case appears after this list.

• The gain or loss function: How do we decide which of the possibilities tested should be returned? Rarely do we find the perfect solution, the model that never makes any mistakes, so we need to decide which one to use. We used accuracy, but sometimes it will be better to optimize so that the model makes fewer errors of a specific kind.
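As a hedged sketch of that exhaustive search (mirroring the spirit of the chapter's threshold code, using the same hypothetical model layout of threshold, feature index, and direction flag as the predict sketch above; not necessarily the book's exact implementation):

import numpy as np

def fit_model(features, labels):
    # Try every feature and every observed value as a candidate threshold,
    # in both directions, and keep the combination with the highest accuracy.
    best_acc = -1.0
    best_model = None
    for fi in range(features.shape[1]):
        for t in features[:, fi]:
            for reverse in (False, True):
                pred = features[:, fi] <= t if reverse else features[:, fi] > t
                acc = np.mean(pred == labels)
                if acc > best_acc:
                    best_acc = acc
                    best_model = (t, fi, reverse)
    return best_model

The model returned by this search plugs directly into the predict sketch shown earlier and into the cross-validation loops above.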
