Building machine learning systems with Python (779436), page 6

Getting Started with Python Machine Learning

In addition to the preceding plotting instructions, we simply add the following code:

    fx = sp.linspace(0, x[-1], 1000) # generate X-values for plotting
    plt.plot(fx, f1(fx), linewidth=4)
    plt.legend(["d=%i" % f1.order], loc="upper left")

This will produce the following plot:

It seems like the first 4 weeks are not that far off, although we clearly see that there is something wrong with our initial assumption that the underlying model is a straight line. And then, how good or how bad actually is the error of 317,389,767.34?

The absolute value of the error is seldom of use in isolation.

However, when comparing two competing models, we can use their errors to judge which one of them is better. Although our first model clearly is not the one we would use, it serves a very important purpose in the workflow. We will use it as our baseline until we find a better one. Whatever model we come up with in the future, we will compare it against the current baseline.

Towards some advanced stuff

Let's now fit a more complex model, a polynomial of degree 2, to see whether it better understands our data:

    >>> f2p = sp.polyfit(x, y, 2)
    >>> print(f2p)
    array([ 1.05322215e-02, -5.26545650e+00, 1.97476082e+03])
    >>> f2 = sp.poly1d(f2p)
    >>> print(error(f2, x, y))
    179983507.878

You will get the following plot:

The error is 179,983,507.878, which is almost half the error of the straight line model. This is good, but unfortunately it comes with a price: we now have a more complex function, meaning that we have one more parameter to tune inside polyfit().

The fitted polynomial is as follows:

    f(x) = 0.0105322215 * x**2 - 5.26545650 * x + 1974.76082

So, if more complexity gives better results, why not increase the complexity even more? Let's try it for degrees 3, 10, and 100.

Interestingly, the legend shows d=53 rather than d=100 for the polynomial that had been fitted with degree 100. In addition, we see lots of warnings on the console:

    RankWarning: Polyfit may be poorly conditioned

This means that, because of numerical errors, polyfit cannot determine a good fit with degree 100. Instead, it figured that 53 must be good enough.

It seems that the curves capture the fitted data better the more complex they get. The errors seem to tell the same story:

    Error d=1: 317,389,767.339778
    Error d=2: 179,983,507.878179
    Error d=3: 139,350,144.031725
    Error d=10: 121,942,326.363461
    Error d=53: 109,318,004.475556

However, taking a closer look at the fitted curves, we start to wonder whether they also capture the true process that generated that data.
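The trend in the training errors can be reproduced on a small scale. The following is only a sketch under stated assumptions: the data is synthetic (made up here, not the web-traffic file used in the text), the `error` helper is assumed to be the sum of squared residuals, and NumPy's `polyfit` and `poly1d` stand in for the `sp.` calls. It stops at degree 10 to stay clear of the conditioning problems mentioned above.

```python
import numpy as np


def error(f, x, y):
    """Sum of squared residuals of model f on the points (x, y) (assumed definition)."""
    return float(np.sum((f(x) - y) ** 2))


# Synthetic stand-in for the hourly traffic data (made up for illustration).
rng = np.random.default_rng(0)
x = np.arange(1.0, 201.0)
y = 1000 + 0.05 * (x - 60) ** 2 + rng.normal(0, 50, size=x.size)

train_errors = {}
for degree in (1, 2, 3, 10):
    # Fit a polynomial of the given degree and wrap it as a callable model.
    model = np.poly1d(np.polyfit(x, y, degree))
    # The error is measured on the same points the model was fitted to.
    train_errors[degree] = error(model, x, y)
    print("Error d=%i: %f" % (degree, train_errors[degree]))
```

Because each higher-degree polynomial contains the lower-degree ones as special cases, the error measured on the fitting data can only stay the same or shrink as the degree grows. That is why this number alone cannot tell us which model to pick.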

Framed differently, do our models correctly represent the underlying mass behavior of customers visiting our website? Looking at the polynomials of degree 10 and 53, we see wildly oscillating behavior. It seems that the models are fitted too much to the data, so much that they now capture not only the underlying process but also the noise.

This is called overfitting.

At this point, we have the following choices:

• Choosing one of the fitted polynomial models.
• Switching to another more complex model class. Splines?
• Thinking differently about the data and starting again.

Out of the five fitted models, the first order model clearly is too simple, and the models of order 10 and 53 are clearly overfitting. Only the second and third order models seem to somehow match the data.

However, if we extrapolate them at both borders, we see them going berserk.

Switching to a more complex class also does not seem to be the right way to go. What arguments would back which class? At this point, we realize that we probably have not fully understood our data.

Stepping back to go forward – another look at our data

So, we step back and take another look at the data. It seems that there is an inflection point between weeks 3 and 4.

So let's separate the data and train two lines using week 3.5 as a separation point:

    inflection = int(3.5*7*24) # the inflection point in hours, as an integer slice index
    xa = x[:inflection] # data before the inflection point
    ya = y[:inflection]
    xb = x[inflection:] # data after
    yb = y[inflection:]
    fa = sp.poly1d(sp.polyfit(xa, ya, 1))
    fb = sp.poly1d(sp.polyfit(xb, yb, 1))
    fa_error = error(fa, xa, ya)
    fb_error = error(fb, xb, yb)
    print("Error inflection=%f" % (fa_error + fb_error))
    Error inflection=132950348.197616

For the first line, we train with the data up to week 3.5, and for the second line we train with the remaining data.

Clearly, the combination of these two lines seems to be a much better fit to the data than anything we have modeled before.

But still, the combined error is higher than that of the higher order polynomials. Can we trust the error at the end?

Asked differently, why do we trust the straight line fitted only at the last week of our data more than any of the more complex models? It is because we assume that it will capture future data better.
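One way to probe that assumption is to evaluate fitted models beyond the range they were trained on. The sketch below is illustrative only: the data is synthetic with a known linear trend (so the "true" future is available, which it never is in practice), and the names `line`, `wiggly`, and `future` are hypothetical.

```python
import numpy as np

# Synthetic traffic with a known linear trend plus noise (made up for illustration).
rng = np.random.default_rng(1)
hours = np.arange(0.0, 200.0)
hits = 1000 + 5 * hours + rng.normal(0, 50, size=hours.size)

# A simple model and a very flexible one, both fitted on the observed range only.
line = np.poly1d(np.polyfit(hours, hits, 1))
wiggly = np.poly1d(np.polyfit(hours, hits, 10))

# Evaluate both models on hours they have never seen.
future = np.arange(200.0, 300.0)
true_future = 1000 + 5 * future  # known only because the data is synthetic

line_dev = float(np.max(np.abs(line(future) - true_future)))
wiggly_dev = float(np.max(np.abs(wiggly(future) - true_future)))
print("max deviation d=1:  %f" % line_dev)
print("max deviation d=10: %f" % wiggly_dev)
```

The degree-10 polynomial follows the noise inside the observed range, and its high-order terms dominate as soon as it leaves that range, so its extrapolation drifts much further from the trend than the straight line does.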

If we plot the models into the future, we see how right we are (d=1 is again our initial straight line).

The models of degree 10 and 53 don't seem to expect a bright future for our start-up. They tried so hard to model the given data correctly that they are clearly useless to extrapolate beyond. This is called overfitting. On the other hand, the lower degree models do not seem capable of capturing the data well enough. This is called underfitting.

So let's play fair to models of degree 2 and above and try out how they behave if we fit them only to the data of the last week. After all, we believe that the last week says more about the future than the data prior to it.

The result can be seen in the following psychedelic chart, which further shows how bad the problem of overfitting is.

Still, judging from the errors of the models when trained only on the data from week 3.5 and later, we should choose the most complex one (note that we also calculate the error only on the time after the inflection point):

    Error d=1: 22,143,941.107618
    Error d=2: 19,768,846.989176
    Error d=3: 19,766,452.361027
    Error d=10: 18,949,339.348539
    Error d=53: 18,300,702.038119

Training and testing

If we only had some data from the future that we could use to measure our models against, then we should be able to judge our model choice only on the resulting approximation error.

Although we cannot look into the future, we can and should simulate a similar effect by holding out a part of our data.

Let's remove, for instance, a certain percentage of the data and train on the remaining part. Then we use the held-out data to calculate the error. As the model has been trained without knowing the held-out data, we should get a more realistic picture of how the model will behave in the future.

The test errors for the models trained only on the time after the inflection point now show a completely different picture:

    Error d=1: 6397694.386394
    Error d=2: 6010775.401243
    Error d=3: 6047678.658525
    Error d=10: 7037551.009519
    Error d=53: 7052400.001761

Have a look at the following plot:

It seems that we finally have a clear winner: the model with degree 2 has the lowest test error, which is the error when measured using data that the model did not see during training.
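The hold-out procedure can be sketched as follows. This is a minimal illustration, not the code behind the numbers above: the data is synthetic, the 30 percent test fraction and the random seed are arbitrary choices made here, and `error` is assumed to be the sum of squared residuals. The index arrays are named `train` and `test` to match the `xb[train]` usage that appears in the text.

```python
import numpy as np


def error(f, x, y):
    """Sum of squared residuals of model f on the points (x, y) (assumed definition)."""
    return float(np.sum((f(x) - y) ** 2))


# Synthetic stand-in for the data after the inflection point (made up for illustration).
rng = np.random.default_rng(42)
xb = np.arange(0.0, 150.0)
yb = 2000 + 0.3 * xb ** 2 + rng.normal(0, 100, size=xb.size)

# Hold out a random 30% of the points for testing (fraction chosen arbitrarily here).
frac = 0.3
split_idx = int(frac * len(xb))
shuffled = rng.permutation(len(xb))
test = np.sort(shuffled[:split_idx])
train = np.sort(shuffled[split_idx:])

test_errors = {}
for degree in (1, 2, 3, 10):
    # Fit on the training points only ...
    model = np.poly1d(np.polyfit(xb[train], yb[train], degree))
    # ... and measure the error on points the model has never seen.
    test_errors[degree] = error(model, xb[test], yb[test])
    print("Error d=%i: %f" % (degree, test_errors[degree]))
```

Unlike the training error, the test error is not guaranteed to shrink as the degree grows, which is what makes it usable for choosing between models.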

And this gives us hope that we won't get bad surprises when future data arrives.

Answering our initial question

Finally, we have arrived at a model which we think represents the underlying process best; it is now a simple task of finding out when our infrastructure will reach 100,000 requests per hour. We have to calculate when our model function reaches the value 100,000.

Having a polynomial of degree 2, we could simply compute the inverse of the function and calculate its value at 100,000.
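For a degree-2 polynomial, that inverse is just the quadratic formula. The sketch below uses the rounded coefficients that this section prints for the winning model fbt2, so its result can only agree approximately with a computation on the unrounded model; the variable names are hypothetical.

```python
import math

# Rounded coefficients of the degree-2 model a*x**2 + b*x + c, as printed for fbt2.
a, b, c = 0.086, -94.02, 2.744e4
target = 100000  # requests per hour at which capacity is reached

# Solve a*x**2 + b*x + (c - target) = 0 with the quadratic formula.
disc = b * b - 4 * a * (c - target)
roots = [(-b - math.sqrt(disc)) / (2 * a), (-b + math.sqrt(disc)) / (2 * a)]

# The smaller root is negative (before the data starts); the larger one is the future crossing.
hours_until_target = max(roots)
weeks_until_target = hours_until_target / (7 * 24)
print("100,000 hits/hour expected at week %f" % weeks_until_target)
```

This only works because a closed-form inverse exists for degree 2; other model functions need a numerical root finder.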

Of course, we would like to have an approach that is easily applicable to any model function.

This can be done by subtracting 100,000 from the polynomial, which results in another polynomial, and finding its root. SciPy's optimize module has the function fsolve that achieves this when provided with an initial starting position through the parameter x0. As every entry in our input data file corresponds to one hour, and we have 743 of them, we set the starting position to some value after that. Let fbt2 be the winning polynomial of degree 2.

    >>> fbt2 = sp.poly1d(sp.polyfit(xb[train], yb[train], 2))
    >>> print("fbt2(x)= \n%s" % fbt2)
    fbt2(x)=
           2
    0.086 x - 94.02 x + 2.744e+04
    >>> print("fbt2(x)-100,000= \n%s" % (fbt2-100000))
    fbt2(x)-100,000=
           2
    0.086 x - 94.02 x - 7.256e+04
    >>> from scipy.optimize import fsolve
    >>> reached_max = fsolve(fbt2-100000, x0=800)/(7*24)
    >>> print("100,000 hits/hour expected at week %f" % reached_max[0])

It is expected to have 100,000 hits/hour at week 9.616071, so our model tells us that, given the current user behavior and traction of our start-up, it will take another month until we have reached our capacity threshold.

Of course, there is a certain uncertainty involved with our prediction.

To get a real picture of it, one could draw in more sophisticated statistics to find out about the variance we have to expect when looking farther and farther into the future.

And then there are the user and underlying user behavior dynamics that we cannot model accurately. However, at this point, we are fine with the current predictions. After all, we can prepare all time-consuming actions now. If we then monitor our web traffic closely, we will see in time when we have to allocate new resources.

Summary

Congratulations! You just learned two important things, of which the most important one is that as a typical machine learning operator, you will spend most of your time understanding and refining the data, which is exactly what we just did in our first tiny machine learning example.

And we hope that this example helped you to start switching your mental focus from algorithms to data. Then you learned how important it is to have the correct experiment setup and that it is vital not to mix up training and testing.

Admittedly, the use of polynomial fitting is not the coolest thing in the machine learning world. We have chosen it so as not to distract you with the coolness of some shiny algorithm while conveying the two most important messages we just summarized.

So, let's move to the next chapter, in which we will dive deep into scikit-learn, the marvelous machine learning toolkit, give an overview of different types of learning, and show you the beauty of feature engineering.

Classifying with Real-world Examples

The topic of this chapter is classification.

