The L1 penalized model is often called the Lasso, while an L2 penalized one is known as Ridge Regression. When using both, we call this an ElasticNet model.

Both the Lasso and the Ridge result in smaller coefficients than unpenalized regression (smaller in absolute value, ignoring the sign). However, the Lasso has the additional property that it results in many coefficients being set to exactly zero! This means that the final model does not even use some of its input features; the model is sparse. This is often a very desirable property, as the model performs both feature selection and regression in a single step.

You will notice that whenever we add a penalty, we also add a weight α, which governs how much penalization we want. When α is close to zero, we are very close to unpenalized regression (in fact, if you set α to zero, you will simply perform OLS), and when α is large, we have a model that is very different from the unpenalized one.

The Ridge model is older, as the Lasso is hard to compute with pen and paper. However, with modern computers, we can use the Lasso as easily as Ridge, or even combine them to form ElasticNets.
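The sparsity property is easy to see on synthetic data. Here is a minimal sketch (our own example, not from the book) that compares how many coefficients OLS and the Lasso keep away from zero when only a handful of the features actually carry signal:

>>> import numpy as np
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import LinearRegression, Lasso
>>> # 200 examples, 50 features, only 5 of them informative
>>> X, y = make_regression(n_samples=200, n_features=50, n_informative=5, noise=1.0, random_state=0)
>>> ols = LinearRegression().fit(X, y)
>>> las = Lasso(alpha=1.0).fit(X, y)
>>> print('Nonzero OLS coefficients: {}'.format(np.sum(np.abs(ols.coef_) > 1e-8)))
>>> print('Nonzero Lasso coefficients: {}'.format(np.sum(np.abs(las.coef_) > 1e-8)))

With these settings, the Lasso typically sets most of the noise coefficients to exactly zero, while OLS assigns a small but nonzero weight to every one of the fifty features.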
An ElasticNet has two penalties, one for the absolute value and the other for the squares, and it solves the following equation:

\hat{b}^{*} = \arg\min_{b} \, \lVert y - Xb \rVert^{2} + \alpha_1 \sum_{i} \lvert b_i \rvert + \alpha_2 \sum_{i} b_i^{2}

This formula is a combination of the two previous ones, with two parameters, α1 and α2. Later in this chapter, we will discuss how to choose a good value for these parameters.

Using Lasso or ElasticNet in scikit-learn

Let's adapt the preceding example to use ElasticNets. Using scikit-learn, it is very easy to swap in the ElasticNet regressor for the least squares one that we had before:

>>> from sklearn.linear_model import ElasticNet, Lasso
>>> en = ElasticNet(alpha=0.5)

Now, we use en, whereas earlier we had used lr. This is the only change that is needed. The results are exactly what we would have expected. The training error increases to 5.0 (it was 4.6 before), but the cross-validation error decreases to 5.4 (it was 5.6 before).
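These numbers come from rerunning the earlier Boston cross-validation loop with en in place of lr. As a rough sketch of that comparison (assuming the Boston features and target are already loaded as x and y, as earlier in the chapter, and a recent scikit-learn where cross_val_predict replaces the explicit fold loop; the exact numbers depend on the version and the fold assignment):

>>> import numpy as np
>>> from sklearn.linear_model import ElasticNet
>>> from sklearn.model_selection import KFold, cross_val_predict
>>> from sklearn.metrics import mean_squared_error
>>> en = ElasticNet(alpha=0.5)
>>> # Training error: fit and predict on the same data
>>> en.fit(x, y)
>>> rmse_train = np.sqrt(mean_squared_error(y, en.predict(x)))
>>> # Generalization error: 5-fold cross-validated predictions
>>> pred = cross_val_predict(en, x, y, cv=KFold(n_splits=5))
>>> rmse_cv = np.sqrt(mean_squared_error(y, pred))
>>> print('RMSE on training: {:.2}'.format(rmse_train))
>>> print('RMSE on 5-fold CV: {:.2}'.format(rmse_cv))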
We trade a larger error on the training data in order to gain better generalization. We could have tried an L1 penalty using the Lasso class or L2 using the Ridge class with the same code.

Visualizing the Lasso path

Using scikit-learn, we can easily visualize what happens as the value of the regularization parameter (alpha) changes.
We will again use the Boston data, but now we will use the Lasso regression object:

>>> las = Lasso(normalize=1)
>>> alphas = np.logspace(-5, 2, 1000)
>>> alphas, coefs, _ = las.path(x, y, alphas=alphas)

For each value in alphas, the path method on the Lasso object returns the coefficients that solve the lasso problem with that parameter value. Because the result changes smoothly with alpha, this can be computed very efficiently.

A typical way to visualize this path is to plot the value of the coefficients as alpha decreases.
You can do so as follows:

>>> fig, ax = plt.subplots()
>>> ax.plot(alphas, coefs.T)
>>> # Set log scale
>>> ax.set_xscale('log')
>>> # Make alpha decrease from left to right
>>> ax.set_xlim(alphas.max(), alphas.min())

This results in the following plot (we left out the trivial code that adds axis labels and the title):

[Figure: Lasso path on the Boston data, coefficient values plotted against decreasing alpha]

In this plot, the x axis shows decreasing amounts of regularization from left to right (alpha is decreasing).
Each line shows how a different coefficient varies as alpha changes. The plot shows that when using very strong regularization (left side, very high alpha), the best solution is to have all values be exactly zero. As the regularization becomes weaker, one by one, the values of the different coefficients first shoot up, then stabilize. At some point, they all plateau, as we are probably already close to the unpenalized solution.

P-greater-than-N scenarios

The title of this section is a bit of inside jargon, which you will learn now. Starting in the 1990s, first in the biomedical domain and then on the Web, problems started to appear where P was greater than N.
What this means is that the number of features, P, was greater than the number of examples, N (these letters were the conventional statistical shorthand for these concepts). These became known as P-greater-than-N problems.

For example, if your input is a set of written documents, a simple way to approach it is to consider each possible word in the dictionary as a feature and regress on those (we will later work on one such problem ourselves).
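To make this concrete, here is a tiny sketch (our own toy example, not from the book) of how a handful of texts becomes a feature matrix with one column per distinct word:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> docs = ['the cat sat on the mat', 'the dog chased the cat', 'the mat was red']
>>> vect = CountVectorizer()
>>> X = vect.fit_transform(docs)
>>> print('Examples x features: {}'.format(X.shape))

Even with three toy documents, we already get one feature per distinct word; with real documents and a real dictionary, the number of columns quickly outgrows the number of rows.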
In the English language, you have over 20,000 words (this is if you perform some stemming and only consider common words; it is more than ten times that if you skip this preprocessing step). If you only have a few hundred or a few thousand examples, you will have more features than examples.

In this case, as the number of features is greater than the number of examples, it is possible to have a perfect fit on the training data. This is a mathematical fact, which is independent of your data.
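You can check this on purely random data, where there is nothing to learn at all. The following sketch (our own, not from the book) fits OLS to 50 examples described by 200 random features:

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.metrics import mean_squared_error
>>> rng = np.random.RandomState(0)
>>> X, y = rng.randn(50, 200), rng.randn(50)   # P = 200 features, N = 50 examples, no real signal
>>> lr = LinearRegression().fit(X, y)
>>> print('RMSE on training: {:.2}'.format(np.sqrt(mean_squared_error(y, lr.predict(X)))))
>>> X_new, y_new = rng.randn(50, 200), rng.randn(50)   # fresh data drawn the same way
>>> print('RMSE on new data: {:.2}'.format(np.sqrt(mean_squared_error(y_new, lr.predict(X_new)))))

The training RMSE comes out as essentially zero (up to rounding), while the error on the fresh data is far from it: the model has simply memorized noise.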
You are, in effect, solving a system of linear equations with fewer equations than variables. You can find a set of regression coefficients with zero training error (in fact, you can find more than one perfect solution, infinitely many).

However, and this is a major problem, zero training error does not mean that your solution will generalize well. In fact, it may generalize very poorly. Whereas earlier regularization could give you a little extra boost, it is now absolutely required for a meaningful result.

An example based on text documents

We will now turn to an example that comes from a study performed at Carnegie Mellon University by Prof. Noah Smith's research group.
The study was based on mining the so-called 10-K reports that companies file with the Securities and Exchange Commission (SEC) in the United States. This filing is mandated by law for all publicly traded companies. The goal of their study was to predict, based on this piece of public information, what the future volatility of the company's stock will be. In the training data, we are actually using historical data for which we already know what happened.

There are 16,087 examples available. The features, which have already been preprocessed for us, correspond to different words, 150,360 in total.
Thus, we have many more features than examples, almost ten times as many. In the introduction, it was stated that ordinary least squares regression fails in these cases, and we will now see why by attempting to blindly apply it.

The dataset is available in SVMLight format from multiple sources, including the book's companion website.
This is a format that scikit-learn can read. SVMLight is, as the name says, a support vector machine implementation, which is also available through scikit-learn; right now, we are only interested in the file format:

>>> from sklearn.datasets import load_svmlight_file
>>> data, target = load_svmlight_file('E2006.train')

In the preceding code, data is a sparse matrix (that is, most of its entries are zeros and, therefore, only the nonzero entries are saved in memory), while the target is a simple one-dimensional vector.
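A quick sanity check on what was loaded (a small sketch of our own; the shape should match the 16,087 examples and 150,360 features mentioned above):

>>> print('Shape of data: {}'.format(data.shape))
>>> density = data.nnz / float(data.shape[0] * data.shape[1])
>>> print('Fraction of nonzero entries: {:.4f}'.format(density))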
We can start by looking at some attributes of the target:

>>> print('Min target value: {}'.format(target.min()))
Min target value: -7.89957807347
>>> print('Max target value: {}'.format(target.max()))
Max target value: -0.51940952694
>>> print('Mean target value: {}'.format(target.mean()))
Mean target value: -3.51405313669
>>> print('Std. dev. target: {}'.format(target.std()))
Std. dev. target: 0.632278353911

So, we can see that the data lies between -7.9 and -0.5. Now that we have a feel for the data, we can check what happens when we use OLS to predict. Note that we can use exactly the same classes and methods as we did earlier:

>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> lr.fit(data, target)
>>> pred = lr.predict(data)
>>> rmse_train = np.sqrt(mean_squared_error(target, pred))
>>> print('RMSE on training: {:.2}'.format(rmse_train))
RMSE on training: 0.0025
>>> print('R2 on training: {:.2}'.format(r2_score(target, pred)))
R2 on training: 1.0

The root mean squared error is not exactly zero because of rounding errors, but it is very close.
The coefficient of determination is 1.0. That is, the linear model is reporting a perfect prediction on its training data.

When we use cross-validation (the code is very similar to what we used earlier in the Boston example), we get something very different: an RMSE of 0.75, which corresponds to a negative coefficient of determination of -0.42. This means that if we always "predict" the mean value of -3.5, we do better than when using the regression model!

Training and generalization error

When the number of features is greater than the number of examples, you always get zero training error with OLS, except perhaps for issues due to rounding off. However, this is rarely a sign that your model will do well in terms of generalization.
In fact, you may get zero training error and have a completely useless model.

The natural solution is to use regularization to counteract the overfitting. We can try the same cross-validation loop with an ElasticNet learner, having set the penalty parameter to 0.1:

>>> from sklearn.linear_model import ElasticNet
>>> met = ElasticNet(alpha=0.1)
>>> kf = KFold(len(target), n_folds=5)
>>> pred = np.zeros_like(target)
>>> for train, test in kf:
...     met.fit(data[train], target[train])
...     pred[test] = met.predict(data[test])

>>> # Compute RMSE
>>> rmse = np.sqrt(mean_squared_error(target, pred))
>>> print('[EN 0.1] RMSE on testing (5 fold): {:.2}'.format(rmse))
[EN 0.1] RMSE on testing (5 fold): 0.4
>>> # Compute Coefficient of determination
>>> r2 = r2_score(target, pred)
>>> print('[EN 0.1] R2 on testing (5 fold): {:.2}'.format(r2))
[EN 0.1] R2 on testing (5 fold): 0.61

Now, we get an RMSE of 0.4 and an R2 of 0.61, much better than just predicting the mean.
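As an aside (a sketch for newer scikit-learn releases, not part of the book's listing): the KFold constructor and the manual prediction loop above look a little different in recent versions, and the same computation could be written roughly as follows, still assuming data and target from above:

>>> from sklearn.model_selection import KFold, cross_val_predict
>>> met = ElasticNet(alpha=0.1)
>>> pred = cross_val_predict(met, data, target, cv=KFold(n_splits=5))
>>> rmse = np.sqrt(mean_squared_error(target, pred))
>>> r2 = r2_score(target, pred)
>>> print('[EN 0.1] RMSE on testing (5 fold): {:.2}'.format(rmse))
>>> print('[EN 0.1] R2 on testing (5 fold): {:.2}'.format(r2))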