The L1 penalized model is often called the Lasso, while an L2 penalized one is known as Ridge Regression. When using both, we call this an ElasticNet model.

Both the Lasso and the Ridge result in smaller coefficients than unpenalized regression (smaller in absolute value, ignoring the sign). However, the Lasso has the additional property that it results in many coefficients being set to exactly zero! This means that the final model does not even use some of its input features; the model is sparse. This is often a very desirable property, as the model performs both feature selection and regression in a single step.

You will notice that whenever we add a penalty, we also add a weight α, which governs how much penalization we want. When α is close to zero, we are very close to unpenalized regression (in fact, if you set α to zero, you will simply perform OLS), and when α is large, we have a model that is very different from the unpenalized one.

The Ridge model is older, as the Lasso is hard to compute with pen and paper. However, with modern computers, we can use the Lasso as easily as Ridge, or even combine them to form ElasticNets.
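The sparsity property is easy to see on synthetic data. Here is a minimal sketch (our own example, not from the book) that compares how many coefficients OLS and the Lasso keep away from zero when only a handful of the features actually carry signal:

>>> import numpy as np
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import LinearRegression, Lasso
>>> # 200 examples, 50 features, only 5 of them informative
>>> X, y = make_regression(n_samples=200, n_features=50, n_informative=5, noise=1.0, random_state=0)
>>> ols = LinearRegression().fit(X, y)
>>> las = Lasso(alpha=1.0).fit(X, y)
>>> print('Nonzero OLS coefficients: {}'.format(np.sum(np.abs(ols.coef_) > 1e-8)))
>>> print('Nonzero Lasso coefficients: {}'.format(np.sum(np.abs(las.coef_) > 1e-8)))

With these settings, the Lasso typically sets most of the noise coefficients to exactly zero, while OLS assigns a small but nonzero weight to every one of the fifty features.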
An ElasticNet has two penalties, one for the absolute value and the other for the squares, and it solves the following equation:

\hat{b}^{*} = \arg\min_{b} \, \lVert y - Xb \rVert^{2} + \alpha_1 \sum_{i} \lvert b_i \rvert + \alpha_2 \sum_{i} b_i^{2}

This formula is a combination of the two previous ones, with two parameters, α1 and α2. Later in this chapter, we will discuss how to choose a good value for these parameters.

Using Lasso or ElasticNet in scikit-learn

Let's adapt the preceding example to use ElasticNets. Using scikit-learn, it is very easy to swap in the ElasticNet regressor for the least squares one that we had before:

>>> from sklearn.linear_model import ElasticNet, Lasso
>>> en = ElasticNet(alpha=0.5)

Now, we use en, whereas earlier we had used lr. This is the only change that is needed. The results are exactly what we would have expected. The training error increases to 5.0 (it was 4.6 before), but the cross-validation error decreases to 5.4 (it was 5.6 before).
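These numbers come from rerunning the earlier Boston cross-validation loop with en in place of lr. As a rough sketch of that comparison (assuming the Boston features and target are already loaded as x and y, as earlier in the chapter, and a recent scikit-learn where cross_val_predict replaces the explicit fold loop; the exact numbers depend on the version and the fold assignment):

>>> import numpy as np
>>> from sklearn.linear_model import ElasticNet
>>> from sklearn.model_selection import KFold, cross_val_predict
>>> from sklearn.metrics import mean_squared_error
>>> en = ElasticNet(alpha=0.5)
>>> # Training error: fit and predict on the same data
>>> en.fit(x, y)
>>> rmse_train = np.sqrt(mean_squared_error(y, en.predict(x)))
>>> # Generalization error: 5-fold cross-validated predictions
>>> pred = cross_val_predict(en, x, y, cv=KFold(n_splits=5))
>>> rmse_cv = np.sqrt(mean_squared_error(y, pred))
>>> print('RMSE on training: {:.2}'.format(rmse_train))
>>> print('RMSE on 5-fold CV: {:.2}'.format(rmse_cv))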
We trade a larger error on the training data in order to gain better generalization. We could have tried an L1 penalty using the Lasso class or L2 using the Ridge class with the same code.

Visualizing the Lasso path

Using scikit-learn, we can easily visualize what happens as the value of the regularization parameter (alpha) changes.
We will again use the Boston data, but now we will use the Lasso regression object:

>>> las = Lasso(normalize=1)
>>> alphas = np.logspace(-5, 2, 1000)
>>> alphas, coefs, _ = las.path(x, y, alphas=alphas)

For each value in alphas, the path method on the Lasso object returns the coefficients that solve the lasso problem with that parameter value. Because the result changes smoothly with alpha, this can be computed very efficiently.

A typical way to visualize this path is to plot the value of the coefficients as alpha decreases.
You can do so as follows:

>>> fig, ax = plt.subplots()
>>> ax.plot(alphas, coefs.T)
>>> # Set log scale
>>> ax.set_xscale('log')
>>> # Make alpha decrease from left to right
>>> ax.set_xlim(alphas.max(), alphas.min())

This results in the following plot (we left out the trivial code that adds axis labels and the title):

[Figure: Lasso path on the Boston data, coefficient values plotted against decreasing alpha]

In this plot, the x axis shows decreasing amounts of regularization from left to right (alpha is decreasing).
Each line shows how a different coefficient varies as alpha changes. The plot shows that when using very strong regularization (left side, very high alpha), the best solution is to have all values be exactly zero. As the regularization becomes weaker, one by one, the values of the different coefficients first shoot up, then stabilize. At some point, they all plateau, as we are probably already close to the unpenalized solution.

P-greater-than-N scenarios

The title of this section is a bit of inside jargon, which you will learn now. Starting in the 1990s, first in the biomedical domain and then on the Web, problems started to appear where P was greater than N.
What this means is that the number of features, P, was greater than the number of examples, N (these letters were the conventional statistical shorthand for these concepts). These became known as P-greater-than-N problems.

For example, if your input is a set of written documents, a simple way to approach it is to consider each possible word in the dictionary as a feature and regress on those (we will later work on one such problem ourselves).
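To make this concrete, here is a tiny sketch (our own toy example, not from the book) of how a handful of texts becomes a feature matrix with one column per distinct word:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> docs = ['the cat sat on the mat', 'the dog chased the cat', 'the mat was red']
>>> vect = CountVectorizer()
>>> X = vect.fit_transform(docs)
>>> print('Examples x features: {}'.format(X.shape))

Even with three toy documents, we already get one feature per distinct word; with real documents and a real dictionary, the number of columns quickly outgrows the number of rows.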
In the English language, you have over 20,000 words (this is if you perform some stemming and only consider common words; it is more than ten times that if you skip this preprocessing step). If you only have a few hundred or a few thousand examples, you will have more features than examples.

In this case, as the number of features is greater than the number of examples, it is possible to have a perfect fit on the training data. This is a mathematical fact, which is independent of your data.
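You can check this on purely random data, where there is nothing to learn at all. The following sketch (our own, not from the book) fits OLS to 50 examples described by 200 random features:

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.metrics import mean_squared_error
>>> rng = np.random.RandomState(0)
>>> X, y = rng.randn(50, 200), rng.randn(50)   # P = 200 features, N = 50 examples, no real signal
>>> lr = LinearRegression().fit(X, y)
>>> print('RMSE on training: {:.2}'.format(np.sqrt(mean_squared_error(y, lr.predict(X)))))
>>> X_new, y_new = rng.randn(50, 200), rng.randn(50)   # fresh data drawn the same way
>>> print('RMSE on new data: {:.2}'.format(np.sqrt(mean_squared_error(y_new, lr.predict(X_new)))))

The training RMSE comes out as essentially zero (up to rounding), while the error on the fresh data is far from it: the model has simply memorized noise.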
You are, in effect, solving a system of linear equations with fewer equations than variables. You can find a set of regression coefficients with zero training error (in fact, you can find more than one perfect solution, infinitely many).

However, and this is a major problem, zero training error does not mean that your solution will generalize well. In fact, it may generalize very poorly. Whereas earlier regularization could give you a little extra boost, it is now absolutely required for a meaningful result.

An example based on text documents

We will now turn to an example that comes from a study performed at Carnegie Mellon University by Prof. Noah Smith's research group.
The study was based on mining the so-called 10-K reports that companies file with the Securities and Exchange Commission (SEC) in the United States. This filing is mandated by law for all publicly traded companies. The goal of their study was to predict, based on this piece of public information, what the future volatility of the company's stock will be. In the training data, we are actually using historical data for which we already know what happened.

There are 16,087 examples available. The features, which have already been preprocessed for us, correspond to different words, 150,360 in total.
Thus, we have many more features than examples, almost ten times as many. In the introduction, it was stated that ordinary least squares regression fails in these cases, and we will now see why by attempting to blindly apply it.

The dataset is available in SVMLight format from multiple sources, including the book's companion website.
This is a format that scikit-learn can read. SVMLight is, as the name says, a support vector machine implementation, which is also available through scikit-learn; right now, we are only interested in the file format:

>>> from sklearn.datasets import load_svmlight_file
>>> data, target = load_svmlight_file('E2006.train')

In the preceding code, data is a sparse matrix (that is, most of its entries are zeros and, therefore, only the nonzero entries are saved in memory), while the target is a simple one-dimensional vector.
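A quick sanity check on what was loaded (a small sketch of our own; the shape should match the 16,087 examples and 150,360 features mentioned above):

>>> print('Shape of data: {}'.format(data.shape))
>>> density = data.nnz / float(data.shape[0] * data.shape[1])
>>> print('Fraction of nonzero entries: {:.4f}'.format(density))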
We can start by looking at some attributes of the target:

>>> print('Min target value: {}'.format(target.min()))
Min target value: -7.89957807347
>>> print('Max target value: {}'.format(target.max()))
Max target value: -0.51940952694
>>> print('Mean target value: {}'.format(target.mean()))
Mean target value: -3.51405313669
>>> print('Std. dev. target: {}'.format(target.std()))
Std. dev. target: 0.632278353911

So, we can see that the data lies between -7.9 and -0.5. Now that we have a feel for the data, we can check what happens when we use OLS to predict. Note that we can use exactly the same classes and methods as we did earlier:

>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> lr.fit(data, target)
>>> pred = lr.predict(data)
>>> rmse_train = np.sqrt(mean_squared_error(target, pred))
>>> print('RMSE on training: {:.2}'.format(rmse_train))
RMSE on training: 0.0025
>>> print('R2 on training: {:.2}'.format(r2_score(target, pred)))
R2 on training: 1.0

The root mean squared error is not exactly zero because of rounding errors, but it is very close.
The coefficient of determination is 1.0. That is, the linear model is reporting a perfect prediction on its training data.

When we use cross-validation (the code is very similar to what we used earlier in the Boston example), we get something very different: an RMSE of 0.75, which corresponds to a negative coefficient of determination of -0.42. This means that if we always "predict" the mean value of -3.5, we do better than when using the regression model!

Training and generalization error

When the number of features is greater than the number of examples, you always get zero training error with OLS, except perhaps for issues due to rounding off. However, this is rarely a sign that your model will do well in terms of generalization.
In fact, you may get zero training error and have a completely useless model.

The natural solution is to use regularization to counteract the overfitting. We can try the same cross-validation loop with an ElasticNet learner, having set the penalty parameter to 0.1:

>>> from sklearn.linear_model import ElasticNet
>>> met = ElasticNet(alpha=0.1)
>>> kf = KFold(len(target), n_folds=5)
>>> pred = np.zeros_like(target)
>>> for train, test in kf:
...     met.fit(data[train], target[train])
...     pred[test] = met.predict(data[test])

>>> # Compute RMSE
>>> rmse = np.sqrt(mean_squared_error(target, pred))
>>> print('[EN 0.1] RMSE on testing (5 fold): {:.2}'.format(rmse))
[EN 0.1] RMSE on testing (5 fold): 0.4
>>> # Compute Coefficient of determination
>>> r2 = r2_score(target, pred)
>>> print('[EN 0.1] R2 on testing (5 fold): {:.2}'.format(r2))
[EN 0.1] R2 on testing (5 fold): 0.61

Now, we get an RMSE of 0.4 and an R2 of 0.61, much better than just predicting the mean.
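As an aside (a sketch for newer scikit-learn releases, not part of the book's listing): the KFold constructor and the manual prediction loop above look a little different in recent versions, and the same computation could be written roughly as follows, still assuming data and target from above:

>>> from sklearn.model_selection import KFold, cross_val_predict
>>> met = ElasticNet(alpha=0.1)
>>> pred = cross_val_predict(met, data, target, cv=KFold(n_splits=5))
>>> rmse = np.sqrt(mean_squared_error(target, pred))
>>> r2 = r2_score(target, pred)
>>> print('[EN 0.1] RMSE on testing (5 fold): {:.2}'.format(rmse))
>>> print('[EN 0.1] R2 on testing (5 fold): {:.2}'.format(r2))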