This regularity in the API is one of the nicer features of scikit-learn.

The preceding graph shows all the points (as dots) and our fit (the solid line). We can see that, visually, it looks good, except for a few outliers. Ideally, though, we would like to measure how good a fit this is quantitatively. This will be critical in order to be able to compare alternative methods.
To do so, we can measure how close our prediction is to the true values. For this task, we can use the mean_squared_error function from the sklearn.metrics module:

>>> from sklearn.metrics import mean_squared_error

This function takes two arguments, the true values and the predictions, as follows:

>>> mse = mean_squared_error(y, lr.predict(x))
>>> print("Mean squared error (of training data): {:.3}".format(mse))
Mean squared error (of training data): 58.4

This value can sometimes be hard to interpret, and it is better to take its square root to obtain the root mean square error (RMSE):

>>> rmse = np.sqrt(mse)
>>> print("RMSE (of training data): {:.3}".format(rmse))
RMSE (of training data): 6.6

One advantage of using RMSE is that we can quickly obtain a very rough estimate of the error by multiplying it by two.
In our case, we can expect the estimated price to differ from the real price by, at most, about 13 thousand dollars.

Root mean squared error and prediction
Root mean squared error corresponds approximately to an estimate of the standard deviation of the errors. Since most data points lie within two standard deviations of the mean, we can double our RMSE to obtain a rough confidence interval. This is only completely valid if the errors are normally distributed, but it is often roughly correct even if they are not.

A number such as 6.6 is still hard to immediately intuit. Is this a good prediction? One possible way to answer this question is to compare it with the simplest baseline, the constant model.
If we knew nothing of the input, the best we could do is predict that the output will always be the average value of y. We can then compare the mean squared error of our regression model with the mean squared error of this null model. This idea is formalized in the coefficient of determination, which is defined as follows:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

In this formula, $y_i$ represents the value of the element with index i, while $\hat{y}_i$ is the estimate for the same element obtained by the regression model. Finally, $\bar{y}$ is the mean value of y, which represents the null model that always returns the same value. This is roughly the same as first computing the ratio of the mean squared error to the variance of the output and then taking one minus this ratio.
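To make this relationship concrete, we can compute the score by hand. The following is a small sketch, assuming the lr, x, and y objects from the fit above:

>>> import numpy as np
>>> # Mean squared error of the regression model on the training data:
>>> mse = mean_squared_error(y, lr.predict(x))
>>> # The variance of the output is the mean squared error of the null model:
>>> ratio = mse / np.var(y)
>>> r2_manual = 1.0 - ratio

The value of r2_manual should agree with the score returned by scikit-learn below.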
This way, a perfect model obtains a score of one, while the null model obtains a score of zero. Note that it is possible to obtain a negative score, which means that the model is so poor that one is better off using the mean as the prediction.

The coefficient of determination can be obtained using the r2_score function of the sklearn.metrics module:

>>> from sklearn.metrics import r2_score
>>> r2 = r2_score(y, lr.predict(x))
>>> print("R2 (on training data): {:.2}".format(r2))
R2 (on training data): 0.31

This measure is also called the R² score. If you are using linear regression and evaluating the error on the training data, then it does correspond to the square of the correlation coefficient, R. However, this measure is more general and, as we discussed, may even return a negative value.

An alternative way to compute the coefficient of determination is to use the score method of the LinearRegression object:

>>> r2 = lr.score(x, y)

Multidimensional regression
So far, we have only used a single variable for prediction: the number of rooms per dwelling.
We will now use all the data we have to fit a model, using multidimensional regression. We now try to predict a single output (the average house price) based on multiple inputs.

The code looks very much like before. In fact, it is even simpler, as we can now pass the value of boston.data directly to the fit method:

>>> x = boston.data
>>> y = boston.target
>>> lr.fit(x, y)

Using all the input variables, the root mean squared error is only 4.7, which corresponds to a coefficient of determination of 0.74.
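These numbers can be reproduced with the same metrics we used earlier. The following is a minimal sketch, assuming the multidimensional fit above:

>>> mse = mean_squared_error(y, lr.predict(x))
>>> rmse = np.sqrt(mse)
>>> print("RMSE (of training data): {:.2}".format(rmse))
RMSE (of training data): 4.7
>>> print("R2 (on training data): {:.2}".format(lr.score(x, y)))
R2 (on training data): 0.74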
This is better than what we had before, which indicates that the extra variables did help. We can no longer display the regression line as easily as before, because instead of a single line we now have a regression hyperplane in a 14-dimensional space.

We can, however, plot the prediction versus the actual value. The code is as follows:

>>> p = lr.predict(x)
>>> plt.scatter(p, y)
>>> plt.xlabel('Predicted price')
>>> plt.ylabel('Actual price')
>>> plt.plot([y.min(), y.max()], [y.min(), y.max()])

The last line plots a diagonal line that corresponds to perfect agreement. This aids with visualization.
The results are shown in the following plot, where the solid line shows the diagonal (where all the points would lie if there were perfect agreement between the prediction and the underlying value):

Cross-validation for regression
If you remember when we first introduced classification, we stressed the importance of cross-validation for checking the quality of our predictions.
In regression, this is not always done. In fact, so far in this chapter we have discussed only the training error. This is a mistake if you want to confidently infer the generalization ability. Since ordinary least squares is a very simple model, this is often not a very serious mistake; in other words, the amount of overfitting is slight. However, we should still test this empirically, which we can easily do with scikit-learn.

We will use the KFold class to build a 5-fold cross-validation loop and test the generalization ability of linear regression:

>>> from sklearn.cross_validation import KFold
>>> kf = KFold(len(x), n_folds=5)
>>> p = np.zeros_like(y)
>>> for train, test in kf:
...     lr.fit(x[train], y[train])
...     p[test] = lr.predict(x[test])
>>> rmse_cv = np.sqrt(mean_squared_error(p, y))
>>> print('RMSE on 5-fold CV: {:.2}'.format(rmse_cv))
RMSE on 5-fold CV: 5.6

With cross-validation, we obtain a more conservative estimate (that is, the error is larger): 5.6.
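Note that in more recent scikit-learn releases, the cross_validation module has been replaced by model_selection, where KFold takes only the number of splits and exposes a split method. A roughly equivalent sketch under that newer API (not the code used in this book) looks like this:

>>> from sklearn.model_selection import KFold
>>> kf = KFold(n_splits=5)
>>> p = np.zeros_like(y)
>>> for train, test in kf.split(x):
...     lr.fit(x[train], y[train])
...     p[test] = lr.predict(x[test])
>>> rmse_cv = np.sqrt(mean_squared_error(p, y))

The convenience function sklearn.model_selection.cross_val_predict can produce the same kind of out-of-fold predictions in a single call.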
As in the case of classification, the cross-validation estimate is a better indication of how well we could generalize to predict on unseen data.

Ordinary least squares is fast at learning time and returns a simple model, which is fast at prediction time.
For these reasons, it should often be the first model that you try in a regression problem. However, we are now going to see more advanced methods and why they are sometimes preferable.

Penalized or regularized regression
This section introduces penalized regression, also called regularized regression, an important class of regression models.

In ordinary regression, the returned fit is the best fit on the training data. This can lead to overfitting. Penalizing means that we add a penalty for over-confidence in the parameter values; thus, we accept a slightly worse fit in order to have a simpler model.

Another way to think about it is to consider that the default assumption is that there is no relationship between the input variables and the output prediction.
When we havedata, we change this opinion, but adding a penalty means that we require more datato convince us that this is a strong relationship.Penalized regression is about tradeoffsPenalized regression is another example of the bias-variance tradeoff.When using a penalty, we get a worse fit in the training data, as weare adding bias. On the other hand, we reduce the variance and tendto avoid over-fitting. Therefore, the overall result might generalizebetter to unseen (test) data.[ 163 ]RegressionL1 and L2 penaltiesWe now explore these ideas in detail. Readers who do not care about some of themathematical aspects should feel free to skip directly to the next section on how touse regularized regression in scikit-learn.The problem, in general, is that we are given a matrix X of training data (rows areobservations and each column is a different feature), and a vector y of output values.The goal is to obtain a vector of weights, which we will call b*.
Ordinary least squares regression is given by the following formula:

$$b^{*} = \arg\min_{b} \; \| y - Xb \|^2$$

That is, we find the vector b that minimizes the squared distance to the target y. In these equations, we ignore the issue of setting an intercept by assuming that the training data has been preprocessed so that the mean of y is zero.

Adding a penalty, or a regularization, means that we do not simply consider the best fit on the training data, but also how the vector b is composed. There are two types of penalties that are typically used for regression: L1 and L2 penalties. An L1 penalty means that we penalize the regression by the sum of the absolute values of the coefficients, while an L2 penalty penalizes by the sum of their squares.

When we add an L1 penalty, instead of the preceding equation, we optimize the following:

$$b^{*} = \arg\min_{b} \; \| y - Xb \|^2 + \lambda \sum_i |b_i|$$

Here, λ is a parameter that controls the strength of the penalty; we are trying to simultaneously make the error small, but also make the values of the coefficients small (in absolute terms).
Using an L2 penalty means that we use the following formula:

$$b^{*} = \arg\min_{b} \; \| y - Xb \|^2 + \lambda \sum_i b_i^2$$

The difference is rather subtle: we now penalize by the squares of the coefficients rather than by their absolute values. However, the difference in the results is dramatic.

Ridge, Lasso, and ElasticNets
These penalized models often go by rather interesting names.