There is one problem with this solution, though, which is the choice of alpha. When using the default value (1.0), the result is very different (and worse). In this case, we cheated: the author had previously tried a few values to see which ones would give a good result. This is not effective and can lead to overestimates of confidence (we looked at the test data to decide which parameter values to use, which we should never do). The next section explains how to do it properly and how this is supported by scikit-learn.

Setting hyperparameters in a principled way

In the preceding example, we set the penalty parameter to 0.1.
We could just as well have set it to 0.7 or 23.9. Naturally, the results vary each time. If we pick an overly large value, we get underfitting. In the extreme case, the learning system will just return every coefficient equal to zero. If we pick a value that is too small, we are very close to OLS, which overfits and generalizes poorly (as we saw earlier). How do we choose a good value? This is a general problem in machine learning: setting parameters for our learning methods. A generic solution is to use cross-validation.
We pick a set of possible values, and then use cross-validation to choose which one is best. This performs more computation (five times more if we use five folds), but is always applicable and unbiased. We must be careful, though. In order to obtain an estimate of generalization, we have to use two levels of cross-validation: one level is to estimate the generalization, while the second level is to get good parameters. That is, we split the data into, for example, five folds.
We start by holding out the first fold and will learn on the other four. Now, we split these again into five folds in order to choose the parameters. Once we have set our parameters, we test on the first fold. Now, we repeat this four other times:

[Figure: breaking a single outer training fold into five inner subfolds for parameter selection]

The preceding figure shows how you break up a single training fold into subfolds. We would need to repeat it for all the other folds. In this case, we are looking at five outer folds and five inner folds, but there is no reason to use the same number of outer and inner folds; you can use any number you want as long as you keep the folds separate. This leads to a lot of computation, but it is necessary in order to do things correctly. The problem is that if you use a piece of data to make any decisions about your model (including which parameters to set), you have contaminated it and you can no longer use it to test the generalization ability of your model.
This is a subtle point and it may not be immediately obvious. In fact, it is still the case that many users of machine learning get this wrong and overestimate how well their systems are doing, because they do not perform cross-validation correctly! Fortunately, scikit-learn makes it very easy to do the right thing; it provides classes named LassoCV, RidgeCV, and ElasticNetCV, all of which encapsulate an inner cross-validation loop to optimize for the necessary parameter.
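To make concrete what these classes automate, here is a minimal, hypothetical sketch of the two-level scheme written out by hand. It is not the book's own code: it uses synthetic data from make_regression in place of the chapter's data and target, a small list of candidate alpha values chosen only for illustration, and the current sklearn.model_selection API (which differs slightly from the KFold calls used elsewhere in this chapter):

>>> import numpy as np
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import Lasso
>>> from sklearn.metrics import r2_score
>>> from sklearn.model_selection import KFold, cross_val_score
>>> # Synthetic stand-ins for the chapter's data and target arrays
>>> data, target = make_regression(n_samples=200, n_features=50,
...                                noise=10.0, random_state=0)
>>> alphas = [0.1, 0.5, 1.0, 5.0]   # candidate penalty values (illustrative)
>>> outer = KFold(n_splits=5, shuffle=True, random_state=0)
>>> p = np.zeros_like(target)
>>> for train, test in outer.split(data):
...     # Inner level: score each candidate alpha using only the training part
...     inner = [cross_val_score(Lasso(alpha=a), data[train], target[train],
...                              cv=5).mean() for a in alphas]
...     best_alpha = alphas[int(np.argmax(inner))]
...     # Outer level: refit with the chosen alpha, predict the held-out fold
...     model = Lasso(alpha=best_alpha).fit(data[train], target[train])
...     p[test] = model.predict(data[test])
>>> print("Nested CV R2: {:.2}".format(r2_score(target, p)))

The important property is that the fold being predicted never influences the alpha chosen for it; this is what LassoCV and its siblings automate internally.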
The code is almost exactly like the code we used earlier in this chapter, except that we do not need to specify any value for alpha:

>>> from sklearn.linear_model import ElasticNetCV
>>> met = ElasticNetCV()
>>> kf = KFold(len(target), n_folds=5)
>>> p = np.zeros_like(target)
>>> for train, test in kf:
...     met.fit(data[train], target[train])
...     p[test] = met.predict(data[test])
>>> r2_cv = r2_score(target, p)
>>> print("R2 ElasticNetCV: {:.2}".format(r2_cv))
R2 ElasticNetCV: 0.65

This results in a lot of computation, so you may want to get some coffee while you are waiting (depending on how fast your computer is). You might get better performance by taking advantage of multiple processors.
This is a built-in feature of scikit-learn, which can be accessed quite trivially by using the n_jobs parameter to the ElasticNetCV constructor. To use four CPUs, make use of the following code:

>>> met = ElasticNetCV(n_jobs=4)

Set the n_jobs parameter to -1 to use all the available CPUs:

>>> met = ElasticNetCV(n_jobs=-1)

You may have wondered why, if ElasticNets have two penalties, the L1 and the L2 penalty, we only need to set a single value for alpha. In fact, the two values are specified separately, via alpha and the l1_ratio variable (that is spelled ell-1-underscore-ratio). Then, α1 and α2 are set as follows (where ρ stands for l1_ratio):

α1 = ρα
α2 = (1 − ρ)α

In an intuitive sense, alpha sets the overall amount of regularization while l1_ratio sets the tradeoff between the different types of regularization, L1 and L2. We can request that the ElasticNetCV object test different values of l1_ratio, as is shown in the following code:

>>> l1_ratio = [.01, .05, .25, .5, .75, .95, .99]
>>> met = ElasticNetCV(l1_ratio=l1_ratio, n_jobs=-1)

This set of l1_ratio values is recommended in the documentation. It will test models that are almost like Ridge (when l1_ratio is 0.01 or 0.05) as well as models that are almost like Lasso (when l1_ratio is 0.95 or 0.99). Thus, we explore a full range of different options.
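If you are curious which values the inner cross-validation loop ends up picking, the fitted ElasticNetCV object exposes them as attributes. This is a small illustrative check, not part of the book's pipeline; fitting on all the data here is only to inspect the chosen hyperparameters, not to estimate generalization:

>>> met = ElasticNetCV(l1_ratio=l1_ratio, n_jobs=-1)
>>> met.fit(data, target)
>>> # Hyperparameters selected by the internal cross-validation loop
>>> print("Chosen alpha: {:.3g}".format(met.alpha_))
>>> print("Chosen l1_ratio: {}".format(met.l1_ratio_))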
Because of its flexibility and the ability to use multiple CPUs, ElasticNetCV is an excellent default solution for regression problems when you don't have any particular reason to prefer one type of model over the rest. Putting all this together, we can now visualize the prediction versus real fit on this large dataset:

>>> l1_ratio = [.01, .05, .25, .5, .75, .95, .99]
>>> met = ElasticNetCV(l1_ratio=l1_ratio, n_jobs=-1)
>>> p = np.zeros_like(target)
>>> for train, test in kf:
...     met.fit(data[train], target[train])
...     p[test] = met.predict(data[test])
>>> plt.scatter(p, target)
>>> # Add diagonal line for reference
>>> # (represents perfect agreement)
>>> plt.plot([p.min(), p.max()], [p.min(), p.max()])

This results in the following plot:

[Figure: predicted versus actual values; the diagonal line marks perfect agreement]

We can see that the predictions do not match very well on the bottom end of the value range.
This is perhaps because there are many fewer elements on this end of the target range (which also implies that this affects only a small minority of datapoints). One last note: the approach of using an inner cross-validation loop to set a parameter is also available in scikit-learn using a grid search. In fact, we already used it in the previous chapter.
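As a hypothetical illustration of that grid-search route (using the current sklearn.model_selection API, which may differ from the version used in the previous chapter; data and target are assumed to be loaded as before, and the alpha grid is purely illustrative):

>>> from sklearn.linear_model import Ridge
>>> from sklearn.model_selection import GridSearchCV, cross_val_score
>>> # Inner loop: grid search over candidate penalty values
>>> grid = GridSearchCV(Ridge(),
...                     param_grid={'alpha': [0.1, 1.0, 10.0, 100.0]},
...                     cv=5)
>>> # Outer loop: cross-validate the whole search to estimate generalization
>>> scores = cross_val_score(grid, data, target, cv=5)
>>> print("Nested grid-search R2: {:.2}".format(scores.mean()))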
Summary

In this chapter, we started with the oldest trick in the book, ordinary least squares regression. Although centuries old, it is still often the best solution for regression. However, we also saw more modern approaches that avoid overfitting and can give us better results, especially when we have a large number of features. We used Ridge, Lasso, and ElasticNets; these are the state-of-the-art methods for regression. We saw, once again, the danger of relying on training error to estimate generalization: it can be an overly optimistic estimate to the point where our model has zero training error, but we know that it is completely useless.
When thinking through these issues, we were led into two-level cross-validation, an important point that many in the field still have not completely internalized. Throughout this chapter, we were able to rely on scikit-learn to support all the operations we wanted to perform, including an easy way to achieve correct cross-validation. ElasticNets with an inner cross-validation loop for parameter optimization (as implemented in scikit-learn by ElasticNetCV) should probably become your default method for regression. One reason to use an alternative is when you are interested in a sparse solution.
In this case, a pure Lasso solution is more appropriate, as it will set many coefficients to zero. It will also allow you to discover from the data a small number of variables that are important to the output. Knowing the identity of these may be interesting in and of itself, in addition to having a good regression model.
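As a hypothetical sketch of how you might inspect such a sparse solution (LassoCV plays the same role as ElasticNetCV, with an inner cross-validation loop to pick alpha; data and target are assumed to be loaded as before):

>>> import numpy as np
>>> from sklearn.linear_model import LassoCV
>>> las = LassoCV(n_jobs=-1)
>>> las.fit(data, target)
>>> # The L1 penalty drives many coefficients exactly to zero
>>> nonzero = np.flatnonzero(las.coef_)
>>> print("Kept {} of {} features".format(len(nonzero), len(las.coef_)))
>>> # The surviving indices identify the variables Lasso considers important
>>> print("Selected feature indices: {}".format(nonzero))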
In the next chapter, we will look at recommendations, another machine learning problem. Our first approach will be to use regression to predict consumer product ratings. We will then see alternative models to generate recommendations.

Recommendations

Recommendations have become one of the staples of online services and commerce. This type of automated system can provide each user with a personalized list of suggestions (be it a list of products to purchase, features to use, or new connections). In this chapter, we will see the basic ways in which automated recommendation generation systems work. The field of recommendation based on consumer inputs is often called collaborative filtering, as the users collaborate through the system to find the best items for each other. In the first part of this chapter, we will see how we can use past product ratings from consumers to predict new ratings. We start with a few ideas that are helpful and then combine all of them.
When combining, we use regression to learn the best way in which they can be combined. This will also allow us to explore a generic concept in machine learning: ensemble learning. In the second part of this chapter, we will take a look at a different way of learning recommendations: basket analysis. Unlike the case in which we have numeric ratings, in the basket analysis setting, all we have is information about the shopping baskets, that is, what items were bought together. The goal is to learn about recommendations. You have probably already seen features of the form "people who bought X also bought Y" in online shopping. We will develop a similar feature of our own.

Rating predictions and recommendations

If you have used any online shopping system in the last 10 years, you have probably seen these recommendations.
Some are like Amazon's "customers who bought X also bought Y". These will be dealt with later in the chapter, in the Basket analysis section. Other recommendations are based on predicting the rating of a product, such as a movie.

The problem of learning recommendations based on past product ratings was made famous by the Netflix Prize, a million-dollar machine learning public challenge by Netflix. Netflix (well known in the USA and the UK and in the process of international expansion) is a movie rental company. Traditionally, you would receive DVDs in the mail; more recently, Netflix has focused on the online streaming of movies and TV shows. From the start, one of the distinguishing features of the service was that it gave users the option to rate the films they had seen. Netflix then uses these ratings to recommend other films to its customers.