Black represents the absence of a rating and the gray levels represent the rating value. The code to visualize the data is very simple (you can adapt it to show a larger fraction of the matrix than is possible to show in this book), as shown in the following:

>>> from matplotlib import pyplot as plt
>>> # Build an instance of the object we defined above
>>> norm = NormalizePositive(axis=1)
>>> binary = (train > 0)
>>> train = norm.fit_transform(train)
>>> # plot just 200x200 area for space reasons
>>> plt.imshow(binary[:200, :200], interpolation='nearest')

The following screenshot is the output of this code:

[Screenshot: the top-left 200x200 corner of the binary rating matrix]

We can see that the matrix is sparse, as most of the squares are black. We can also see that some users rate a lot more movies than others and that some movies are the target of many more ratings than others.
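The NormalizePositive object used above is the per-user normalizer defined earlier in the chapter; its definition is not reproduced in this excerpt. As a rough, illustrative sketch only (assuming it merely centers the rated entries; the version in the book's repository may also rescale by a standard-deviation estimate), such a transformer might look like this:

import numpy as np

class NormalizePositive(object):
    '''Illustrative sketch: center the positive (rated) entries of a matrix
    along the given axis, leaving the zero (unrated) entries at zero.'''

    def __init__(self, axis=0):
        self.axis = axis

    def fit(self, features, y=None):
        if self.axis == 1:
            features = features.T
        # Count how many rated (non-zero) entries each column has
        binary = (features > 0)
        count = binary.sum(axis=0)
        # Avoid division by zero for columns with no ratings at all
        count[count == 0] = 1
        # Mean of the rated entries only
        self.mean = features.sum(axis=0) / count
        return self

    def transform(self, features):
        if self.axis == 1:
            features = features.T
        binary = (features > 0)
        # Subtract the mean, but only where a rating exists
        features = (features - self.mean) * binary
        if self.axis == 1:
            features = features.T
        return features

    def fit_transform(self, features):
        return self.fit(features).transform(features)

    def inverse_transform(self, features, copy=True):
        if copy:
            features = features.copy()
        if self.axis == 1:
            features = features.T
        # Add the means back to every entry (including the predicted ones)
        features = features + self.mean
        if self.axis == 1:
            features = features.T
        return features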
We are now going to use this binary matrix to make predictions of movie ratings. The general algorithm will be (in pseudo code) as follows:

1. For each user, rank every other user in terms of closeness. For this step, we will use the binary matrix and use correlation as the measure of closeness (interpreting the binary matrix as zeros and ones allows us to perform this computation).
2. When we need to estimate a rating for a (user, movie) pair, we look at all the users who have rated that movie and split them into two groups: the most similar half and the most dissimilar half. We then use the average of the most similar half as the prediction.

We can use the scipy.spatial.distance.pdist function to obtain the distance between all the users as a matrix. This function returns the correlation distance, which transforms the correlation value by inverting it so that larger numbers mean less similar. Mathematically, the correlation distance is 1 - r, where r is the correlation value. The code is as follows:

>>> from scipy.spatial import distance
>>> # compute all pair-wise distances:
>>> dists = distance.pdist(binary, 'correlation')
>>> # Convert to square form, so that dists[i,j]
>>> # is distance between binary[i] and binary[j]:
>>> dists = distance.squareform(dists)
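To make the correlation distance concrete, here is a small, self-contained illustration on made-up data (these toy users are not taken from the MovieLens dataset): two users who rated mostly the same movies get a distance close to 0, while users with opposite rating patterns get a distance close to 2.

>>> import numpy as np
>>> from scipy.spatial import distance
>>> # Three toy users over six movies (1 = rated, 0 = not rated)
>>> toy = np.array([
...     [1, 1, 1, 0, 0, 0],   # user A
...     [1, 1, 0, 0, 0, 0],   # user B, rates almost the same movies as A
...     [0, 0, 0, 1, 1, 1],   # user C, rates exactly the opposite movies
... ])
>>> d = distance.squareform(distance.pdist(toy, 'correlation'))
>>> d[0, 1]   # small value: A and B are positively correlated
>>> d[0, 2]   # close to 2: A and C are anti-correlated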
We can use this distance matrix to select the nearest neighbors of each user. These are the users that most resemble it:

>>> neighbors = dists.argsort(axis=1)

Now, we iterate over all users to estimate predictions for all inputs:

>>> import numpy as np
>>> # We are going to fill this matrix with results
>>> filled = train.copy()
>>> for u in range(filled.shape[0]):
...     # n_u is the list of neighbors of user u
...     n_u = neighbors[u, 1:]
...     for m in range(filled.shape[1]):
...         # get relevant reviews in order!
...         revs = [train[neigh, m]
...                 for neigh in n_u
...                 if binary[neigh, m]]
...         if len(revs):
...             # n is the number of reviews for this movie
...             n = len(revs)
...             # consider half of the reviews plus one
...             n //= 2
...             n += 1
...             revs = revs[:n]
...             filled[u, m] = np.mean(revs)

The tricky part in the preceding snippet is indexing by the right values to select the neighbors who have rated the movie. Then, we choose the half that is closest to the user (in the revs[:n] line) and average those. Because some films have many reviews and others very few, it is hard to find a single number of users that works for all cases. Choosing half of the available data is a more generic approach.
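Later in the chapter, this neighborhood logic is called through a function (corrneighbours.predict in the stacking code below); as noted further on, the book's code repository wraps the steps above into such a reusable function. A hedged sketch of what that wrapper might look like, assuming it receives an already-normalized training matrix and returns the filled-in matrix, is the following:

import numpy as np
from scipy.spatial import distance

def predict(train):
    '''Illustrative sketch: fill in missing ratings by averaging the closer
    half of the users who rated each movie. Assumes `train` has already
    been normalized and uses 0 to mark missing ratings.'''
    binary = (train > 0)
    # Correlation distance between users, based on who rated what
    dists = distance.squareform(distance.pdist(binary, 'correlation'))
    neighbors = dists.argsort(axis=1)
    filled = train.copy()
    for u in range(filled.shape[0]):
        # Skip the first neighbor: it is the user itself (distance 0)
        n_u = neighbors[u, 1:]
        for m in range(filled.shape[1]):
            revs = [train[neigh, m] for neigh in n_u if binary[neigh, m]]
            if len(revs):
                # Average only the closer half (plus one) of the raters
                n = len(revs) // 2 + 1
                filled[u, m] = np.mean(revs[:n])
    return filled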
To obtain the final result, we need to un-normalize the predictions as follows:

>>> predicted = norm.inverse_transform(filled)

We can use the same metrics we learned about in the previous chapter:

>>> from sklearn import metrics
>>> r2 = metrics.r2_score(test[test > 0], predicted[test > 0])
>>> print('R2 score (binary neighbors): {:.1%}'.format(r2))
R2 score (binary neighbors): 29.5%

The preceding code computes the result for user neighbors, but we can use it to compute the movie neighbors by simply transposing the input matrix. In fact, the code computes the neighbors of whatever the rows of its input matrix are. So we can rerun the preceding code by just inserting the following line at the top:

>>> reviews = reviews.T
>>> # use same code as before …
>>> r2 = metrics.r2_score(test[test > 0], predicted[test > 0])
>>> print('R2 score (binary movie neighbors): {:.1%}'.format(r2))
R2 score (binary movie neighbors): 29.8%

Thus, we can see that the results are not that different. In this book's code repository, the neighborhood code has been wrapped into a simple function, which makes it easier to reuse.

A regression approach to recommendations

An alternative to neighborhoods is to formulate recommendations as a regression problem and apply the methods that we learned in the previous chapter. We also consider why this problem is not a good fit for a classification formulation. We could certainly attempt to learn a five-class model, using one class for each possible movie rating.
There are two problems with this approach:

• The different possible errors are not at all the same. For example, mistaking a 5-star movie for a 4-star one is not as serious a mistake as mistaking a 5-star movie for a 1-star one.
• Intermediate values make sense. Even if our inputs are only integer values, it is perfectly meaningful to say that the prediction is 4.3. We can see that this is a different prediction than 3.5, even if they both round to 4.

These two factors together mean that classification is not a good fit for the problem. The regression framework is a better fit.

For a basic approach, we again have two choices: we can build movie-specific or user-specific models. In our case, we are going to first build user-specific models. This means that, for each user, we take the movies that the user has rated as our target variable. The inputs are the ratings of the other users. We hypothesize that the regression will give a high weight to users who are similar to our user (or a negative weight to users who like the same movies that our user dislikes).

Setting up the train and test matrices is as before (including running the normalization steps).
Therefore, we jump directly to the learning step. First, we instantiate a regressor as follows:

>>> from sklearn.linear_model import ElasticNetCV
>>> reg = ElasticNetCV(alphas=[0.0125, 0.025, 0.05, .125, .25, .5, 1., 2., 4.])

We build a data matrix, which will contain a rating for every (user, movie) pair. We initialize it as a copy of the training data:

>>> filled = train.copy()

Now, we iterate over all the users, and each time learn a regression model based only on the data that that user has given us:

>>> for u in range(train.shape[0]):
...     curtrain = np.delete(train, u, axis=0)
...     # bu records which movies this user has rated
...     bu = binary[u]
...     # fit the current user based on everybody else
...     reg.fit(curtrain[:,bu].T, train[u, bu])
...     # Fill in all the missing ratings
...     filled[u, ~bu] = reg.predict(curtrain[:,~bu].T)

Evaluating the method can be done exactly as before:

>>> predicted = norm.inverse_transform(filled)
>>> r2 = metrics.r2_score(test[test > 0], predicted[test > 0])
>>> print('R2 score (user regression): {:.1%}'.format(r2))
R2 score (user regression): 32.3%

As before, we can adapt this code to perform movie regression by using the transposed matrix; a sketch of that variant follows.
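The book's own implementation reruns the whole pipeline (loading, normalization, regression) on the transposed ratings matrix. Purely as a sketch of the loop's shape, with illustrative variable names, the movie-wise variant might look like this:

>>> # Movie-wise regression: run the same procedure on the transposed data
>>> train_t = train.T          # rows are now movies, columns are users
>>> binary_t = binary.T
>>> filled_t = train_t.copy()
>>> for m in range(train_t.shape[0]):
...     curtrain = np.delete(train_t, m, axis=0)
...     # bm records which users have rated this movie
...     # (movies with no ratings at all would need to be skipped in practice)
...     bm = binary_t[m]
...     reg.fit(curtrain[:, bm].T, train_t[m, bm])
...     filled_t[m, ~bm] = reg.predict(curtrain[:, ~bm].T)
>>> # Transpose back so that rows are users again
>>> filled_movies = filled_t.T

As with the user-wise version, the result would still have to be un-normalized with the appropriate normalizer before evaluation.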
Combining multiple methods

We now combine the aforementioned methods into a single prediction. This seems intuitively a good idea, but how can we do this in practice? Perhaps the first thought that comes to mind is that we can average the predictions. This might give decent results, but there is no reason to think that all the estimated predictions should be treated the same. It might be that one is better than the others.

We can try a weighted average, multiplying each prediction by a given weight before summing it all up. How do we find the best weights, though? We learn them from the data, of course!

Ensemble learning

We are using a general technique in machine learning, which is not only applicable in regression: ensemble learning. We learn an ensemble (that is, a set) of predictors.
Then, we combine them to obtain a single output. What is interesting is that we can see each prediction as being a new feature, and we are now just combining features based on training data, which is what we have been doing all along. Note that we are doing so for regression here, but the same reasoning is applicable to classification: you learn several classifiers, then a master classifier, which takes the output of all of them and gives a final prediction. Different forms of ensemble learning differ in how you combine the base predictors.

In order to combine the methods, we will use a technique called stacked learning. The idea is that you learn a set of predictors, then you use the output of these predictors as features for another predictor. You can even have several layers, where each layer learns by using the output of the previous layer as features for its prediction. Have a look at the following diagram:

[Diagram: stacked learning, in which the outputs of a first layer of predictors become the input features of the next layer, which produces the final prediction]
Have alook at the following diagram:[ 186 ]Chapter 8In order to fit this combination model, we will split the training data into two.Alternatively, we could have used cross-validation (the original stacked learningmodel worked like this). However, in this case, we have enough data to obtain goodestimates by leaving some aside.As when fitting hyperparameters, though, we need two layers of training/testingsplits: a first, higher-level split, and then, inside the training split, a second split tobe able to fit the stacked learner, as show in the following:>>> train,test = load_ml100k.get_train_test(random_state=12)>>> # Now split the training again into two subgroups>>> tr_train,tr_test = load_ml100k.get_train_test(train,random_state=34)>>> # Call all the methods we previously defined:>>> # these have been implemented as functions:>>> tr_predicted0 = regression.predict(tr_train)>>> tr_predicted1 = regression.predict(tr_train.T).T>>> tr_predicted2 = corrneighbours.predict(tr_train)>>> tr_predicted3 = corrneighbours.predict(tr_train.T).T>>> tr_predicted4 = norm.predict(tr_train)>>> tr_predicted5 = norm.predict(tr_train.T).T>>> # Now assemble these predictions into a single array:>>> stack_tr = np.array([...tr_predicted0[tr_test > 0],...tr_predicted1[tr_test > 0],...tr_predicted2[tr_test > 0],...tr_predicted3[tr_test > 0],...tr_predicted4[tr_test > 0],...tr_predicted5[tr_test > 0],...]).T>>> # Fit a simple linear regression>>> lr = linear_model.LinearRegression()>>> lr.fit(stack_tr, tr_test[tr_test > 0])[ 187 ]RecommendationsNow, we apply the whole process to the testing split and evaluate:>>> stack_te = np.array([...tr_predicted0.ravel(),...tr_predicted1.ravel(),...tr_predicted2.ravel(),...tr_predicted3.ravel(),...tr_predicted4.ravel(),...tr_predicted5.ravel(),...]).T>>> predicted = lr.predict(stack_te).reshape(train.shape)Evaluation is as before:>>> r2 = metrics.r2_score(test[test > 0], predicted[test > 0])>>> print('R2 stacked: {:.2%}'.format(r2))R2 stacked: 33.15%The result of stacked learning is better than what any single method had achieved.