Building machine learning systems with Python (779436), page 30

In this machine learning problem, you not only have the information about which films the user saw but also about how the user rated them.

In 2006, Netflix made a large number of customer ratings of films in its database available for a public challenge. The goal was to improve on their in-house algorithm for rating prediction.

Whoever was able to beat it by 10 percent or more would win 1 million dollars. In 2009, an international team named BellKor's Pragmatic Chaos was able to beat this mark and take the prize. They did so just 20 minutes before another team, The Ensemble, passed the 10 percent mark as well: an exciting photo finish for a competition that lasted several years.

Machine learning in the real world

Much has been written about the Netflix Prize, and you may learn a lot by reading up on it.

The techniques that won were a mixture of advanced machine learning and a lot of work put into preprocessing the data. For example, some users like to rate everything very highly, while others are always more negative; if you do not account for this in preprocessing, your model will suffer. Other normalizations were also necessary for a good result, such as accounting for how old the film is and how many ratings it received.

Good algorithms are a good thing, but you always need to "get your hands dirty" and tune your methods to the properties of the data you have in front of you. Preprocessing and normalizing the data is often the most time-consuming part of the machine learning process. However, this is also the place where one can have the biggest impact on the final performance of the system.

The first thing to note about the Netflix Prize is how hard it was. Roughly speaking, the internal system that Netflix used was about 10 percent better than no recommendations (that is, assigning each movie just the average value for all users). The goal was to obtain just another 10 percent improvement on this. In total, the winning system was roughly just 20 percent better than no personalization.

Yet, it took a tremendous amount of time and effort to achieve this goal. And even though 20 percent does not seem like much, the result is a system that is useful in practice.

Unfortunately, for legal reasons, this dataset is no longer available. Although the data was anonymous, there were concerns that it might be possible to discover who the clients were and reveal private details of movie rentals.

However, we can use an academic dataset with similar characteristics. This data comes from GroupLens, a research laboratory at the University of Minnesota.

How can we solve a Netflix-style ratings prediction question? We will see two different approaches: neighborhood approaches and regression approaches. We will also see how to combine these to obtain a single prediction.

Splitting into training and testing

At a high level, splitting the dataset into training and testing data in order to obtain a principled estimate of the system's performance is performed as in previous chapters: we take a certain fraction of our data points (we will use 10 percent) and reserve them for testing; the rest will be used for training. However, because the data is structured differently in this context, the code is different.

The first step is to load the data from disk, for which we use the following function:

    import numpy as np
    from scipy import sparse

    def load():
        # columns of u.data: user id, movie id, rating, timestamp
        data = np.loadtxt('data/ml-100k/u.data')
        # indices must be integers to build the sparse matrix
        ij = data[:, :2].astype(int)
        ij -= 1  # original data is in 1-based system
        values = data[:, 2]
        reviews = sparse.csc_matrix((values, ij.T)).astype(float)
        return reviews.toarray()

Note that zero entries in this matrix represent missing ratings.

    >>> reviews = load()
    >>> U, M = np.where(reviews)

We now use the standard random module to choose indices to test:

    >>> import random
    >>> test_idxs = np.array(random.sample(range(len(U)), len(U)//10))

Now, we build the train matrix, which is like reviews, but with the testing entries set to zero:

    >>> train = reviews.copy()
    >>> train[U[test_idxs], M[test_idxs]] = 0

Finally, the test matrix contains just the testing values:

    >>> test = np.zeros_like(reviews)
    >>> test[U[test_idxs], M[test_idxs]] = reviews[U[test_idxs], M[test_idxs]]

From now on, we will work on taking the training data, and try to predict all the missing entries in the dataset.
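The split above can also be wrapped in a reusable function. The following is a minimal sketch rather than the book's code: the function name split_train_test, the seed parameter, and the synthetic toy matrix are assumptions made for illustration, and NumPy's random generator is swapped in for the standard random module so that the split is reproducible.

```python
import numpy as np

def split_train_test(reviews, test_fraction=0.1, seed=0):
    """Hold out a fraction of the observed (non-zero) ratings for testing."""
    rng = np.random.default_rng(seed)
    # Coordinates of the ratings that are present (zero means "missing")
    U, M = np.nonzero(reviews)
    n_test = int(len(U) * test_fraction)
    # Sample positions in U/M without replacement
    test_idxs = rng.choice(len(U), size=n_test, replace=False)

    # Training matrix: a copy of reviews with the held-out entries zeroed
    train = reviews.copy()
    train[U[test_idxs], M[test_idxs]] = 0

    # Test matrix: only the held-out entries, zero everywhere else
    test = np.zeros_like(reviews)
    test[U[test_idxs], M[test_idxs]] = reviews[U[test_idxs], M[test_idxs]]
    return train, test

# Synthetic stand-in for the ratings matrix (hypothetical data, not MovieLens)
rng = np.random.default_rng(42)
reviews = rng.integers(1, 6, size=(20, 15)).astype(float)
reviews[rng.random(reviews.shape) < 0.5] = 0  # mark roughly half as missing

train, test = split_train_test(reviews)
```

Because every observed rating ends up in exactly one of the two matrices, train + test reproduces reviews. The held-out entries in test are never seen during training, so they stand in for the missing entries that the model must predict.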

That is, we will write code that assigns each (user, movie) pair a recommendation.

Normalizing the training data

As we discussed, it is best to normalize the data to remove obvious movie or user-specific effects. We will just use one very simple type of normalization, which we used before: conversion to z-scores.

Unfortunately, we cannot simply use scikit-learn's normalization objects as we have to deal with the missing values in our data (that is, not all movies were rated by all users). Thus, we want to normalize by the mean and standard deviation of the values that are, in fact, present.

We will write our own class, which ignores missing values. This class will follow the scikit-learn preprocessing API:

    class NormalizePositive(object):

We want to choose the axis of normalization. By default, we normalize along the first axis, but sometimes it will be useful to normalize along the second one.

This follows the convention of many other NumPy-related functions:

        def __init__(self, axis=0):
            self.axis = axis

The most important method is the fit method. In our implementation, we compute the mean and standard deviation of the values that are not zero. Recall that zeros indicate "missing values":

        def fit(self, features, y=None):

If the axis is 1, we operate on the transposed array as follows:

            if self.axis == 1:
                features = features.T
            # count features that are greater than zero in axis 0:
            binary = (features > 0)
            count0 = binary.sum(axis=0)
            # to avoid division by zero, set zero counts to one:
            count0[count0 == 0] = 1
            # computing the mean is easy:
            self.mean = features.sum(axis=0) / count0
            # only consider differences where binary is True:
            diff = (features - self.mean) * binary
            diff **= 2
            # regularize the estimate of std by adding 0.1 to the variance
            self.std = np.sqrt(0.1 + diff.sum(axis=0) / count0)
            return self

We add 0.1 to the variance estimate before taking the square root to avoid underestimating the value of the standard deviation when there are only a few samples, all of which may be exactly the same.

The exact value used does not matter much for the final result, but we need to avoid division by zero.

The transform method needs to take care of maintaining the binary structure as follows:

        def transform(self, features):
            if self.axis == 1:
                features = features.T
            binary = (features > 0)
            features = features - self.mean
            features /= self.std
            features *= binary
            if self.axis == 1:
                features = features.T
            return features

Notice how we took care of transposing the input matrix when the axis is 1 and then transformed it back so that the return value has the same shape as the input. The inverse_transform method performs the inverse operation to transform as shown here:

        def inverse_transform(self, features, copy=True):
            if copy:
                features = features.copy()
            if self.axis == 1:
                features = features.T
            features *= self.std
            features += self.mean
            if self.axis == 1:
                features = features.T
            return features

Finally, we add the fit_transform method which, as the name indicates, combines both the fit and transform operations:

        def fit_transform(self, features):
            return self.fit(features).transform(features)

The methods that we defined (fit, transform, inverse_transform, and fit_transform) follow the same interface as the objects defined in the sklearn.preprocessing module.
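To make the arithmetic concrete, the following sketch reproduces by hand what fit, transform, and inverse_transform compute along the default axis (axis=0, one set of statistics per column). The 3x3 matrix is made up for illustration and is not part of the MovieLens data.

```python
import numpy as np

# Toy ratings matrix (hypothetical); zeros mark missing ratings
ratings = np.array([[5., 0., 3.],
                    [4., 2., 0.],
                    [0., 1., 4.]])
observed = ratings > 0

# fit: statistics over the observed entries of each column only
count = observed.sum(axis=0)                 # [2, 2, 2]
mean = ratings.sum(axis=0) / count           # [4.5, 1.5, 3.5]
var = (((ratings - mean) * observed) ** 2).sum(axis=0) / count  # [0.25, 0.25, 0.25]
std = np.sqrt(0.1 + var)                     # 0.1 is the regularization term

# transform: z-scores, with missing entries forced back to zero
z = (ratings - mean) / std * observed

# inverse_transform: undo the scaling and the shift
restored = z * std + mean
```

The round trip recovers the observed ratings, while missing entries come back as the column mean rather than zero, since the inverse operation has no record of which entries were missing.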

In the following sections, we will first normalize the inputs, generate normalized predictions, and finally apply the inverse transformation to obtain the final predictions.

A neighborhood approach to recommendations

The neighborhood concept can be implemented in two ways: user neighbors or movie neighbors. User neighborhoods are based on a very simple concept: to know how a user will rate a movie, find the users most similar to them, and look at their ratings. We will only consider user neighbors for the moment. At the end of this section, we will discuss how the code can be adapted to compute movie neighbors.

One of the interesting techniques that we will now explore is to just see which movies each user has rated, even without taking a look at what rating was given. Even with a binary matrix where we have an entry equal to 1 when a user rates a movie, and 0 when they did not, we can make useful predictions.

In hindsight, this makes perfect sense; we do not completely randomly choose movies to watch, but instead pick those where we already have an expectation of liking them. We also do not make random choices of which movies to rate, but perhaps only rate those we feel most strongly about (naturally, there are exceptions, but on average this is probably true).

We can visualize the values of the matrix as an image, where each rating is depicted as a little square.
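As a rough illustration of the binary idea described above, the sketch below finds the users whose "rated / did not rate" pattern is most similar to that of a given user. It is only one possible reading of the approach: the function name nearest_users_binary, the choice of Pearson correlation as the similarity measure, and the toy matrix are all assumptions, not something this excerpt specifies.

```python
import numpy as np

def nearest_users_binary(reviews, user, k=2):
    """Indices of the k users whose rated/not-rated pattern best matches `user`."""
    # 1 where a rating exists, 0 where it is missing; rating values are ignored
    binary = (reviews > 0).astype(float)
    # np.corrcoef treats each row as a variable: result is users x users
    corr = np.corrcoef(binary)
    scores = corr[user].copy()
    scores[user] = -np.inf                  # never return the user themselves
    scores[np.isnan(scores)] = -np.inf      # constant rows have no defined correlation
    # Highest correlation first
    return np.argsort(scores)[::-1][:k]

# Toy data (hypothetical): rows are users, columns are movies
reviews = np.array([
    [5, 3, 0, 0, 1],   # user 0 rated movies 0, 1, 4
    [1, 2, 0, 0, 4],   # user 1 rated the same movies, with different ratings
    [0, 0, 4, 5, 0],   # user 2 rated exactly the movies user 0 skipped
    [4, 0, 0, 3, 2],   # user 3 overlaps with user 0 on movies 0 and 4
], dtype=float)

neighbors = nearest_users_binary(reviews, user=0, k=2)
```

Here user 1 comes out as the closest neighbor of user 0 even though their actual ratings disagree, because only the pattern of which movies were rated is compared.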
