Of course, we will never know whether there is the one golden feature we just did not happen to think of. But for now, let's move on to another classification method that is known to work great in text-based classification scenarios.

Using logistic regression

Contrary to its name, logistic regression is a classification method.
It is a very powerful one when it comes to text-based classification; it achieves this by first doing a regression on a logistic function, hence the name.

A bit of math with a small example

To get an initial understanding of the way logistic regression works, let's first take a look at the following example, where we have artificial feature values X plotted with the corresponding classes, 0 or 1. As we can see, the data is noisy, such that the classes overlap in the feature value range between 1 and 6. Therefore, it is better to not directly model the discrete classes, but rather the probability that a feature value belongs to class 1, P(X).
Once we possess such a model, we could then predict class 1 if P(X) > 0.5, and class 0 otherwise.

Mathematically, it is always difficult to model something that has a finite range, as is the case here with our discrete labels 0 and 1. We can, however, tweak the probabilities a bit so that they always stay between 0 and 1.
And for that, we will need the odds ratio and the logarithm of it.

Let's say a feature has the probability of 0.9 that it belongs to class 1, P(y=1) = 0.9. The odds ratio is then P(y=1)/P(y=0) = 0.9/0.1 = 9. We could say that the chance is 9:1 that this feature maps to class 1. If P(y=1) = 0.5, we would consequently have a 1:1 chance that the instance is of class 1. The odds ratio is bounded below by 0, but grows to infinity (the left graph in the following set of graphs). If we now take the logarithm of it, we can map all probabilities between 0 and 1 to the full range from negative to positive infinity (the right graph in the following set of graphs).
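To make this concrete, here is a tiny sketch (not part of the book's code) that replays the arithmetic in NumPy for a handful of made-up probabilities:

import numpy as np

# Made-up example probabilities for P(y=1), for illustration only.
probs = np.array([0.1, 0.5, 0.9, 0.99])

odds = probs / (1 - probs)    # odds ratio P(y=1)/P(y=0): bounded below by 0
log_odds = np.log(odds)       # log of odds: covers the whole real line

for p, o, lo in zip(probs, odds, log_odds):
    print("P(y=1)=%.2f  odds=%6.2f  log(odds)=%6.2f" % (p, o, lo))

As expected, P(y=1)=0.5 maps to odds of 1 and a log of odds of 0, while probabilities close to 0 or 1 are pushed far down or up the real line.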
The nice thing is that we still maintain the relationship that a higher probability leads to a higher log of odds, just not limited to 0 and 1 anymore.

This means that we can now fit linear combinations of our features (OK, we only have one and a constant, but that will change soon) to the log(odds) values. In a sense, we replace the linear model from Chapter 1, Getting Started with Python Machine Learning, $y_i = c_0 + c_1 x_i$, with the following (replacing $y_i$ with the log of the odds):

$$\log\left(\frac{p_i}{1 - p_i}\right) = c_0 + c_1 x_i$$

We can solve this for $p_i$, so that we have:

$$p_i = \frac{1}{1 + e^{-(c_0 + c_1 x_i)}}$$

We simply have to find the right coefficients, such that the formula gives the lowest errors for all our $(x_i, p_i)$ pairs in our data set, but that will be done by scikit-learn. After fitting, the formula will give the probability for every new data point x that it belongs to class 1:

>>> from sklearn.linear_model import LogisticRegression
>>> clf = LogisticRegression()
>>> print(clf)
LogisticRegression(C=1.0, class_weight=None, dual=False,
    fit_intercept=True, intercept_scaling=1, penalty='l2', tol=0.0001)
>>> clf.fit(X, y)
>>> print(np.exp(clf.intercept_), np.exp(clf.coef_.ravel()))
[ 0.09437188] [ 1.80094112]
>>> def lr_model(clf, X):
...     return 1 / (1 + np.exp(-(clf.intercept_ + clf.coef_ * X)))
>>> print("P(x=-1)=%.2f\tP(x=7)=%.2f" % (lr_model(clf, -1), lr_model(clf, 7)))
P(x=-1)=0.05    P(x=7)=0.85

You might have noticed that scikit-learn exposes the first coefficient through the special field intercept_.

If we plot the fitted model, we see that it makes perfect sense given the data:
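The resulting plot is shown as a figure in the book. A minimal sketch of how such a plot could be reproduced, assuming the one-dimensional artificial feature values in X, their 0/1 labels in y, and the clf fitted above, might look like this:

import numpy as np
import matplotlib.pyplot as plt

# Assumed: X is the 1-D artificial feature, y its 0/1 labels, clf the fitted model.
xs = np.linspace(np.min(X) - 1, np.max(X) + 1, 200)
ps = 1 / (1 + np.exp(-(clf.intercept_ + clf.coef_.ravel() * xs)))  # model's P(y=1|x)

plt.scatter(np.ravel(X), y, label="data (classes 0 and 1)")
plt.plot(xs, ps, label="fitted P(y=1|x)")
plt.axhline(0.5, linestyle="--", label="decision threshold 0.5")
plt.xlabel("feature value")
plt.ylabel("class / probability")
plt.legend(loc="best")
plt.show()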
Applying logistic regression to our post classification problem

Admittedly, the example in the previous section was created to show the beauty of logistic regression. How does it perform on the real, noisy data?

Comparing it to the best nearest neighbor classifier (k=40) as a baseline, we see that it performs a bit better, but also won't change the situation a whole lot.

Method            mean(scores)    stddev(scores)
LogReg C=0.1      0.64650         0.03139
LogReg C=1.00     0.64650         0.03155
LogReg C=10.00    0.64550         0.03102
LogReg C=0.01     0.63850         0.01950
40NN              0.62800         0.03750

We have shown the accuracy for different values of the regularization parameter C. With it, we can control the model complexity, similar to the parameter k for the nearest neighbor method. Smaller values for C result in more penalization of the model complexity.
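A table like the preceding one can be reproduced with a plain cross-validation loop. The following is only a sketch of the idea: it assumes the feature matrix X and the labels Y built earlier in the chapter, and a generic 10-fold cross-validation rather than the chapter's exact splitting setup:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in old releases

# Assumed: X holds the answer features and Y the good/bad labels from earlier.
for C in [0.01, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(C=C)
    scores = cross_val_score(clf, X, Y, cv=10, scoring="accuracy")
    print("LogReg C=%5.2f  mean=%.5f  stddev=%.5f"
          % (C, scores.mean(), scores.std()))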
A quick look at the bias-variance chart for one of our best candidates, C=0.1, shows that our model has high bias: the test and train error curves approach each other closely, but stay at unacceptably high values. This indicates that logistic regression with the current feature space is under-fitting and cannot learn a model that captures the data correctly.

So what now? We switched the model and tuned it as much as we could with our current state of knowledge, but we still have no acceptable classifier. More and more it seems that either the data is too noisy for this task, or that our set of features is still not appropriate to discriminate the classes well enough.
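As an aside, a chart like the bias-variance one just discussed can be approximated with scikit-learn's learning_curve helper. Again, this is only a sketch that assumes the chapter's X and Y:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve  # sklearn.learning_curve in old releases

# Assumed: X and Y are the chapter's features and labels.
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(C=0.1), X, Y, cv=10,
    train_sizes=np.linspace(0.1, 1.0, 10), scoring="accuracy")

# Plot error (1 - accuracy) so the curves read like the book's error charts.
plt.plot(train_sizes, 1 - train_scores.mean(axis=1), label="train error")
plt.plot(train_sizes, 1 - test_scores.mean(axis=1), label="test error")
plt.xlabel("training set size")
plt.ylabel("error")
plt.legend(loc="best")
plt.show()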
Looking behind accuracy – precision and recall

Let's step back and think again about what we are trying to achieve here. Actually, we do not need a classifier that perfectly predicts good and bad answers, as we have measured it until now using accuracy. If we can tune the classifier to be particularly good at predicting one class, we could adapt the feedback to the user accordingly. If we, for example, had a classifier that was always right when it predicted an answer to be bad, we would give no feedback until the classifier detected the answer to be bad. Conversely, if the classifier excelled at predicting answers to be good, we could show helpful comments to the user at the beginning and remove them once the classifier said that the answer is a good one.

To find out which of these situations we are in, we have to understand how to measure precision and recall. And to understand that, we have to look into the four distinct classification results, as they are described in the following table:

                          Classified as
                          Positive                Negative
In reality   Positive     True positive (TP)      False negative (FN)
it is        Negative     False positive (FP)     True negative (TN)

For instance, if the classifier predicts an instance to be positive, and the instance indeed is positive in reality, this is a true positive instance.
If, on the other hand, the classifier misclassifies that instance, saying that it is negative while in reality it was positive, that instance is said to be a false negative.

What we want is to have a high success rate when we are predicting a post as either good or bad, but not necessarily both. That is, we want as many true positives as possible. This is what precision captures:

$$\text{Precision} = \frac{TP}{TP + FP}$$

If instead our goal had been to detect as many good or bad answers as possible, we would be more interested in recall:

$$\text{Recall} = \frac{TP}{TP + FN}$$

In terms of the following graphic, precision is the fraction of the right circle covered by the intersection, while recall is the fraction of the left circle covered by the intersection.
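In code, these four counts can be read off scikit-learn's confusion_matrix, from which precision and recall follow directly. This is a small sketch, assuming the held-out X_test and y_test and the fitted clf from the chapter:

from sklearn.metrics import confusion_matrix

# Assumed: y_test holds the true 0/1 labels and clf is the classifier fitted earlier.
y_pred = clf.predict(X_test)

# With labels=[0, 1], rows are the true class and columns the predicted class.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

precision = tp / float(tp + fp)   # how often a positive prediction is correct
recall = tp / float(tp + fn)      # how many of the real positives we actually find
print("precision=%.2f  recall=%.2f" % (precision, recall))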
So, how can we now optimize for precision? Up to now, we have always used 0.5 as the threshold to decide whether an answer is good or not. What we can do now is count the number of TP, FP, and FN while varying that threshold between 0 and 1. With those counts, we can then plot precision over recall.

The handy function precision_recall_curve() from the metrics module does all the calculations for us:

>>> from sklearn.metrics import precision_recall_curve
>>> precision, recall, thresholds = precision_recall_curve(y_test,
...     clf.predict(X_test))
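Note that precision_recall_curve() is usually fed class probabilities or decision scores rather than hard 0/1 predictions, so that sweeping the threshold actually traces out a curve. A plotting sketch under that assumption, reusing clf, X_test, and y_test, might look like this:

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Assumed: clf, X_test, and y_test as above; we use the predicted probability
# of class 1 so that varying the threshold yields more than a single point.
proba = clf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)

plt.plot(recall, precision)
plt.xlabel("recall")
plt.ylabel("precision")
plt.title("precision/recall curve")
plt.show()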
Predicting one class with acceptable performance does not always mean that the classifier is also acceptable at predicting the other class. This can be seen in the following two plots, where we plot the precision/recall curves for classifying bad (the left graph) and good (the right graph) answers.

In the graphs, we have also included a much better description of a classifier's performance: the area under curve (AUC). It can be understood as the average precision of the classifier and is a great way of comparing different classifiers.

We see that we can basically forget predicting bad answers (the left plot): precision drops already at very low recall values and stays at an unacceptably low 60 percent. Predicting good answers, however, shows that we can get above 80 percent precision at a recall of almost 40 percent.
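For reference, an AUC value like the one shown in these plots can be computed from the curve points with auc(), or summarized with average_precision_score(). A sketch, again assuming y_test and the class 1 probabilities on the test set:

from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Assumed: clf, X_test, and y_test as before.
proba = clf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)

# Area under the precision/recall curve (trapezoidal rule over the curve points).
print("AUC of the P/R curve: %.2f" % auc(recall, precision))

# Closely related summary: precision averaged over the achieved recall levels.
print("Average precision:    %.2f" % average_precision_score(y_test, proba))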