Let's find out what threshold we need for that. As we trained many classifiers on different folds (remember, we iterated over KFold() a couple of pages back), we need to retrieve the classifier that was neither too bad nor too good in order to get a realistic view. Let's call it the medium clone:

>>> medium = np.argsort(scores)[int(len(scores) / 2)]
>>> thresholds = np.hstack(([0], thresholds[medium]))
>>> idx80 = precisions[medium] >= 0.8
>>> print("P=%.2f R=%.2f thresh=%.2f" % (precisions[medium][idx80][0],
...     recalls[medium][idx80][0], thresholds[idx80][0]))
P=0.80 R=0.37 thresh=0.59
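As a reminder, scores, precisions, recalls, and thresholds are the per-fold lists that were filled in the KFold() loop a couple of pages back. The following is only a rough sketch of how they might have been collected; the variable names and the exact setup are assumptions here, not the book's verbatim code:

# Sketch: collect an F1 score and one precision/recall curve per fold
from sklearn.cross_validation import KFold  # sklearn.model_selection in newer releases
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, f1_score

scores, precisions, recalls, thresholds = [], [], [], []
for train, test in KFold(n=len(X), n_folds=10, shuffle=True):
    clf = LogisticRegression()
    clf.fit(X[train], Y[train])
    proba = clf.predict_proba(X[test])[:, 1]  # probability of the "good" class
    scores.append(f1_score(Y[test], proba > 0.5))
    precision, recall, pr_thresholds = precision_recall_curve(Y[test], proba)
    precisions.append(precision)
    recalls.append(recall)
    thresholds.append(pr_thresholds)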
Setting the threshold at 0.59, we see that we can still achieve a precision of 80 percent in detecting good answers when we accept a low recall of 37 percent. That means that we would detect only one in three good answers as such. But from that third of good answers that we manage to detect, we would be reasonably sure that they are indeed good. For the rest, we could then politely display additional hints on how to improve answers in general.

To apply this threshold in the prediction process, we have to use predict_proba(), which returns per-class probabilities, instead of predict(), which returns the class itself:

>>> thresh80 = thresholds[idx80][0]
>>> probs_for_good = clf.predict_proba(answer_features)[:, 1]
>>> answer_class = probs_for_good > thresh80

We can confirm that we are in the desired precision/recall range using classification_report:

>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_test, clf.predict_proba(X_test)[:, 1] > 0.63,
...     target_names=['not accepted', 'accepted']))
              precision    recall  f1-score   support

not accepted       0.59      0.85      0.70       101
    accepted       0.73      0.40      0.52        99

 avg / total       0.66      0.63      0.61       200

Note that using the threshold does not guarantee that we will always stay above the precision and recall values that we determined above together with it.

Slimming the classifier

It is always worth looking at the actual contributions of the individual features. For logistic regression, we can directly take the learned coefficients (clf.coef_) to get an impression of the features' impact.
The higher the coefficient of a feature, the more the feature plays a role in determining whether the post is good or not. Consequently, negative coefficients tell us that higher values of the corresponding feature are a stronger signal for the post to be classified as bad. We see that LinkCount, AvgWordLen, NumAllCaps, and NumExclams have the biggest impact on the overall classification decision, while NumImages (a feature that we sneaked in just for demonstration purposes a second ago) and AvgSentLen play a rather minor role. While the feature importance overall makes sense intuitively, it is surprising that NumImages is basically ignored. Normally, answers containing images are always rated high. In reality, however, answers very rarely contain images. So, although in principle it is a very powerful feature, it is too sparse to be of any value. We could easily drop that feature and retain the same classification performance.
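The following is a minimal sketch of how such an inspection could look. It assumes that feature_names lists the features in the same column order that was used to build the training matrix; this list is an illustration, not part of the book's code:

>>> # clf.coef_ has shape (1, n_features) for binary logistic regression;
>>> # pair each coefficient with its feature name and sort by magnitude.
>>> feature_names = ['LinkCount', 'NumTextTokens', 'NumCodeLines', 'AvgSentLen',
...                  'AvgWordLen', 'NumAllCaps', 'NumExclams', 'NumImages']
>>> for name, coef in sorted(zip(feature_names, clf.coef_[0]),
...                          key=lambda nc: -abs(nc[1])):
...     print("%-15s %+.2f" % (name, coef))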
Ship it!

Let's assume we want to integrate this classifier into our site. What we definitely do not want is to train the classifier each time we start the classification service.
Instead, we can simply serialize the classifier after training and then deserialize it on the site:

>>> import pickle
>>> pickle.dump(clf, open("logreg.dat", "wb"))
>>> clf = pickle.load(open("logreg.dat", "rb"))

Congratulations, the classifier is now ready to be used as if it had just been trained.

Summary

We made it! For a very noisy dataset, we built a classifier that suits part of our goal. Of course, we had to be pragmatic and adapt our initial goal to what was achievable. But on the way we learned about the strengths and weaknesses of nearest neighbor and logistic regression.
We learned how to extract features such as LinkCount, NumTextTokens, NumCodeLines, AvgSentLen, AvgWordLen, NumAllCaps, NumExclams, and NumImages, and how to analyze their impact on the classifier's performance. But what is even more valuable is that we learned an informed way of debugging badly performing classifiers. That will help us in the future to come up with usable systems much faster. After having looked into nearest neighbor and logistic regression, in the next chapter we will get familiar with yet another simple yet powerful classification algorithm: Naïve Bayes.
Along the way, we will also learn about some more convenient tools from scikit-learn.

Classification II – Sentiment Analysis

For companies, it is vital to closely monitor the public reception of key events, such as product launches or press releases. With Twitter's real-time access to and easy accessibility of user-generated content, it is now possible to do sentiment classification of tweets. Sometimes also called opinion mining, it is an active field of research in which several companies are already selling such services.
As this shows that there obviously exists a market, we have motivation to use the classification muscles we built in the last chapter to build our own home-grown sentiment classifier.

Sketching our roadmap

Sentiment analysis of tweets is particularly hard because of Twitter's size limit of 140 characters. This leads to a special syntax, creative abbreviations, and seldom well-formed sentences. The typical approach of analyzing sentences, aggregating their sentiment information per paragraph, and then calculating the overall sentiment of a document does not work here.

Clearly, we will not try to build a state-of-the-art sentiment classifier. Instead, we want to:

• Use this scenario as a vehicle to introduce yet another classification algorithm, Naïve Bayes
• Explain how Part Of Speech (POS) tagging works and how it can help us
• Show some more tricks from the scikit-learn toolbox that come in handy from time to time

Fetching the Twitter data

Naturally, we need tweets and their corresponding labels that tell whether a tweet contains a positive, negative, or neutral sentiment.
In this chapter, we will use the corpus from Niek Sanders, who has done an awesome job of manually labeling more than 5,000 tweets and has granted us permission to use it in this chapter. To comply with Twitter's terms of service, we will not provide any data from Twitter nor show any real tweets in this chapter. Instead, we can use Sanders' hand-labeled data, which contains the tweet IDs and their hand-labeled sentiment, and use his script, install.py, to fetch the corresponding Twitter data. As the script plays nice with Twitter's servers, it will take quite some time to download all the data for more than 5,000 tweets. So it is a good idea to start it right away.

The data comes with four sentiment labels:

>>> X, Y = load_sanders_data()
>>> classes = np.unique(Y)
>>> for c in classes: print("#%s: %i" % (c, sum(Y==c)))
#irrelevant: 490
#negative: 487
#neutral: 1952
#positive: 433

Inside load_sanders_data(), we are treating irrelevant and neutral labels together as neutral and dropping all non-English tweets, resulting in 3,362 tweets.
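A minimal sketch of the label merging done inside load_sanders_data() could look like this, assuming Y is a NumPy array of label strings (the non-English filtering is omitted):

>>> # Fold the "irrelevant" tweets into the "neutral" class, so that only
>>> # the three sentiment classes positive/negative/neutral remain.
>>> Y[Y == 'irrelevant'] = 'neutral'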
In case you get different counts here, it is because, in the meantime, tweets might have been deleted or set to private. In that case, you might also get slightly different numbers and graphs than the ones shown in the upcoming sections.

Introducing the Naïve Bayes classifier

Naïve Bayes is probably one of the most elegant machine learning algorithms out there that is of practical use. And despite its name, it is not that naïve when you look at its classification performance. It proves to be quite robust to irrelevant features, which it kindly ignores. It learns fast and predicts equally fast. It does not require lots of storage. So, why is it then called naïve?

The "naïve" was added to account for one assumption that is required for Naïve Bayes to work optimally.
The assumption is that the features do not influence each other. This, however, is rarely the case for real-world applications. Nevertheless, Naïve Bayes still returns very good accuracy in practice, even when the independence assumption does not hold.

Getting to know Bayes' theorem

At its core, Naïve Bayes classification is nothing more than keeping track of which feature gives evidence to which class. The way the features are designed determines the model that is used to learn. The so-called Bernoulli model only cares about Boolean features: whether a word occurs once or multiple times in a tweet does not matter. In contrast, the Multinomial model uses word counts as features. For the sake of simplicity, we will use the Bernoulli model to explain how to use Naïve Bayes for sentiment analysis. We will then use the Multinomial model later on to set up and tune our real-world classifiers.
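In scikit-learn, the two models correspond to the BernoulliNB and MultinomialNB classes. The following minimal sketch, using made-up toy tweets rather than the book's data, shows how the features would be prepared differently for the two models:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

tweets = ["awesome awesome phone", "crazy crazy bad service"]
labels = ["positive", "negative"]

# Bernoulli model: binary occurrence features (a word is present or not)
bool_features = CountVectorizer(binary=True).fit_transform(tweets)
bernoulli_clf = BernoulliNB().fit(bool_features, labels)

# Multinomial model: raw word counts as features
count_features = CountVectorizer().fit_transform(tweets)
multinomial_clf = MultinomialNB().fit(count_features, labels)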
Let's assume the following meanings for the variables that we will use to explain Naïve Bayes:

Variable   Meaning
$C$        The class of a tweet (positive or negative)
$F_1$      The word "awesome" occurs at least once in the tweet
$F_2$      The word "crazy" occurs at least once in the tweet

During training, we learn the Naïve Bayes model, which is the probability for a class $C$ when we already know the features $F_1$ and $F_2$.
This probability is written as $P(C|F_1, F_2)$.

Since we cannot estimate $P(C|F_1, F_2)$ directly, we apply a trick, which was found out by Bayes:

$$P(A|B) \cdot P(B) = P(B|A) \cdot P(A)$$

If we substitute $A$ with the probability of both words "awesome" and "crazy" and think of $B$ as being our class $C$, we arrive at the relationship that helps us to later retrieve the probability for the data instance belonging to the specified class:

$$P(F_1, F_2|C) \cdot P(C) = P(C|F_1, F_2) \cdot P(F_1, F_2)$$

This allows us to express $P(C|F_1, F_2)$ by means of the other probabilities:

$$P(C|F_1, F_2) = \frac{P(F_1, F_2|C) \cdot P(C)}{P(F_1, F_2)}$$

We could also describe this as follows:

$$\mathit{posterior} = \frac{\mathit{likelihood} \cdot \mathit{prior}}{\mathit{evidence}}$$

The prior and the evidence are easily determined:

• $P(C)$ is the prior probability of class $C$ without knowing about the data. We can estimate this quantity by simply calculating the fraction of all training data instances belonging to that particular class.
• $P(F_1, F_2)$ is the evidence, or the probability of features $F_1$ and $F_2$.
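As a tiny illustration of the prior, we can estimate $P(C)$ for every class as its relative frequency in the training labels. This is just a sketch, reusing the Y array loaded earlier with load_sanders_data():

>>> import numpy as np
>>> # P(C): fraction of training tweets carrying label c
>>> for c in np.unique(Y):
...     print("P(%s) = %.2f" % (c, np.mean(Y == c)))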