Let's find out what threshold we need for that. As we trained many classifiers on different folds (remember, we iterated over KFold() a couple of pages back), we need to retrieve the classifier that was neither too bad nor too good in order to get a realistic view. Let's call it the medium clone:

>>> medium = np.argsort(scores)[int(len(scores) / 2)]
>>> thresholds = np.hstack(([0], thresholds[medium]))
>>> idx80 = precisions[medium] >= 0.8
>>> print("P=%.2f R=%.2f thresh=%.2f" % (precisions[medium][idx80][0],
...     recalls[medium][idx80][0], thresholds[idx80][0]))
P=0.80 R=0.37 thresh=0.59
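As a reminder, scores, precisions, recalls, and thresholds are the per-fold lists that were filled in the KFold() loop a couple of pages back. The following is only a rough sketch of how they might have been collected; the variable names and the exact setup are assumptions here, not the book's verbatim code:

# Sketch: collect an F1 score and one precision/recall curve per fold
from sklearn.cross_validation import KFold  # sklearn.model_selection in newer releases
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, f1_score

scores, precisions, recalls, thresholds = [], [], [], []
for train, test in KFold(n=len(X), n_folds=10, shuffle=True):
    clf = LogisticRegression()
    clf.fit(X[train], Y[train])
    proba = clf.predict_proba(X[test])[:, 1]  # probability of the "good" class
    scores.append(f1_score(Y[test], proba > 0.5))
    precision, recall, pr_thresholds = precision_recall_curve(Y[test], proba)
    precisions.append(precision)
    recalls.append(recall)
    thresholds.append(pr_thresholds)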
Setting the threshold at 0.59, we see that we can still achieve a precision of 80 percent in detecting good answers when we accept a low recall of 37 percent. That means that we would detect only one in three good answers as such. But from that third of good answers that we manage to detect, we would be reasonably sure that they are indeed good. For the rest, we could then politely display additional hints on how to improve answers in general.

To apply this threshold in the prediction process, we have to use predict_proba(), which returns per-class probabilities, instead of predict(), which returns the class itself:

>>> thresh80 = thresholds[idx80][0]
>>> probs_for_good = clf.predict_proba(answer_features)[:, 1]
>>> answer_class = probs_for_good > thresh80

We can confirm that we are in the desired precision/recall range using classification_report:

>>> from sklearn.metrics import classification_report
>>> print(classification_report(y_test, clf.predict_proba(X_test)[:, 1] > 0.63,
...     target_names=['not accepted', 'accepted']))
              precision    recall  f1-score   support

not accepted       0.59      0.85      0.70       101
    accepted       0.73      0.40      0.52        99

 avg / total       0.66      0.63      0.61       200

Note that using the threshold does not guarantee that we will always stay above the precision and recall values that we determined above together with it.

Slimming the classifier

It is always worth looking at the actual contributions of the individual features. For logistic regression, we can directly take the learned coefficients (clf.coef_) to get an impression of the features' impact.
The higher the coefficient of a feature, the more the feature plays a role in determining whether the post is good or not. Consequently, negative coefficients tell us that higher values of the corresponding feature are a stronger signal for the post to be classified as bad. We see that LinkCount, AvgWordLen, NumAllCaps, and NumExclams have the biggest impact on the overall classification decision, while NumImages (a feature that we sneaked in just for demonstration purposes a second ago) and AvgSentLen play a rather minor role. While the feature importance overall makes sense intuitively, it is surprising that NumImages is basically ignored. Normally, answers containing images are always rated high. In reality, however, answers very rarely contain images. So, although in principle it is a very powerful feature, it is too sparse to be of any value. We could easily drop that feature and retain the same classification performance.
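The following is a minimal sketch of how such an inspection could look. It assumes that feature_names lists the features in the same column order that was used to build the training matrix; this list is an illustration, not part of the book's code:

>>> # clf.coef_ has shape (1, n_features) for binary logistic regression;
>>> # pair each coefficient with its feature name and sort by magnitude.
>>> feature_names = ['LinkCount', 'NumTextTokens', 'NumCodeLines', 'AvgSentLen',
...                  'AvgWordLen', 'NumAllCaps', 'NumExclams', 'NumImages']
>>> for name, coef in sorted(zip(feature_names, clf.coef_[0]),
...                          key=lambda nc: -abs(nc[1])):
...     print("%-15s %+.2f" % (name, coef))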
Ship it!

Let's assume we want to integrate this classifier into our site. What we definitely do not want is to train the classifier each time we start the classification service.
Instead, we can simply serialize the classifier after training and then deserialize it on the site:

>>> import pickle
>>> pickle.dump(clf, open("logreg.dat", "wb"))
>>> clf = pickle.load(open("logreg.dat", "rb"))

Congratulations, the classifier is now ready to be used as if it had just been trained.

Summary

We made it! For a very noisy dataset, we built a classifier that suits part of our goal. Of course, we had to be pragmatic and adapt our initial goal to what was achievable. But on the way we learned about the strengths and weaknesses of nearest neighbor and logistic regression.
We learned how to extract features such as LinkCount, NumTextTokens, NumCodeLines, AvgSentLen, AvgWordLen, NumAllCaps, NumExclams, and NumImages, and how to analyze their impact on the classifier's performance. But what is even more valuable is that we learned an informed way of debugging badly performing classifiers. That will help us in the future to come up with usable systems much faster. After having looked into nearest neighbor and logistic regression, in the next chapter we will get familiar with yet another simple yet powerful classification algorithm: Naïve Bayes.
Along the way, we will also learn about some more convenient tools from scikit-learn.

Classification II – Sentiment Analysis

For companies, it is vital to closely monitor the public reception of key events, such as product launches or press releases. With Twitter's real-time access to and easy accessibility of user-generated content, it is now possible to do sentiment classification of tweets. Sometimes also called opinion mining, it is an active field of research in which several companies are already selling such services.
As this shows that there obviously exists a market, we have motivation to use the classification muscles we built in the last chapter to build our own home-grown sentiment classifier.

Sketching our roadmap

Sentiment analysis of tweets is particularly hard because of Twitter's size limit of 140 characters. This leads to a special syntax, creative abbreviations, and seldom well-formed sentences. The typical approach of analyzing sentences, aggregating their sentiment information per paragraph, and then calculating the overall sentiment of a document does not work here.

Clearly, we will not try to build a state-of-the-art sentiment classifier. Instead, we want to:

• Use this scenario as a vehicle to introduce yet another classification algorithm, Naïve Bayes
• Explain how Part Of Speech (POS) tagging works and how it can help us
• Show some more tricks from the scikit-learn toolbox that come in handy from time to time

Fetching the Twitter data

Naturally, we need tweets and their corresponding labels that tell whether a tweet contains a positive, negative, or neutral sentiment.
In this chapter, we will use the corpus from Niek Sanders, who has done an awesome job of manually labeling more than 5,000 tweets and has granted us permission to use it in this chapter. To comply with Twitter's terms of service, we will not provide any data from Twitter nor show any real tweets in this chapter. Instead, we can use Sanders' hand-labeled data, which contains the tweet IDs and their hand-labeled sentiment, and use his script, install.py, to fetch the corresponding Twitter data. As the script plays nice with Twitter's servers, it will take quite some time to download all the data for more than 5,000 tweets. So it is a good idea to start it right away.

The data comes with four sentiment labels:

>>> X, Y = load_sanders_data()
>>> classes = np.unique(Y)
>>> for c in classes: print("#%s: %i" % (c, sum(Y==c)))
#irrelevant: 490
#negative: 487
#neutral: 1952
#positive: 433

Inside load_sanders_data(), we are treating irrelevant and neutral labels together as neutral and dropping all non-English tweets, resulting in 3,362 tweets.
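A minimal sketch of the label merging done inside load_sanders_data() could look like this, assuming Y is a NumPy array of label strings (the non-English filtering is omitted):

>>> # Fold the "irrelevant" tweets into the "neutral" class, so that only
>>> # the three sentiment classes positive/negative/neutral remain.
>>> Y[Y == 'irrelevant'] = 'neutral'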
In case you get different counts here, it is because, in the meantime, tweets might have been deleted or set to private. In that case, you might also get slightly different numbers and graphs than the ones shown in the upcoming sections.

Introducing the Naïve Bayes classifier

Naïve Bayes is probably one of the most elegant machine learning algorithms out there that is of practical use. And despite its name, it is not that naïve when you look at its classification performance. It proves to be quite robust to irrelevant features, which it kindly ignores. It learns fast and predicts equally fast. It does not require lots of storage. So, why is it then called naïve?

The "naïve" was added to account for one assumption that is required for Naïve Bayes to work optimally.
The assumption is that the features do not influence each other. This, however, is rarely the case for real-world applications. Nevertheless, Naïve Bayes still returns very good accuracy in practice, even when the independence assumption does not hold.

Getting to know Bayes' theorem

At its core, Naïve Bayes classification is nothing more than keeping track of which feature gives evidence to which class. The way the features are designed determines the model that is used to learn. The so-called Bernoulli model only cares about Boolean features: whether a word occurs once or multiple times in a tweet does not matter. In contrast, the Multinomial model uses word counts as features. For the sake of simplicity, we will use the Bernoulli model to explain how to use Naïve Bayes for sentiment analysis. We will then use the Multinomial model later on to set up and tune our real-world classifiers.
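In scikit-learn, the two models correspond to the BernoulliNB and MultinomialNB classes. The following minimal sketch, using made-up toy tweets rather than the book's data, shows how the features would be prepared differently for the two models:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

tweets = ["awesome awesome phone", "crazy crazy bad service"]
labels = ["positive", "negative"]

# Bernoulli model: binary occurrence features (a word is present or not)
bool_features = CountVectorizer(binary=True).fit_transform(tweets)
bernoulli_clf = BernoulliNB().fit(bool_features, labels)

# Multinomial model: raw word counts as features
count_features = CountVectorizer().fit_transform(tweets)
multinomial_clf = MultinomialNB().fit(count_features, labels)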
Let's assume the following meanings for the variables that we will use to explain Naïve Bayes:

Variable   Meaning
$C$        The class of a tweet (positive or negative)
$F_1$      The word "awesome" occurs at least once in the tweet
$F_2$      The word "crazy" occurs at least once in the tweet

During training, we learn the Naïve Bayes model, which is the probability for a class $C$ when we already know the features $F_1$ and $F_2$.
This probability is written as $P(C|F_1, F_2)$.

Since we cannot estimate $P(C|F_1, F_2)$ directly, we apply a trick, which was found out by Bayes:

$$P(A|B) \cdot P(B) = P(B|A) \cdot P(A)$$

If we substitute $A$ with the probability of both words "awesome" and "crazy" and think of $B$ as being our class $C$, we arrive at the relationship that helps us to later retrieve the probability for the data instance belonging to the specified class:

$$P(F_1, F_2|C) \cdot P(C) = P(C|F_1, F_2) \cdot P(F_1, F_2)$$

This allows us to express $P(C|F_1, F_2)$ by means of the other probabilities:

$$P(C|F_1, F_2) = \frac{P(F_1, F_2|C) \cdot P(C)}{P(F_1, F_2)}$$

We could also describe this as follows:

$$\mathit{posterior} = \frac{\mathit{likelihood} \cdot \mathit{prior}}{\mathit{evidence}}$$

The prior and the evidence are easily determined:

• $P(C)$ is the prior probability of class $C$ without knowing about the data. We can estimate this quantity by simply calculating the fraction of all training data instances belonging to that particular class.
• $P(F_1, F_2)$ is the evidence, or the probability of features $F_1$ and $F_2$.
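As a tiny illustration of the prior, we can estimate $P(C)$ for every class as its relative frequency in the training labels. This is just a sketch, reusing the Y array loaded earlier with load_sanders_data():

>>> import numpy as np
>>> # P(C): fraction of training tweets carrying label c
>>> for c in np.unique(Y):
...     print("P(%s) = %.2f" % (c, np.mean(Y == c)))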