There are different kinds of Naïve Bayes classifiers:

• GaussianNB: This classifier assumes the features to be normally distributed (Gaussian). One use case for it could be the classification of sex given the height and weight of a person. In our case, we are given tweet texts from which we extract word counts. These are clearly not Gaussian distributed.
• MultinomialNB: This classifier assumes the features to be occurrence counts, which is our case going forward, since we will be using word counts in the tweets as features. In practice, this classifier also works well with TF-IDF vectors.
• BernoulliNB: This classifier is similar to MultinomialNB, but is more suited to binary word occurrences instead of word counts.

As we will mainly look at word occurrences, the MultinomialNB classifier is best suited for our purpose.
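To make the distinction concrete, here is a minimal sketch (with made-up data, not from the chapter's code) that fits each variant on features matching its assumption:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 1, 0, 1])

# GaussianNB: continuous, roughly normally distributed features
X_cont = np.array([[170.0, 60.0], [180.0, 80.0],
                   [165.0, 55.0], [185.0, 85.0]])
print(GaussianNB().fit(X_cont, y).predict([[175.0, 70.0]]))

# MultinomialNB: non-negative occurrence counts, e.g. word counts per tweet
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 1]]))

# BernoulliNB: binary occurrences (did the word appear at all?)
X_binary = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_binary, y).predict([[1, 0, 1]]))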
Solving an easy problem first

As we have seen when we looked at our tweet data, the tweets are not only positive or negative. The majority of tweets actually do not contain any sentiment, but are neutral or irrelevant, containing, for instance, raw information (for example, "New book: Building Machine Learning … http://link"). This leads to four classes. To not complicate the task too much, let's focus only on the positive and negative tweets for now.

>>> # first create a Boolean list having True for tweets
>>> # that are either positive or negative
>>> pos_neg_idx = np.logical_or(Y=="positive", Y=="negative")
>>> # now use that index to filter the data and the labels
>>> X = X[pos_neg_idx]
>>> Y = Y[pos_neg_idx]
>>> # finally, convert the labels themselves into Boolean
>>> Y = Y=="positive"

Now, we have in X the raw tweet texts and in Y the binary classification, 0 for negative and 1 for positive tweets.

We just said that we will use word occurrence counts as features.
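To see what word occurrence counts look like as features, here is a minimal sketch (with made-up tweets, for illustration only; it is not part of the chapter's pipeline) using scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

tweets = ["I love this book", "I hate hate this weather"]
vectorizer = CountVectorizer()
# each row is a tweet, each column the count of one vocabulary word
counts = vectorizer.fit_transform(tweets)

print(vectorizer.get_feature_names())
# ['book', 'hate', 'love', 'this', 'weather']
print(counts.toarray())
# [[1 0 1 1 0]
#  [0 2 0 1 1]]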
We will not use the counts in their raw form, though. Instead, we will use our workhorse TfidfVectorizer to convert the raw tweet text into TF-IDF feature values, which we then use together with the labels to train our first classifier. For convenience, we will use the Pipeline class, which allows us to hook the vectorizer and the classifier together and provides the same interface:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def create_ngram_model():
    tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3),
                                   analyzer="word", binary=False)
    clf = MultinomialNB()
    return Pipeline([('vect', tfidf_ngrams), ('clf', clf)])

The Pipeline instance returned by create_ngram_model() can now be used to fit and predict as if we had a normal classifier.
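For example, a minimal sketch (with made-up tweets, for illustration only) of that classifier-like interface:

clf = create_ngram_model()
# train on two made-up labeled tweets (True = positive sentiment)
clf.fit(["awesome book", "terrible book"], [True, False])
# the fitted pipeline vectorizes and predicts in one go
print(clf.predict(["what an awesome read"]))  # -> [ True]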
Since we do not have that much data, we should do cross-validation. This time, however, we will not use KFold, which partitions the data into consecutive folds; instead, we will use ShuffleSplit. It shuffles the data for us, but does not prevent the same data instance from being in multiple folds. For each fold, we then keep track of the area under the precision-recall curve and of the accuracy.

To keep our experimentation agile, let's wrap everything together in a train_model() function, which takes as a parameter a function that creates the classifier:

import numpy as np
from sklearn.metrics import precision_recall_curve, auc
from sklearn.cross_validation import ShuffleSplit

def train_model(clf_factory, X, Y):
    # setting random_state to get deterministic behavior
    cv = ShuffleSplit(n=len(X), n_iter=10, test_size=0.3,
                      random_state=0)

    scores = []
    pr_scores = []

    for train, test in cv:
        X_train, y_train = X[train], Y[train]
        X_test, y_test = X[test], Y[test]

        clf = clf_factory()
        clf.fit(X_train, y_train)

        train_score = clf.score(X_train, y_train)
        test_score = clf.score(X_test, y_test)

        scores.append(test_score)

        proba = clf.predict_proba(X_test)
        precision, recall, pr_thresholds = precision_recall_curve(
            y_test, proba[:, 1])
        pr_scores.append(auc(recall, precision))

    summary = (np.mean(scores), np.std(scores),
               np.mean(pr_scores), np.std(pr_scores))
    print("%.3f\t%.3f\t%.3f\t%.3f" % summary)

Putting everything together, we can train our first model:

>>> X, Y = load_sanders_data()
>>> pos_neg_idx = np.logical_or(Y=="positive", Y=="negative")
>>> X = X[pos_neg_idx]
>>> Y = Y[pos_neg_idx]
>>> Y = Y=="positive"
>>> train_model(create_ngram_model, X, Y)
0.788   0.024   0.882   0.036

With our first try using Naïve Bayes on vectorized TF-IDF trigram features, we get an accuracy of 78.8 percent and an average P/R AUC of 88.2 percent.
Looking at the P/R chart of the median (the train/test split that performs closest to the average), it shows a much more encouraging behavior than the plots we have seen in the previous chapter.

For a start, the results are quite encouraging. They get even more impressive when we realize that 100 percent accuracy is probably never achievable in a sentiment classification task.
For some tweets, even humans often do not agree on the same classification label.

Using all classes

Once again, we simplified our task a bit, since we used only positive or negative tweets. That means we assumed a perfect classifier that upfront determined whether a tweet contains a sentiment at all and forwarded only those tweets to our Naïve Bayes classifier.

So, how well do we perform if we also classify whether a tweet contains any sentiment at all? To find that out, let's first write a convenience function that returns a modified class array, given a list of sentiments that we would like to interpret as positive:

def tweak_labels(Y, pos_sent_list):
    pos = Y==pos_sent_list[0]
    for sent_label in pos_sent_list[1:]:
        pos |= Y==sent_label

    Y = np.zeros(Y.shape[0])
    Y[pos] = 1
    Y = Y.astype(int)

    return Y

Note that we are now talking about two different positives.
The sentiment of a tweet can be positive, which is to be distinguished from the class of the training data. If, for example, we want to find out how well we can separate tweets having sentiment from neutral ones, we could do:

>>> Y = tweak_labels(Y, ["positive", "negative"])

In Y we now have 1 (positive class) for all tweets that are either positive or negative, and 0 (negative class) for neutral and irrelevant ones.

>>> train_model(create_ngram_model, X, Y)
0.750   0.012   0.659   0.023

As expected, the P/R AUC drops considerably, being only 66 percent now.
The accuracy is still high, but that is only due to the fact that we have a highly imbalanced dataset. Out of 3,362 total tweets, only 920 are either positive or negative, which is about 27 percent. This means that if we created a classifier that always classified a tweet as not containing any sentiment, we would already have an accuracy of 73 percent.
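A quick back-of-the-envelope check of that baseline, using the counts just mentioned (plain arithmetic, not part of the chapter's code):

import numpy as np

# 920 tweets carry sentiment (class 1); the remaining 2,442 do not (class 0)
y = np.array([1] * 920 + [0] * (3362 - 920))

# a "classifier" that always predicts class 0 is right whenever y == 0
print("baseline accuracy: %.3f" % np.mean(y == 0))  # -> 0.726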
This is another case showing why you should always look at precision and recall if the training and test data is imbalanced.

So, how will the Naïve Bayes classifier perform on classifying positive tweets versus the rest and negative tweets versus the rest? One word: bad.

== Pos vs. rest ==
0.873   0.009   0.305   0.026
== Neg vs. rest ==
0.861   0.006   0.497   0.026

Pretty unusable, if you ask me. Looking at the P/R curves for both tasks, we also find no usable precision/recall trade-off, as we were able to in the last chapter.

Tuning the classifier's parameters

Certainly, we have not explored the current setup enough and should investigate more. There are roughly two areas where we can play with the knobs: TfidfVectorizer and MultinomialNB. As we have no real intuition about which area we should explore, let's sweep a range of sensible values for each parameter.

We will look at the TfidfVectorizer parameters first:

• Using different settings for n-grams:
  ° unigrams (1,1)
  ° unigrams and bigrams (1,2)
  ° unigrams, bigrams, and trigrams (1,3)
• Playing with min_df: 1 or 2
• Exploring the impact of IDF within TF-IDF using use_idf and smooth_idf: False or True
• Whether to remove stop words or not, by setting stop_words to english or None
• Whether to use the logarithm of the word counts (sublinear_tf)
• Whether to track word counts or simply track whether words occur or not, by setting binary to True or False

Now we will look at the MultinomialNB classifier's parameters:

• Which smoothing method to use, by setting alpha:
  ° Add-one or Laplace smoothing: 1
  ° Lidstone smoothing: 0.01, 0.05, 0.1, or 0.5
  ° No smoothing: 0

A simple approach could be to train a classifier for each of these reasonable values while keeping the other parameters constant, and then check the classifier's results. As we do not know whether those parameters affect each other, doing it right would require that we train a classifier for every possible combination of all parameter values.
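To get a feeling for how big that search space is, here is a quick count of the value combinations listed above (plain arithmetic, not part of the chapter's code):

# number of candidate values per parameter, in the order listed above
n_values = [
    3,  # ngram_range: (1,1), (1,2), (1,3)
    2,  # min_df: 1 or 2
    2,  # use_idf: False or True
    2,  # smooth_idf: False or True
    2,  # stop_words: "english" or None
    2,  # sublinear_tf: False or True
    2,  # binary: True or False
    6,  # alpha: 0, 0.01, 0.05, 0.1, 0.5, or 1
]

n_combinations = 1
for n in n_values:
    n_combinations *= n
print(n_combinations)  # -> 1152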
Obviously, this is too tedious for us. Because this kind of parameter exploration occurs frequently in machine learning tasks, scikit-learn has a dedicated class for it, called GridSearchCV. It takes an estimator (an instance with a classifier-like interface), which will be the Pipeline instance in our case, and a dictionary of parameters with their potential values.

GridSearchCV expects the dictionary's keys to obey a certain format so that it is able to set the parameters of the correct estimator. The format is as follows:

<estimator>__<subestimator>__...__<param_name>

For example, if we want to specify the desired values to explore for the ngram_range parameter of TfidfVectorizer (named vect in the Pipeline description), we would have to say:

param_grid = {"vect__ngram_range": [(1, 1), (1, 2), (1, 3)]}

This will tell GridSearchCV to try out unigrams to trigrams as parameter values for the ngram_range parameter of TfidfVectorizer.

Then, it trains the estimator with all possible parameter/value combinations. Here, we make sure that it trains on random samples of the training data by using ShuffleSplit, which generates an iterator of random train/test splits.
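Putting the pieces together, a minimal sketch of what such a grid search could look like, assuming the create_ngram_model() pipeline from above and the same pre-0.18 scikit-learn modules used throughout this chapter (the reduced param_grid is just for illustration):

from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import ShuffleSplit

# a deliberately reduced grid, just to show the mechanics
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vect__min_df": [1, 2],
    "clf__alpha": [0, 0.01, 0.05, 0.1, 0.5, 1],
}

cv = ShuffleSplit(n=len(X), n_iter=10, test_size=0.3, random_state=0)
grid_search = GridSearchCV(create_ngram_model(),
                           param_grid=param_grid, cv=cv)
grid_search.fit(X, Y)

print(grid_search.best_params_)
print(grid_search.best_score_)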