Of course, we will never know whether there is the one golden feature we just did not happen to think of. But for now, let's move on to another classification method that is known to work great in text-based classification scenarios.

Using logistic regression

Contrary to its name, logistic regression is a classification method.
It is a very powerful one when it comes to text-based classification; it achieves this by first doing a regression on a logistic function, hence the name.

A bit of math with a small example

To get an initial understanding of the way logistic regression works, let's first take a look at the following example, where we have artificial feature values X plotted with the corresponding classes, 0 or 1. As we can see, the data is noisy, such that the classes overlap in the feature value range between 1 and 6. Therefore, it is better to not directly model the discrete classes, but rather the probability that a feature value belongs to class 1, P(X).
Once we possess such a model, we could then predict class 1 if P(X) > 0.5, and class 0 otherwise.

Mathematically, it is always difficult to model something that has a finite range, as is the case here with our discrete labels 0 and 1. We can, however, tweak the probabilities a bit so that they always stay between 0 and 1.
And for that, we will need the odds ratio and the logarithm of it.

Let's say a feature has the probability of 0.9 that it belongs to class 1, P(y=1) = 0.9. The odds ratio is then P(y=1)/P(y=0) = 0.9/0.1 = 9. We could say that the chance is 9:1 that this feature maps to class 1. If P(y=1) = 0.5, we would consequently have a 1:1 chance that the instance is of class 1. The odds ratio is bounded below by 0, but grows to infinity (the left graph in the following set of graphs). If we now take the logarithm of it, we can map all probabilities between 0 and 1 to the full range from negative to positive infinity (the right graph in the following set of graphs).
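To make this concrete, here is a tiny sketch (not part of the book's code) that replays the arithmetic in NumPy for a handful of made-up probabilities:

import numpy as np

# Made-up example probabilities for P(y=1), for illustration only.
probs = np.array([0.1, 0.5, 0.9, 0.99])

odds = probs / (1 - probs)    # odds ratio P(y=1)/P(y=0): bounded below by 0
log_odds = np.log(odds)       # log of odds: covers the whole real line

for p, o, lo in zip(probs, odds, log_odds):
    print("P(y=1)=%.2f  odds=%6.2f  log(odds)=%6.2f" % (p, o, lo))

As expected, P(y=1)=0.5 maps to odds of 1 and a log of odds of 0, while probabilities close to 0 or 1 are pushed far down or up the real line.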
The nice thing is that we still maintain the relationship that a higher probability leads to a higher log of odds, just not limited to 0 and 1 anymore.

This means that we can now fit linear combinations of our features (OK, we only have one and a constant, but that will change soon) to the log(odds) values. In a sense, we replace the linear model from Chapter 1, Getting Started with Python Machine Learning, $y_i = c_0 + c_1 x_i$, with the following (replacing $y_i$ with the log of the odds):

$$\log\left(\frac{p_i}{1 - p_i}\right) = c_0 + c_1 x_i$$

We can solve this for $p_i$, so that we have:

$$p_i = \frac{1}{1 + e^{-(c_0 + c_1 x_i)}}$$

We simply have to find the right coefficients, such that the formula gives the lowest errors for all our $(x_i, p_i)$ pairs in our data set, but that will be done by scikit-learn. After fitting, the formula will give the probability for every new data point x that it belongs to class 1:

>>> from sklearn.linear_model import LogisticRegression
>>> clf = LogisticRegression()
>>> print(clf)
LogisticRegression(C=1.0, class_weight=None, dual=False,
    fit_intercept=True, intercept_scaling=1, penalty='l2', tol=0.0001)
>>> clf.fit(X, y)
>>> print(np.exp(clf.intercept_), np.exp(clf.coef_.ravel()))
[ 0.09437188] [ 1.80094112]
>>> def lr_model(clf, X):
...     return 1 / (1 + np.exp(-(clf.intercept_ + clf.coef_ * X)))
>>> print("P(x=-1)=%.2f\tP(x=7)=%.2f" % (lr_model(clf, -1), lr_model(clf, 7)))
P(x=-1)=0.05    P(x=7)=0.85

You might have noticed that scikit-learn exposes the first coefficient through the special field intercept_.

If we plot the fitted model, we see that it makes perfect sense given the data:
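The resulting plot is shown as a figure in the book. A minimal sketch of how such a plot could be reproduced, assuming the one-dimensional artificial feature values in X, their 0/1 labels in y, and the clf fitted above, might look like this:

import numpy as np
import matplotlib.pyplot as plt

# Assumed: X is the 1-D artificial feature, y its 0/1 labels, clf the fitted model.
xs = np.linspace(np.min(X) - 1, np.max(X) + 1, 200)
ps = 1 / (1 + np.exp(-(clf.intercept_ + clf.coef_.ravel() * xs)))  # model's P(y=1|x)

plt.scatter(np.ravel(X), y, label="data (classes 0 and 1)")
plt.plot(xs, ps, label="fitted P(y=1|x)")
plt.axhline(0.5, linestyle="--", label="decision threshold 0.5")
plt.xlabel("feature value")
plt.ylabel("class / probability")
plt.legend(loc="best")
plt.show()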
Applying logistic regression to our post classification problem

Admittedly, the example in the previous section was created to show the beauty of logistic regression. How does it perform on the real, noisy data?

Comparing it to the best nearest neighbor classifier (k=40) as a baseline, we see that it performs a bit better, but also won't change the situation a whole lot.

Method            mean(scores)    stddev(scores)
LogReg C=0.1      0.64650         0.03139
LogReg C=1.00     0.64650         0.03155
LogReg C=10.00    0.64550         0.03102
LogReg C=0.01     0.63850         0.01950
40NN              0.62800         0.03750

We have shown the accuracy for different values of the regularization parameter C. With it, we can control the model complexity, similar to the parameter k for the nearest neighbor method. Smaller values for C result in more penalization of the model complexity.
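A table like the preceding one can be reproduced with a plain cross-validation loop. The following is only a sketch of the idea: it assumes the feature matrix X and the labels Y built earlier in the chapter, and a generic 10-fold cross-validation rather than the chapter's exact splitting setup:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in old releases

# Assumed: X holds the answer features and Y the good/bad labels from earlier.
for C in [0.01, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(C=C)
    scores = cross_val_score(clf, X, Y, cv=10, scoring="accuracy")
    print("LogReg C=%5.2f  mean=%.5f  stddev=%.5f"
          % (C, scores.mean(), scores.std()))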
A quick look at the bias-variance chart for one of our best candidates, C=0.1, shows that our model has high bias: the test and train error curves approach each other closely, but stay at unacceptably high values. This indicates that logistic regression with the current feature space is under-fitting and cannot learn a model that captures the data correctly.

So what now? We switched the model and tuned it as much as we could with our current state of knowledge, but we still have no acceptable classifier. More and more it seems that either the data is too noisy for this task, or that our set of features is still not appropriate to discriminate the classes well enough.
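As an aside, a chart like the bias-variance one just discussed can be approximated with scikit-learn's learning_curve helper. Again, this is only a sketch that assumes the chapter's X and Y:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve  # sklearn.learning_curve in old releases

# Assumed: X and Y are the chapter's features and labels.
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(C=0.1), X, Y, cv=10,
    train_sizes=np.linspace(0.1, 1.0, 10), scoring="accuracy")

# Plot error (1 - accuracy) so the curves read like the book's error charts.
plt.plot(train_sizes, 1 - train_scores.mean(axis=1), label="train error")
plt.plot(train_sizes, 1 - test_scores.mean(axis=1), label="test error")
plt.xlabel("training set size")
plt.ylabel("error")
plt.legend(loc="best")
plt.show()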
Looking behind accuracy – precision and recall

Let's step back and think again about what we are trying to achieve here. Actually, we do not need a classifier that perfectly predicts good and bad answers, as we have measured it until now using accuracy. If we can tune the classifier to be particularly good at predicting one class, we could adapt the feedback to the user accordingly. If we, for example, had a classifier that was always right when it predicted an answer to be bad, we would give no feedback until the classifier detected the answer to be bad. Conversely, if the classifier excelled at predicting answers to be good, we could show helpful comments to the user at the beginning and remove them once the classifier said that the answer is a good one.

To find out which of these situations we are in, we have to understand how to measure precision and recall. And to understand that, we have to look into the four distinct classification results, as they are described in the following table:

                          Classified as
                          Positive                Negative
In reality   Positive     True positive (TP)      False negative (FN)
it is        Negative     False positive (FP)     True negative (TN)

For instance, if the classifier predicts an instance to be positive, and the instance indeed is positive in reality, this is a true positive instance.
If, on the other hand, the classifier misclassifies that instance, saying that it is negative while in reality it was positive, that instance is said to be a false negative.

What we want is to have a high success rate when we are predicting a post as either good or bad, but not necessarily both. That is, we want as many true positives as possible. This is what precision captures:

$$\text{Precision} = \frac{TP}{TP + FP}$$

If instead our goal had been to detect as many good or bad answers as possible, we would be more interested in recall:

$$\text{Recall} = \frac{TP}{TP + FN}$$

In terms of the following graphic, precision is the fraction of the right circle covered by the intersection, while recall is the fraction of the left circle covered by the intersection.
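In code, these four counts can be read off scikit-learn's confusion_matrix, from which precision and recall follow directly. This is a small sketch, assuming the held-out X_test and y_test and the fitted clf from the chapter:

from sklearn.metrics import confusion_matrix

# Assumed: y_test holds the true 0/1 labels and clf is the classifier fitted earlier.
y_pred = clf.predict(X_test)

# With labels=[0, 1], rows are the true class and columns the predicted class.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

precision = tp / float(tp + fp)   # how often a positive prediction is correct
recall = tp / float(tp + fn)      # how many of the real positives we actually find
print("precision=%.2f  recall=%.2f" % (precision, recall))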
So, how can we now optimize for precision? Up to now, we have always used 0.5 as the threshold to decide whether an answer is good or not. What we can do now is count the number of TP, FP, and FN while varying that threshold between 0 and 1. With those counts, we can then plot precision over recall.

The handy function precision_recall_curve() from the metrics module does all the calculations for us:

>>> from sklearn.metrics import precision_recall_curve
>>> precision, recall, thresholds = precision_recall_curve(y_test,
...     clf.predict(X_test))
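Note that precision_recall_curve() is usually fed class probabilities or decision scores rather than hard 0/1 predictions, so that sweeping the threshold actually traces out a curve. A plotting sketch under that assumption, reusing clf, X_test, and y_test, might look like this:

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Assumed: clf, X_test, and y_test as above; we use the predicted probability
# of class 1 so that varying the threshold yields more than a single point.
proba = clf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)

plt.plot(recall, precision)
plt.xlabel("recall")
plt.ylabel("precision")
plt.title("precision/recall curve")
plt.show()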
Predicting one class with acceptable performance does not always mean that the classifier is also acceptable at predicting the other class. This can be seen in the following two plots, where we plot the precision/recall curves for classifying bad (the left graph) and good (the right graph) answers.

In the graphs, we have also included a much better description of a classifier's performance: the area under curve (AUC). It can be understood as the average precision of the classifier and is a great way of comparing different classifiers.

We see that we can basically forget predicting bad answers (the left plot): precision drops already at very low recall values and stays at an unacceptably low 60 percent. Predicting good answers, however, shows that we can get above 80 percent precision at a recall of almost 40 percent.
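For reference, an AUC value like the one shown in these plots can be computed from the curve points with auc(), or summarized with average_precision_score(). A sketch, again assuming y_test and the class 1 probabilities on the test set:

from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Assumed: clf, X_test, and y_test as before.
proba = clf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)

# Area under the precision/recall curve (trapezoidal rule over the curve points).
print("AUC of the P/R curve: %.2f" % auc(recall, precision))

# Closely related summary: precision averaged over the achieved recall levels.
print("Average precision:    %.2f" % average_precision_score(y_test, proba))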