Finally, we will average the scores on the test set of each fold and see how much they vary using the standard deviation:

from sklearn.cross_validation import KFold
scores = []

cv = KFold(n=len(X), k=10, indices=True)
for train, test in cv:
    X_train, y_train = X[train], Y[train]
    X_test, y_test = X[test], Y[test]
    clf = neighbors.KNeighborsClassifier()
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

print("Mean(scores)=%.5f\tStddev(scores)=%.5f" \
      % (np.mean(scores), np.std(scores)))

Here is the output:

Mean(scores)=0.50250    Stddev(scores)=0.05559
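Note that the sklearn.cross_validation module used here only exists in older scikit-learn releases; it was later removed in favor of sklearn.model_selection. If you run the example with a recent scikit-learn, an equivalent of the loop above might look like the following sketch (X and Y are the feature matrix and labels built earlier; this is not the book's original code):

import numpy as np
from sklearn import neighbors
from sklearn.model_selection import KFold

scores = []
# 10-fold cross-validation with the modern API
for train, test in KFold(n_splits=10).split(X):
    clf = neighbors.KNeighborsClassifier()
    clf.fit(X[train], Y[train])
    scores.append(clf.score(X[test], Y[test]))

print("Mean(scores)=%.5f\tStddev(scores)=%.5f"
      % (np.mean(scores), np.std(scores)))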
Now that is far from being usable. With only about 50 percent accuracy, it is no better than tossing a coin. Apparently, the number of links in a post is not a very good indicator of its quality. We say that this feature does not have much discriminative power, at least not for kNN with k=5.

Designing more features

In addition to using the number of hyperlinks as a proxy for a post's quality, the number of code lines is possibly another good one. At the very least, it indicates that the post's author is interested in answering the question. We can find the code embedded in <pre>…</pre> tags. Once we have extracted it, we count the number of words in the post while ignoring code lines:

def extract_features_from_body(s):
    num_code_lines = 0
    link_count_in_code = 0
    code_free_s = s

    # remove source code and count how many lines it has
    for match_str in code_match.findall(s):
        num_code_lines += match_str.count('\n')
        code_free_s = code_match.sub("", code_free_s)

        # Sometimes source code contains links,
        # which we don't want to count
        link_count_in_code += len(link_match.findall(match_str))

    links = link_match.findall(s)
    link_count = len(links)
    link_count -= link_count_in_code

    html_free_s = re.sub(" +", " ",
                         tag_match.sub('', code_free_s)).replace("\n", "")

    # remove links from text before counting words
    link_free_s = html_free_s
    for link in links:
        if link.lower().startswith("http://"):
            link_free_s = link_free_s.replace(link, '')

    num_text_tokens = link_free_s.count(" ")

    return num_text_tokens, num_code_lines, link_count
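This function relies on the precompiled regular expressions code_match, link_match, and tag_match, which the chapter defines earlier and which are not part of this excerpt. Plausible definitions might look roughly like the following sketch (the book's exact patterns may differ):

import re

# source code is wrapped in <pre>...</pre> blocks
code_match = re.compile('<pre>(.*?)</pre>',
                        re.MULTILINE | re.DOTALL)
# hyperlinks in the post body
link_match = re.compile('<a href="http://.*?".*?>(.*?)</a>',
                        re.MULTILINE | re.DOTALL)
# any remaining HTML tag
tag_match = re.compile('<[^>]*>',
                       re.MULTILINE | re.DOTALL)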
Plotting the value distributions of the new features, we notice that at least the number of words in a post shows higher variability. Training on the bigger feature space improves accuracy quite a bit:

Mean(scores)=0.59800    Stddev(scores)=0.02600

But still, this would mean that we classify roughly 4 out of 10 posts wrongly. At least we are going in the right direction. More features lead to higher accuracy, which leads us to adding even more features. So let's extend the feature space with the following:

• AvgSentLen: This measures the average number of words in a sentence. Maybe there is a pattern that particularly good posts don't overload the reader's brain with overly long sentences?
• AvgWordLen: Similar to AvgSentLen, this feature measures the average number of characters in the words of a post.
• NumAllCaps: This measures the number of words that are written entirely in uppercase, which is considered bad style.
• NumExclams: This measures the number of exclamation marks.

The charts at this point show the value distributions for average sentence and word lengths as well as the number of uppercase words and exclamation marks. With these four additional features, we now have seven features representing the individual posts.
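The code that computes these four features is not shown in this excerpt; a minimal sketch of how they might be derived from the HTML-free text could look like this (the function name and the sentence-splitting heuristic are made up for illustration, not taken from the book):

import re

def extract_style_features(text):
    # split into sentences on '.', '!' and '?' (a rough heuristic)
    sentences = [s for s in re.split(r'[.!?]', text) if s.strip()]
    words = text.split()

    avg_sent_len = (float(sum(len(s.split()) for s in sentences)) / len(sentences)
                    if sentences else 0.0)
    avg_word_len = (float(sum(len(w) for w in words)) / len(words)
                    if words else 0.0)
    num_all_caps = sum(1 for w in words if len(w) > 1 and w.isupper())
    num_exclams = text.count('!')

    return avg_sent_len, avg_word_len, num_all_caps, num_exclams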
Let's see how we progress:

Mean(scores)=0.61400    Stddev(scores)=0.02154

Now, that's interesting. We added four more features and got only a marginal improvement in return. How can that be?

To understand this, we have to remind ourselves how kNN works. Our 5NN classifier determines the class of a new post by calculating the seven aforementioned features, LinkCount, NumTextTokens, NumCodeLines, AvgSentLen, AvgWordLen, NumAllCaps, and NumExclams, and then finding the five nearest other posts. The new post's class is then the majority of the classes of those nearest posts.
The nearest posts are determined by calculating the Euclidean distance (as we did not specify otherwise, the classifier was initialized with the default p=2, the exponent of the Minkowski distance). That means all seven features are weighted equally in the distance. kNN does not learn that, for instance, NumTextTokens is good to have but much less important than NumLinks.
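We can check which distance settings the classifier actually uses by inspecting its parameters (a quick sanity check, not part of the book's code):

from sklearn import neighbors

clf = neighbors.KNeighborsClassifier()
params = clf.get_params()
# by default, the metric is Minkowski with p=2, i.e. the Euclidean distance
print(params['n_neighbors'], params['metric'], params['p'])
# output: 5 minkowski 2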
Let's consider the following two posts, A and B, which differ only in the following features, and see how they compare to a new post:

Post    NumLinks    NumTextTokens
A       2           20
B       0           25
new     1           23

Although we would think that links provide more value than mere text, post B would be considered more similar to the new post than post A. Clearly, kNN has a hard time correctly using the available data.
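A quick calculation with the table's values confirms this (illustration only):

import numpy as np

A = np.array([2, 20])      # NumLinks, NumTextTokens
B = np.array([0, 25])
new = np.array([1, 23])

print(np.linalg.norm(A - new))   # sqrt(10), about 3.16
print(np.linalg.norm(B - new))   # sqrt(5), about 2.24 -> B is "nearer" to new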
Deciding how to improve

To improve on this, we basically have the following options:

• Add more data: Maybe there is just not enough data for the learning algorithm and we should simply add more training data?
• Play with the model complexity: Maybe the model is not complex enough? Or maybe it is already too complex? In that case, we could decrease k so that fewer nearest neighbors are taken into account, making the classifier better at predicting non-smooth data. Or we could increase it to achieve the opposite.
• Modify the feature space: Maybe we do not have the right set of features? We could, for example, change the scale of our current features or design even more new features. Or should we rather remove some of our current features in case some features are aliasing others?
• Change the model: Maybe kNN is in general not a good fit for our use case, such that it will never be capable of achieving good prediction performance, no matter how complex we allow it to be and how sophisticated the feature space becomes?

In real life, at this point, people often try to improve the current performance by picking one of these options at random and trying them out in no particular order, hoping to find the golden configuration by chance. We could do the same here, but it will surely take longer than making informed decisions.
Let's take the informed route, for which we need to introduce the bias-variance tradeoff.

Bias-variance and their tradeoff

In Chapter 1, Getting Started with Python Machine Learning, we fitted polynomials of different complexities, controlled by the degree parameter d, to the data. We realized that a degree-1 polynomial, a straight line, does not fit the example data very well, because the data is not of a linear nature. No matter how elaborate our fitting procedure had been, the linear model would see everything as a straight line.
We say that it is too biased for the data at hand; it is under-fitting. We played a bit with the degree and found out that a degree-100 polynomial fits the data on which it was trained very well (we did not know about train-test splits at that time). However, we quickly found out that it fits too well. We realized that it was over-fitting so badly that, with different samples of the data points, we would have gotten totally different degree-100 polynomials. We say that the model has too high a variance for the given data, or that it is over-fitting.

These are the extremes between which most of our machine learning problems reside.
Ideally, we want to have both low bias and low variance. But we live in a bad world and have to trade off between them: if we improve on one, we will likely get worse on the other.

Fixing high bias

Let's now assume we suffer from high bias. In that case, adding more training data clearly does not help. Also, removing features surely will not help, as our model is probably already overly simplistic. The only possibilities we have in this case are to get more features, make the model more complex, or change the model.

Fixing high variance

If, on the contrary, we suffer from high variance, that means our model is too complex for the data. In this case, we can only try to get more data or decrease the complexity.
This would mean increasing k so that more neighbors are taken into account, or removing some of the features.

High bias or low bias

To find out which problem we actually have, we simply have to plot the train and test errors over the dataset size. High bias is typically revealed by the test error decreasing a bit at the beginning but then settling at a very high value, with the train error approaching it as the dataset grows. High variance is recognized by a big gap between both curves.
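The book plots exactly such curves next; a sketch of how the train and test errors for growing training set sizes could be computed for the 5NN classifier might look like this (X and Y are the feature matrix and labels from before; plotting the resulting lists with matplotlib is left out):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# hold out a fixed test set, then train on growing slices of the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0)

sizes, train_errors, test_errors = [], [], []
for frac in np.linspace(0.1, 1.0, 10):
    n = int(frac * len(X_train))
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X_train[:n], y_train[:n])
    sizes.append(n)
    train_errors.append(1.0 - clf.score(X_train[:n], y_train[:n]))
    test_errors.append(1.0 - clf.score(X_test, y_test))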
Plotting the errors for different dataset sizes for 5NN shows a big gap between train and test errors, hinting at a high variance problem. Looking at the graph, we immediately see that adding more training data will not help, as the dashed line corresponding to the test error seems to stay above 0.4. The only option we have is to decrease the complexity, either by increasing k or by reducing the feature space.
Reducing the feature space does not help here. We can easily confirm this by plotting the graph for a simplified feature space of only LinkCount and NumTextTokens; we get similar graphs for other small feature subsets. No matter which subset of features we take, the graph looks similar.

At least reducing the model complexity by increasing k shows some positive impact:

k     mean(scores)    stddev(scores)
40    0.62800         0.03750
10    0.62000         0.04111
5     0.61400         0.02154
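A sketch of how the numbers in this table might be reproduced with scikit-learn's convenience function (illustrative only, not the book's code):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for k in [40, 10, 5]:
    # 10-fold cross-validated accuracy for each choice of k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, Y, cv=10)
    print("k=%d\tmean(scores)=%.5f\tstddev(scores)=%.5f"
          % (k, scores.mean(), scores.std()))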
But it is not enough, and it also comes at the price of lower classification runtime performance. Take, for instance, k=40, where we have the lowest test error of the values we tried. To classify a new post, we would need to find the 40 nearest other posts to decide whether the new post is a good one or not.

Clearly, nearest neighbor seems to be problematic for our scenario. And it has another real disadvantage. Over time, we will get more and more posts into our system. As the nearest neighbor method is an instance-based approach, we will have to store all posts in our system. The more posts we get, the slower the prediction will be. This is different with model-based approaches, which try to derive a model from the data.

There we are, with enough reasons now to abandon the nearest neighbor approach and look for better options in the classification world.