… If 1, then it is a wiki question.
Title (String): This is the title of the question (missing for answers).
AcceptedAnswerId (Id): This is the ID for the accepted answer (missing for answers).
CommentCount (Integer): This is the number of comments for the post.
CreationDate: This is the date of submission.
Body: This is the complete post as encoded HTML text.

Slimming the data down to chewable chunks

To speed up our experimentation phase, we should not try to evaluate our classification ideas on the huge XML file. Instead, we should think of how we could trim it down so that we still keep a representative snapshot of it while being able to quickly test our ideas. If we filter the XML for row tags that have a creation date of, for example, 2012, we still end up with over 6 million posts (2,323,184 questions and 4,055,999 answers), which should be enough to pick our training data from for now. We also do not want to operate on the XML format, as it will slow us down, too. The simpler the format, the better.
That's why we parse the remaining XML using Python's cElementTree and write it out to a tab-separated file.

Preselection and processing of attributes

To cut down the data even more, we can certainly drop attributes that we think will not help the classifier in distinguishing between good and not-so-good answers. But we have to be cautious here. Although some attributes do not directly impact the classification, they are still necessary to keep.

The PostTypeId attribute, for example, is necessary to distinguish between questions and answers. It will not be picked to serve as a feature, but we will need it to filter the data.

CreationDate could be interesting to determine the time span between posting the question and posting the individual answers, so we keep it.
The Score is of course important as an indicator for the community's evaluation.

ViewCount, in contrast, is most likely of no use for our task. Even if it would help the classifier to distinguish between good and bad, we would not have this information at the time when an answer is being submitted. Drop it!

The Body attribute obviously contains the most important information.
As it is encoded HTML, we will have to decode it to plain text.

OwnerUserId is only useful if we take user-dependent features into account, which we won't. Although we drop it here, we encourage you to use it to build a better classifier (maybe in connection with stackoverflow.com-Users.7z).

The Title attribute is also ignored here, although it could add some more information about the question.

CommentCount is also ignored.
Similar to ViewCount, it could help the classifier with posts that have been out there for a while (more comments = more ambiguous post?). It will, however, not help the classifier at the time an answer is posted.

AcceptedAnswerId is similar to Score in that it is an indicator of a post's quality. As we will access this per answer, instead of keeping this attribute, we will create the new attribute IsAccepted, which is 0 or 1 for answers and ignored for questions (ParentId=-1).

We end up with the following format:

Id <TAB> ParentId <TAB> IsAccepted <TAB> TimeToAnswer <TAB> Score <TAB> Text

For the concrete parsing details, please refer to so_xml_to_tsv.py and choose_instance.py.
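To make the conversion step more tangible, the following is a minimal sketch of the idea behind so_xml_to_tsv.py rather than the book's actual script. The file names, the 2012 filter, the assumption that a question appears in the dump before its answers, and the simplified date handling are our own:

# Hedged sketch of the XML-to-TSV conversion (not the original so_xml_to_tsv.py).
from datetime import datetime
from xml.etree import ElementTree as etree  # the book uses cElementTree, which
                                            # is equivalent (and gone in Python 3.9+)

def parse_date(s):
    # assumes the usual Stack Exchange dump format, e.g. "2012-07-31T21:42:52.667"
    return datetime.strptime(s, "%Y-%m-%dT%H:%M:%S.%f")

q_creation = {}  # question Id -> creation datetime
q_accepted = {}  # question Id -> Id of the accepted answer (or -1)

with open("so_posts.tsv", "w") as out:  # hypothetical output file name
    for _, elem in etree.iterparse("Posts.xml"):
        if elem.tag != "row":
            continue
        a = elem.attrib
        post_id = int(a["Id"])
        created = parse_date(a["CreationDate"])
        if created.year != 2012:        # keep only the 2012 slice of the dump
            elem.clear()
            continue
        if a["PostTypeId"] == "1":      # question
            q_creation[post_id] = created
            q_accepted[post_id] = int(a.get("AcceptedAnswerId", -1))
            parent_id, is_accepted, time_to_answer = -1, 0, 0
        elif a["PostTypeId"] == "2":    # answer
            parent_id = int(a["ParentId"])
            if parent_id not in q_creation:   # its question was filtered out
                elem.clear()
                continue
            is_accepted = int(q_accepted[parent_id] == post_id)
            time_to_answer = int((created - q_creation[parent_id]).total_seconds())
        else:                           # other post types are ignored
            elem.clear()
            continue
        text = a["Body"].replace("\t", " ").replace("\n", " ")
        out.write("\t".join(str(x) for x in
                            [post_id, parent_id, is_accepted,
                             time_to_answer, a["Score"], text]) + "\n")
        elem.clear()                    # free memory while streaming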
Suffice it to say that in order to speed up processing, we will split the data into two files: in meta.json, we store a dictionary mapping a post's Id value to its other data except Text in JSON format so that we can read it in the proper format. For example, the score of a post would reside at meta[Id]['Score']. In data.tsv, we store the Id and Text values, which we can easily read with the following method:

def fetch_posts():
    for line in open("data.tsv", "r"):
        post_id, text = line.split("\t")
        yield int(post_id), text.strip()
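As a quick, hypothetical usage sketch (not from the book's code), the metadata and the streamed post texts could then be combined like this; note that json.load returns string keys, so we convert them back to integer Ids:

# Hypothetical usage sketch: combine meta.json with the streamed post texts.
import json

with open("meta.json", "r") as f:
    # JSON object keys come back as strings, so convert them to integer Ids
    meta = {int(post_id): data for post_id, data in json.load(f).items()}

for post_id, text in fetch_posts():
    score = meta[post_id]['Score']  # e.g. look up the community score per post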
Defining what is a good answer

Before we can train a classifier to distinguish between good and bad answers, we have to create the training data. So far, we only have a bunch of data. What we still have to do is define labels.

We could, of course, simply use the IsAccepted attribute as a label. After all, it marks the answer that answered the question. However, that is only the opinion of the asker. Naturally, the asker wants to have a quick answer and accepts the first answer that seems good enough. If more answers are submitted over time, some of them will tend to be better than the one already accepted.
The asker, however, seldom gets back to the question and changes his mind. So we end up with many questions that have accepted answers that are not scored highest.

At the other extreme, we could simply always take the best- and worst-scored answers per question as positive and negative examples. However, what do we do with questions that have only good answers, say, one with two and the other with four points? Should we really take an answer with, for example, two points as a negative example just because it happened to be the one with the lower score?

We should settle somewhere between these extremes. If we take all answers that are scored higher than zero as positive and all answers with zero or less points as negative, we end up with quite reasonable labels:

>>> all_answers = [q for q, v in meta.items() if v['ParentId'] != -1]
>>> Y = np.asarray([meta[answerId]['Score'] > 0 for answerId in all_answers])
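A quick sanity check, which is our own addition rather than part of the book's code, is to look at how balanced the resulting labels are before training anything:

# Optional sanity check (our own addition): how balanced are the labels?
import numpy as np

print("Number of answers: %d" % len(Y))
print("Fraction labeled as good: %.2f" % np.mean(Y))  # Y is a boolean array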
Creating our first classifier

Let's start with the simple and beautiful nearest neighbor method from the previous chapter. Although it is not as advanced as other methods, it is very powerful: as it is not model-based, it can learn nearly any data. But this beauty comes with a clear disadvantage, which we will find out very soon.

Starting with kNN

This time, we won't implement it ourselves, but rather take it from the sklearn toolkit. There, the classifier resides in sklearn.neighbors. Let's start with a simple 2-Nearest Neighbor classifier:

>>> from sklearn import neighbors
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=2)
>>> print(knn)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     n_neighbors=2, p=2, weights='uniform')

It provides the same interface as all other estimators in sklearn: we train it using fit(), after which we can predict the class of new data instances using predict():

>>> knn.fit([[1], [2], [3], [4], [5], [6]], [0, 0, 0, 1, 1, 1])
>>> knn.predict([[1.5]])
array([0])
>>> knn.predict([[37]])
array([1])
>>> knn.predict([[3]])
array([0])

To get the class probabilities, we can use predict_proba().
In this case of having two classes, 0 and 1, it will return an array of two elements:

>>> knn.predict_proba([[1.5]])
array([[ 1.,  0.]])
>>> knn.predict_proba([[37]])
array([[ 0.,  1.]])
>>> knn.predict_proba([[3.5]])
array([[ 0.5,  0.5]])

Engineering the features

So, what kind of features can we provide to our classifier? What do we think will have the most discriminative power?

TimeToAnswer is already there in our meta dictionary, but it probably won't provide much value on its own.
Then there is only Text, but in its raw form, we cannot pass it to the classifier, as the features must be in numerical form. We will have to do the dirty (and fun!) work of extracting features from it.

What we could do is check the number of HTML links in the answer as a proxy for quality. Our hypothesis would be that more hyperlinks in an answer indicate better answers and thus a higher likelihood of being up-voted. Of course, we only want to count links in normal text and not in code examples:

import re

code_match = re.compile('<pre>(.*?)</pre>',
                        re.MULTILINE | re.DOTALL)
link_match = re.compile('<a href="http://.*?".*?>(.*?)</a>',
                        re.MULTILINE | re.DOTALL)
tag_match = re.compile('<[^>]*>',
                       re.MULTILINE | re.DOTALL)

def extract_features_from_body(s):
    link_count_in_code = 0
    # count links in code to later subtract them
    for match_str in code_match.findall(s):
        link_count_in_code += len(link_match.findall(match_str))
    return len(link_match.findall(s)) - link_count_in_code

For production systems, we would not want to parse HTML content with regular expressions. Instead, we should rely on excellent libraries such as BeautifulSoup, which does a marvelous job of robustly handling all the weird things that typically occur in everyday HTML.
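To make that tip concrete, here is a minimal sketch of how the same link count could be computed with BeautifulSoup. This is our own illustration, not the book's code; the helper name and the bs4 import are assumptions:

# Hedged sketch (our own, not from the book): count http links outside
# <pre> blocks using BeautifulSoup; requires the third-party bs4 package.
from bs4 import BeautifulSoup

def count_links_with_soup(html):
    soup = BeautifulSoup(html, "html.parser")
    for pre in soup.find_all("pre"):
        pre.decompose()  # drop code blocks so their links are not counted
    # mirror the regex above, which only matches http:// links
    return sum(1 for a in soup.find_all("a", href=True)
               if a["href"].startswith("http://"))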
With this in place, we can generate one feature per answer. But before we train the classifier, let's first have a look at what we will train it with. We can get a first impression with the frequency distribution of our new feature. This can be done by plotting the percentage of how often each value occurs in the data. Have a look at the following plot:

[Plot: frequency distribution of the number of links per post]

With the majority of posts having no link at all, we know now that this feature will not make a good classifier alone. Let's nevertheless try it out to get a first estimation of where we are.

Training the classifier

We have to pass the feature array together with the previously defined labels Y to the kNN learner to obtain a classifier:

# each sample is a one-element feature vector: the link count
X = np.asarray([[extract_features_from_body(text)] for post_id, text in
                fetch_posts() if post_id in all_answers])
knn = neighbors.KNeighborsClassifier()
knn.fit(X, Y)

Using the standard parameters, we just fitted a 5NN (meaning NN with k=5) to our data.
Why 5NN? Well, at the current state of our knowledge about the data, we really have no clue what the right k should be. Once we have more insight, we will have a better idea of how to set k.

Measuring the classifier's performance

We have to be clear about what we want to measure. The naïve but easiest way is to simply calculate the average prediction quality over the test set. This will result in a value between 0 for predicting everything wrongly and 1 for perfect prediction. The accuracy can be obtained through knn.score().

But as we learned in the previous chapter, we will not do it just once, but apply cross-validation here using the ready-made KFold class from sklearn.cross_validation.
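As an illustration of that last step, here is a minimal cross-validation sketch of our own; note that in recent scikit-learn versions KFold lives in sklearn.model_selection rather than sklearn.cross_validation, and the number of folds chosen here is an arbitrary assumption:

# Hedged sketch of KFold cross-validation for our kNN classifier.
# In recent scikit-learn versions, KFold is found in sklearn.model_selection.
import numpy as np
from sklearn import neighbors
from sklearn.model_selection import KFold

scores = []
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # 10 folds is an arbitrary choice
for train_idx, test_idx in cv.split(X):
    clf = neighbors.KNeighborsClassifier()
    clf.fit(X[train_idx], Y[train_idx])
    scores.append(clf.score(X[test_idx], Y[test_idx]))

print("Mean(scores)=%.5f  Stddev(scores)=%.5f"
      % (np.mean(scores), np.std(scores)))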