A different number of topics, or different values for parameters such as alpha, will result in systems whose end results are almost identical. On the other hand, if you are going to explore the topics directly, or build a visualization tool that exposes them, you should probably try a few values and see which gives you the most useful or most appealing results.

Alternatively, there are a few methods that will automatically determine the number of topics for you, depending on the dataset. One popular model is called the hierarchical Dirichlet process.
Again, the full mathematical model behind it is complex and beyond the scope of this book. However, the fable we can tell is that instead of having the topics fixed first, as in the LDA generative story, the topics themselves are generated along with the data, one at a time. Whenever the writer starts a new document, they have the option of using the topics that already exist or of creating a completely new one. The more topics that have already been created, the lower the probability of creating a new one instead of reusing an existing one, but the possibility always exists. This means that the more documents we have, the more topics we will end up with.
This is one of those statements that is unintuitive at first, but makes perfect sense upon reflection. We are grouping documents, and the more examples we have, the more finely we can break them up. If we only have a few examples of news articles, then "Sports" will be a topic. However, as we have more, we start to break it up into the individual modalities: "Hockey", "Soccer", and so on. As we have even more data, we can start to tell nuances apart, such as articles about individual teams and even individual players. The same is true for people.
In a group of many different backgrounds, with a few "computer people", you might put them together; in a slightly larger group, you will have separate gatherings for programmers and systems administrators; and in the real world, we even have different gatherings for Python and Ruby programmers.

The hierarchical Dirichlet process (HDP) is available in gensim. Using it is trivial. To adapt the code we wrote for LDA, we just need to replace the call to gensim.models.ldamodel.LdaModel with a call to the HdpModel constructor as follows:

>>> hdp = gensim.models.hdpmodel.HdpModel(mm, id2word)

That's it (except that it takes a bit longer to compute; there are no free lunches). Now, we can use this model in much the same way as we used the LDA model, except that we did not need to specify the number of topics.
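If you want to see the two models side by side, the following is a minimal sketch; the corpus and dictionary file names are placeholders standing in for whatever was saved in the earlier LDA example:

from gensim import corpora, models

# Load the corpus and dictionary built earlier in the chapter
# (the file names here are placeholders).
mm = corpora.MmCorpus('wiki_en_tfidf.mm')
id2word = corpora.Dictionary.load_from_text('wiki_en_wordids.txt')

# LDA: the number of topics must be fixed up front.
lda = models.ldamodel.LdaModel(mm, id2word=id2word, num_topics=100)

# HDP: no num_topics parameter; the number of topics is inferred
# from the data itself.
hdp = models.hdpmodel.HdpModel(mm, id2word)

# Both models expose their topics through the same interface.
print(hdp.print_topics())

Note that HdpModel takes the dictionary as its second argument; the rest of the usage mirrors the LDA model.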
Summary

In this chapter, we discussed topic modeling. Topic modeling is more flexible than clustering, as these methods allow each document to be partially present in more than one group. To explore these methods, we used a new package, gensim.

Topic modeling was first developed, and is easier to understand, in the case of text, but in the computer vision chapter we will see how some of these techniques may be applied to images as well. Topic models are very important in modern computer vision research. In fact, unlike the previous chapters, this chapter was very close to the cutting edge of research in machine learning algorithms. The original LDA algorithm was published in a scientific journal in 2003, but the method that gensim uses to be able to handle Wikipedia was only developed in 2010, and the HDP algorithm is from 2011.
The research continues, and you can find many variations and models with wonderful names such as the Indian buffet process (not to be confused with the Chinese restaurant process, which is a different model), or Pachinko allocation (Pachinko being a type of Japanese game, a cross between a slot machine and pinball).

We have now gone through some of the major machine learning modes: classification, clustering, and topic modeling. In the next chapter, we go back to classification, but this time, we will be exploring advanced algorithms and approaches.

Classification – Detecting Poor Answers

Now that we are able to extract useful features from text, we can take on the challenge of building a classifier using real data. Let's come back to our imaginary website in Chapter 3, Clustering – Finding Related Posts, where users can submit questions and get them answered.

A continuous challenge for owners of those Q&A sites is to maintain a decent level of quality in the posted content.
Sites such as StackOverflow make considerable efforts to encourage users: they offer diverse possibilities to score content, and award badges and bonus points so that users spend more energy on carving out the question or crafting a possible answer.

One particularly successful incentive is the ability for the asker to flag one answer to their question as the accepted answer (again, there are incentives for the asker to flag answers as such). This will result in more score points for the author of the flagged answer.

Would it not be very useful to the user to immediately see how good his answer is while he is typing it in? That means, the website would continuously evaluate his work-in-progress answer and provide feedback as to whether the answer shows some signs of being a poor one.
This will encourage the user to put more effort into writing the answer (providing a code example? including an image?), and thus improve the overall system. Let's build such a mechanism in this chapter.

Sketching our roadmap

As we will build a system using real data that is very noisy, this chapter is not for the fainthearted; we will not arrive at the golden solution of a classifier that achieves 100 percent accuracy. Often, even humans disagree about whether an answer was good or not (just look at some of the StackOverflow comments).
Quite the contrary, we will find out that some problems like this one are so hard that we have to adjust our initial goals along the way. We will start with the nearest neighbor approach, find out why it is not very good for the task, switch over to logistic regression, and arrive at a solution that achieves good enough prediction quality, but on a smaller part of the answers. Finally, we will spend some time looking at how to extract the winner to deploy it on the target system.

Learning to classify classy answers

In classification, we want to find the corresponding classes, sometimes also called labels, for given data instances.
To be able to achieve this, we need to answer two questions:

• How should we represent the data instances?
• Which model or structure should our classifier possess?

Tuning the instance

In its simplest form, in our case, the data instance is the text of the answer, and the label would be a binary value indicating whether the asker accepted this text as an answer or not.
Raw text, however, is a very inconvenient representation to process for most machine learning algorithms. They want numbers. It will be our task to extract useful features from the raw text, which the machine learning algorithm can then use to learn the right label for it.
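To make "extracting features" concrete, here is a tiny sketch that turns raw answer text into numbers; the particular features (length of the text, number of code blocks, number of links) are hypothetical choices for illustration, not the exact set we will settle on:

import re

def extract_features(text):
    # Turn raw answer text into a small numeric feature vector.
    num_links = len(re.findall(r'<a ', text))
    num_code_blocks = len(re.findall(r'<pre>', text))
    return [len(text), num_code_blocks, num_links]

# One (text, label) pair; label 1 would mean "accepted answer".
text = "<p>Use <pre>len(s)</pre>, see <a href='...'>the docs</a>.</p>"
print(extract_features(text))  # prints [61, 1, 1]

The machine learning algorithm never sees the text itself, only vectors like this one together with their labels.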
Tuning the classifier

Once we have found or collected enough (text, label) pairs, we can train a classifier. For the underlying structure of the classifier, we have a wide range of possibilities, each of them having advantages and drawbacks. Just to name some of the more prominent choices, there are logistic regression, decision trees, SVMs, and Naïve Bayes. In this chapter, we will contrast the instance-based method from the last chapter, nearest neighbor, with model-based logistic regression.
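As a quick preview of what that contrast looks like in code, here is a minimal scikit-learn sketch with stand-in random data (the real feature matrix will be built from the StackOverflow dump we fetch next):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: 100 instances with 3 features each and a 0/1 label.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)

# Instance-based: "training" just stores the data; prediction looks
# at the k nearest stored instances.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Model-based: training fits one weight per feature; prediction only
# needs those weights.
logreg = LogisticRegression().fit(X, y)

print(knn.predict(X[:5]))
print(logreg.predict(X[:5]))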
Fetching the data

Luckily for us, the team behind StackOverflow provides most of the data behind the StackExchange universe, to which StackOverflow belongs, under a cc-wiki license. At the time of writing this book, the latest data dump can be found at https://archive.org/details/stackexchange. It contains data dumps of all Q&A sites of the StackExchange family. For StackOverflow, you will find multiple files, of which we only need the stackoverflow.com-Posts.7z file, which is 5.2 GB.

After downloading and extracting it, we have around 26 GB of data in XML format, containing all questions and answers as individual row tags within the root tag posts:

<?xml version="1.0" encoding="utf-8"?>
<posts>
...
<row Id="4572748" PostTypeId="2" ParentId="4568987"
  CreationDate="2011-01-01T00:01:03.387" Score="4" ViewCount=""
  Body="&lt;p&gt;IANAL, but &lt;a
  href=&quot;http://support.apple.com/kb/HT2931&quot;
  rel=&quot;nofollow&quot;&gt;this&lt;/a&gt; indicates to me that you
  cannot use the loops in your application:&lt;/p&gt;
  &lt;blockquote&gt; &lt;p&gt;...however, individual audio loops may
  not be commercially or otherwise distributed on a standalone basis,
  nor may they be repackaged in whole or in part as audio samples,
  sound effects or music beds.&quot;&lt;/p&gt;
  &lt;p&gt;So don't worry, you can make commercial music with
  GarageBand, you just can't distribute the loops as
  loops.&lt;/p&gt; &lt;/blockquote&gt;"
  OwnerUserId="203568" LastActivityDate="2011-01-01T00:01:03.387"
  CommentCount="1" />
…
</posts>
Name          Type              Description
----          ----              -----------
Id            Integer           This is a unique identifier.
PostTypeId    Integer           This describes the category of the post. The
                                values interesting to us are the following:
                                • Question
                                • Answer
                                Other values will be ignored.
ParentId      Integer           This is a unique identifier of the question to
                                which this answer belongs (missing for
                                questions).
CreationDate  DateTime          This is the date of submission.
Score         Integer           This is the score of the post.
ViewCount     Integer or empty  This is the number of user views for this post.
Body          String            This is the complete post as encoded HTML text.
OwnerUserId   Id                This is a unique identifier of the poster.
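Because the extracted file is far too big to read into memory in one go, a streaming XML parser is the natural tool. The following is a minimal sketch, assuming the extracted file is called Posts.xml; the file name and the use of xml.etree.ElementTree are assumptions for illustration:

import xml.etree.ElementTree as ET

# Stream over the huge XML file one row tag at a time.
for event, elem in ET.iterparse('Posts.xml'):
    if elem.tag == 'row':
        post_type = int(elem.attrib['PostTypeId'])
        score = int(elem.attrib['Score'])
        if post_type == 2:  # PostTypeId 2 is an answer (see the sample row above)
            question_id = int(elem.attrib['ParentId'])
            # ...extract features from elem.attrib['Body'] here...
        elem.clear()  # free the processed element to keep memory usage flat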