This is a standard dataset for text modeling research, which was used in some of the initial works on topic models. After downloading the data, we can load it by running the following code:

>>> from gensim import corpora, models
>>> corpus = corpora.BleiCorpus('./data/ap/ap.dat',
                                './data/ap/vocab.txt')

The corpus variable holds all of the text documents and has loaded them in a format that makes for easy processing. We can now build a topic model using this object as input:

>>> model = models.ldamodel.LdaModel(
              corpus,
              num_topics=100,
              id2word=corpus.id2word)

This single constructor call will statistically infer which topics are present in the corpus.
We can explore the resulting model in many ways. We can see the list of topics a document refers to using the model[doc] syntax, as shown in the following example:

>>> doc = corpus.docbyoffset(0)
>>> topics = model[doc]
>>> print(topics)
[(3, 0.023607255776894751),
 (13, 0.11679936618551275),
 (19, 0.075935855202707139),
 ....
 (92, 0.10781541687001292)]

The result will almost surely look different on your computer! The learning algorithm uses some random numbers and every time you learn a new topic model on the same input data, the result is different.
Some of the qualitative properties of the model will be stable across different runs if your data is well behaved. For example, if you are using the topics to compare documents, as we do here, then the similarities should be robust and change only slightly. On the other hand, the order of the different topics will be completely different.

The format of the result is a list of pairs: (topic_index, topic_weight). We can see that only a few topics are used for each document (in the preceding example, there is no weight for topics 0, 1, and 2; the weight for those topics is 0).
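Since the output is just a Python list of pairs, we can, for instance, sort it to see the most prominent topics of this document first. This is a minimal sketch building on the topics variable defined above:

>>> sorted_topics = sorted(topics, key=lambda t: t[1], reverse=True)
>>> print(sorted_topics[:3])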
The topic model is a sparse model: although there are many possible topics, for each document, only a few of them are used. This is not strictly true, as all the topics have a nonzero probability in the LDA model, but some of them have such a small probability that we can round it to zero as a good approximation. We can explore this further by plotting a histogram of the number of topics that each document refers to:

>>> import matplotlib.pyplot as plt
>>> num_topics_used = [len(model[doc]) for doc in corpus]
>>> plt.hist(num_topics_used)

You will get the following plot:

Sparsity means that while you may have large matrices and vectors, in principle, most of the values are zero (or so small that we can round them to zero as a good approximation).
Therefore, only a few things are relevant at any given time. Often, problems that seem too big to solve are actually feasible because the data is sparse. For example, even though any web page can link to any other web page, the graph of links is actually very sparse, as each web page will link to a very tiny fraction of all other web pages.

In the preceding graph, we can see that about 150 documents have 5 topics, while the majority deals with around 10 to 12 of them. No document talks about more than 20 different topics.

To a large extent, this is due to the value of the parameters that were used, namely, the alpha parameter.
The exact meaning of alpha is a bit abstract, but bigger values for alpha will result in more topics per document. Alpha needs to be a value greater than zero, and is typically set to a small value, usually less than one. The smaller the value of alpha, the fewer topics each document will be expected to discuss. By default, gensim will set alpha to 1/num_topics, but you can set it explicitly by passing it as an argument to the LdaModel constructor as follows:

>>> model = models.ldamodel.LdaModel(
              corpus,
              num_topics=100,
              id2word=corpus.id2word,
              alpha=1)

In this case, this is a larger alpha than the default, which should lead to more topics per document. As we can see in the combined histogram given next, gensim behaves as we expected and assigns more topics to each document. In that histogram, many documents touch upon 20 to 25 different topics.
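If you want to reproduce this kind of comparison yourself, one way is to train two models side by side and overlay their histograms of topics per document. The following is a minimal sketch, assuming corpus has been loaded as shown earlier; the labels and the choice of alpha=1 for the second model are illustrative only:

import matplotlib.pyplot as plt
from gensim import models

# Train one model with the default alpha and one with a larger alpha
model_default = models.ldamodel.LdaModel(
    corpus, num_topics=100, id2word=corpus.id2word)
model_alpha1 = models.ldamodel.LdaModel(
    corpus, num_topics=100, id2word=corpus.id2word, alpha=1)

# Count how many topics each document touches under each model
topics_default = [len(model_default[doc]) for doc in corpus]
topics_alpha1 = [len(model_alpha1[doc]) for doc in corpus]

# Overlay the two histograms to see the effect of alpha
plt.hist([topics_default, topics_alpha1],
         label=['alpha=1/num_topics (default)', 'alpha=1'])
plt.xlabel('Number of topics per document')
plt.ylabel('Number of documents')
plt.legend()
plt.show()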
If you set the value lower, you will observe the opposite (downloading the code from the online repository will allow you to play around with these values).

What are these topics? Technically, as we discussed earlier, they are multinomial distributions over words, which means that they assign a probability to each word in the vocabulary. Words with high probability are more associated with that topic than words with lower probability.

Our brains are not very good at reasoning with probability distributions, but we can readily make sense of a list of words.
Therefore, it is typical to summarize topics by the list of the most highly weighted words. In the following table, we display the first ten topics:

Topic no.  Topic
1          dress military soviet president new state capt carlucci states leader stance government
2          koch zambia lusaka oneparty orange kochs party i government mayor new political
3          human turkey rights abuses royal thompson threats new state wrote garden president
4          bill employees experiments levin taxation federal measure legislation senate president whistleblowers sponsor
5          ohio july drought jesus disaster percent hartford mississippi crops northern valley virginia
6          united percent billion year president world years states people i bush news
7          b hughes affidavit states united ounces squarefoot care delaying charged unrealistic bush
8          yeutter dukakis bush convention farm subsidies uruguay percent secretary general i told
9          kashmir government people srinagar india dumps city two jammukashmir group moslem pakistan
10         workers vietnamese irish wage immigrants percent bargaining last island police hutton I

Although daunting at first glance, when reading through the list of words, we can clearly see that the topics are not just random words, but instead these are logical groups.
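A table like this one can be built directly from the model object. The following is a minimal sketch that uses gensim's show_topic method to collect the most highly weighted words of the first ten topics; the number of words printed is an arbitrary choice, and depending on your gensim version, show_topic may return (word, weight) or (weight, word) pairs, so adjust accordingly:

# Print the ten most highly weighted words for each of the first ten topics.
# This assumes show_topic returns (word, weight) pairs.
for topic_index in range(10):
    words = model.show_topic(topic_index, topn=10)
    print(topic_index + 1, ' '.join(word for word, weight in words))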
We can also see that these topics refer to older news items, from when the Soviet Union still existed and Gorbachev was its Secretary General. We can also represent the topics as word clouds, making more likely words larger. For example, this is the visualization of a topic which deals with the Middle East and politics:

We can also see that some of the words should perhaps be removed (for example, the word "I") as they are not so informative; they are stop words.
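If you would like to draw a similar word cloud for one of your own topics, here is a minimal sketch. It uses the third-party wordcloud package rather than the pytagcloud tool with which the book's figures were produced (see the note below); the topic index, image size, and output file name are arbitrary illustrative choices:

from wordcloud import WordCloud

# Collect the word weights for one topic (topic 1 is an arbitrary choice).
# This assumes show_topic returns (word, weight) pairs; older gensim
# versions may return (weight, word) instead.
freqs = dict(model.show_topic(1, topn=40))

# Render the cloud, scaling word size by topic weight, and save it to disk
wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(freqs)
wc.to_file('topic1_cloud.png')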
When building topic models, it can be useful to filter out stop words, as otherwise you might end up with a topic consisting entirely of stop words. We may also wish to preprocess the text to stems in order to normalize plurals and verb forms. This process was covered in the previous chapter and you can refer to it for details. If you are interested, you can download the code from the companion website of the book and try all these variations to draw different pictures.

Building a word cloud like the previous one can be done with several different pieces of software. For the graphics in this chapter, we used a Python-based tool called pytagcloud. This package requires a few dependencies to install and is not central to machine learning, so we won't consider it in the main text; however, we have all of the code available in the online code repository to generate the figures in this chapter.

Comparing documents by topics

Topics can be useful on their own to build the sort of small vignettes with words that are shown in the previous screenshot.
These visualizations can be used to navigate a large collection of documents. For example, a website can display the different topics as different word clouds, allowing a user to click through to the documents. In fact, they have been used in just this way to analyze large collections of documents.

However, topics are often just an intermediate tool to another end. Now that we have an estimate for each document of how much of that document comes from each topic, we can compare the documents in topic space.
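As a preview of what comparing documents in topic space can look like in code, here is a minimal sketch. It converts each document's sparse list of (topic_index, topic_weight) pairs into a dense vector and measures the distance between two documents; the use of Euclidean distance and the choice of the first two documents in the corpus are purely illustrative:

import numpy as np
from itertools import islice
from scipy.spatial import distance

def topic_vector(model, doc, num_topics=100):
    # Fill a dense vector with the weights of the topics the document uses;
    # all other entries stay at zero.
    vec = np.zeros(num_topics)
    for topic_index, topic_weight in model[doc]:
        vec[topic_index] = topic_weight
    return vec

# Compare the first two documents of the corpus in topic space
doc0, doc1 = islice(corpus, 2)
v0 = topic_vector(model, doc0)
v1 = topic_vector(model, doc1)
print('Distance in topic space:', distance.euclidean(v0, v1))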