This simply means that instead of comparing word to word, we say that two documents are similar if they talk about the same topics. This can be very powerful, as two text documents that share few words may actually refer to the same topic! They may just refer to it using different constructions (for example, one document may read "the President of the United States" while the other will use the name "Barack Obama").

Topic models are good on their own to build visualizations and explore data. They are also very useful as an intermediate step in many other tasks.

At this point, we can redo the exercise we performed in the last chapter and look for the most similar post to an input query, by using the topics to define similarity. Whereas earlier we compared two documents by comparing their word vectors directly, we can now compare two documents by comparing their topic vectors. For this, we are going to project the documents to the topic space.
That is, we want to have a vector of topics that summarizes the document. How to perform these types of dimensionality reduction in general is an important task in itself, and we have a chapter entirely devoted to it. For the moment, we just show how topic models can be used for exactly this purpose; once topics have been computed for each document, we can perform operations on its topic vector and forget about the original words.
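To make this concrete, here is a minimal sketch of what projecting a single document into topic space looks like in gensim, assuming the model and corpus objects built earlier in the chapter with the 100-topic model used so far; the model returns a sparse list of (topic, weight) pairs, which matutils can turn into a dense vector:

>>> from gensim import matutils
>>> doc = corpus[0]              # one document, in bag-of-words form
>>> sparse_topics = model[doc]   # sparse list of (topic_id, weight) pairs
>>> dense_topics = matutils.sparse2full(sparse_topics, model.num_topics)
>>> dense_topics.shape           # one weight per topic
(100,)

From this point on, only the topic weights matter; the individual word identities are no longer needed.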
If the topics are meaningful, they will be potentially more informative than the raw words. Additionally, this may bring computational advantages, as it is much faster to compare 100 vectors of topic weights than vectors of the size of the vocabulary (which will contain thousands of terms).

Using gensim, we have seen earlier how to compute the topics corresponding to all the documents in the corpus. We will now compute these for all the documents, store them in a NumPy array, and compute all pairwise distances (the final transpose puts one document per row, so that each row is that document's topic vector):

>>> from gensim import matutils
>>> topics = matutils.corpus2dense(model[corpus], num_terms=model.num_topics).T

Now, topics is a matrix of topic weights, with one row per document. We can use the pdist function in SciPy to compute all pairwise distances.
That is, with a single function call, we compute all the values of sum((topics[ti] - topics[tj])**2):

>>> from scipy.spatial import distance
>>> pairwise = distance.squareform(distance.pdist(topics))

Now, we will employ one last little trick; we will set the diagonal elements of the distance matrix to a high value (it just needs to be larger than the other values in the matrix):

>>> largest = pairwise.max()
>>> for ti in range(len(topics)):
...     pairwise[ti,ti] = largest+1

And we are done! For each document, we can look up the closest element easily (this is a type of nearest neighbor classifier):

>>> def closest_to(doc_id):
...     return pairwise[doc_id].argmin()

Note that this would not work if we had not set the diagonal elements to a large value: the function would always return the same element, as it is the one most similar to itself (except in the weird case where two elements had exactly the same topic distribution, which is very rare unless they are exactly the same).

For example, here is one possible query document (it is the second document in our collection):

From: geb@cs.pitt.edu (Gordon Banks)
Subject: Re: request for information on "essential tremor" and Indrol?
In article <1q1tbnINNnfn@life.ai.mit.edu> sundar@ai.mit.edu writes:
Essential tremor is a progressive hereditary tremor that gets worse
when the patient tries to use the effected member.
All limbs, vocal cords, and head can be involved. Inderal is a
beta-blocker and is usually effective in diminishing the tremor.
Alcohol and mysoline are also effective, but alcohol is too toxic
to use as a treatment.
----------------------------------------------------------------
Gordon Banks N3JXP       | "Skepticism is the chastity of the intellect, and
geb@cadre.dsl.pitt.edu   | it is shameful to surrender it too soon."
----------------------------------------------------------------

If we ask for the most similar document to closest_to(1), we receive the following document as a result:

From: geb@cs.pitt.edu (Gordon Banks)
Subject: Re: High Prolactin
In article <93088.112203JER4@psuvm.psu.edu> JER4@psuvm.psu.edu
(John E. Rodway) writes:
>Any comments on the use of the drug Parlodel for high prolactin
>in the blood?

It can suppress secretion of prolactin. Is useful in cases of
galactorrhea. Some adenomas of the pituitary secret too much.
----------------------------------------------------------------
Gordon Banks N3JXP       | "Skepticism is the chastity of the intellect, and
geb@cadre.dsl.pitt.edu   | it is shameful to surrender it too soon."
----------------------------------------------------------------

The system returns a post by the same author discussing medications.

Modeling the whole of Wikipedia

While the initial LDA implementations could be slow, which limited their use to small document collections, modern algorithms work well with very large collections of data.
Following the documentation of gensim, we are going to build a topic model for the whole of the English-language Wikipedia. This takes hours, but can be done even with just a laptop! With a cluster of machines, we can make it go much faster, but we will look at that sort of processing environment in a later chapter.

First, we download the whole Wikipedia dump from http://dumps.wikimedia.org. This is a large file (currently over 10 GB), so it may take a while, unless your Internet connection is very fast. Then, we will index it with a gensim tool:

python -m gensim.scripts.make_wiki \
    enwiki-latest-pages-articles.xml.bz2 wiki_en_output

Run the previous line on the command shell, not on the Python shell.
After a few hours, the index will be saved in the same directory. At this point, we can build the final topic model. This process looks exactly like what we did for the small AP dataset. We first import a few packages:

>>> import logging, gensim

Now, we set up logging, using the standard Python logging module (which gensim uses to print out status messages). This step is not strictly necessary, but it is nice to have a little more output to know what is happening:

>>> logging.basicConfig(
...     format='%(asctime)s : %(levelname)s : %(message)s',
...     level=logging.INFO)

Now we load the preprocessed data:

>>> id2word = gensim.corpora.Dictionary.load_from_text(
...     'wiki_en_output_wordids.txt')
>>> mm = gensim.corpora.MmCorpus('wiki_en_output_tfidf.mm')

Finally, we build the LDA model as we did earlier:

>>> model = gensim.models.ldamodel.LdaModel(
...     corpus=mm,
...     id2word=id2word,
...     num_topics=100,
...     update_every=1,
...     chunksize=10000,
...     passes=1)

This will again take a couple of hours.
You will see the progress on your console, which can give you an indication of how long you still have to wait. Once it is done, we can save the topic model to a file, so we don't have to redo it:

>>> model.save('wiki_lda.pkl')

If you exit your session and come back later, you can load the model again using the following command (after the appropriate imports, naturally):

>>> model = gensim.models.ldamodel.LdaModel.load('wiki_lda.pkl')

The model object can be used to explore the collection of documents and to build the topics matrix as we did earlier (a sketch of that step is shown below).

We can see that this is still a sparse model even if we have many more documents than we had earlier (over 4 million as we are writing this):

>>> import numpy as np
>>> lens = (topics > 0).sum(axis=1)    # number of topics per document
>>> print(np.mean(lens))
6.41
>>> print(np.mean(lens <= 10))
0.941

So, the average document mentions 6.4 topics and 94 percent of them mention 10 or fewer topics.

We can ask what the most talked about topic in Wikipedia is.
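The topics matrix used in the snippets above and below can be rebuilt for the Wikipedia corpus the same way as for the AP dataset. A minimal sketch, assuming the model and mm objects from the previous steps; as before, the transpose puts one document per row (note that this extra pass over four million documents takes a while, and the dense matrix occupies a couple of gigabytes of memory):

>>> from gensim import matutils
>>> topics = matutils.corpus2dense(model[mm], num_terms=model.num_topics).T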
We will first compute the total weight for each topic (by summing up the weights from all the documents) and then retrieve the words corresponding to the most highly weighted topic. This is performed using the following code:

>>> weights = topics.sum(axis=0)
>>> words = model.show_topic(weights.argmax(), 64)

Using the same tools as we did earlier to build up a visualization, we can see that the most talked about topic is related to music and is a very coherent topic. A full 18 percent of Wikipedia pages are partially related to this topic (5.5 percent of all the words in Wikipedia are assigned to this topic).
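If you want to reproduce a plot like the screenshot referred to below, one possible approach (not necessarily the exact tool used earlier in the book) is to feed the word weights returned by show_topic into the third-party wordcloud and matplotlib packages, assuming both are installed:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# show_topic returns (word, weight) pairs in recent gensim versions;
# older versions returned (weight, word), so handle both orderings.
freqs = {}
for a, b in words:
    word, weight = (a, b) if isinstance(a, str) else (b, a)
    freqs[word] = float(weight)

wc = WordCloud(background_color='white').generate_from_frequencies(freqs)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()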
Take a look at the following screenshot:

These plots and numbers were obtained when the book was being written. As Wikipedia keeps changing, your results will be different. We expect that the trends will be similar, but the details may vary.

Alternatively, we can look at the least talked about topic:

>>> words = model.show_topic(weights.argmin(), 64)

The least talked about topic is harder to interpret, but many of its top words refer to airports in eastern countries. Just 1.6 percent of documents touch upon it, and it represents just 0.1 percent of the words.

Choosing the number of topics

So far in the chapter, we have used a fixed number of topics for our analyses, namely 100.
This was a purely arbitrary number; we could just as well have used either 20 or 200 topics. Fortunately, for many uses, this number does not really matter. If you are only going to use the topics as an intermediate step, as we did previously when finding similar posts, the final behavior of the system is rarely very sensitive to the exact number of topics used in the model.
This means that as long as you use enough topics, whether you use 100 topics or 200, the recommendations that result from the process will not be very different; 100 is often a good enough number (while 20 is too few for a general collection of text documents). The same is true of setting the alpha value. While playing around with it can change the topics, the final results are again robust against this change.

Topic modeling is often a means towards an end. In that case, it is not always very important exactly which parameter values are used.
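If you want to check this robustness claim on your own data, a rough sketch is to train two models with different numbers of topics and compare the nearest-neighbor recommendations they produce. This assumes the corpus object (and its id2word mapping) from the AP example earlier in the chapter; keep in mind that LDA training is itself randomized, so even two runs with the same settings will not agree perfectly:

import numpy as np
from gensim import models, matutils
from scipy.spatial import distance

def nearest_posts(num_topics):
    # Train a model, project every post into topic space (one row per
    # document), and return, for each post, the index of its closest post.
    model = models.ldamodel.LdaModel(
        corpus, id2word=corpus.id2word, num_topics=num_topics)
    topics = matutils.corpus2dense(model[corpus], num_terms=num_topics).T
    pairwise = distance.squareform(distance.pdist(topics))
    np.fill_diagonal(pairwise, pairwise.max() + 1)
    return pairwise.argmin(axis=1)

nn100 = nearest_posts(100)
nn200 = nearest_posts(200)
print("fraction of posts with the same top recommendation:",
      np.mean(nn100 == nn200))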