An example will give us a quick impression of the noise that we have to expect. For the sake of simplicity, we will focus on one of the shorter posts:

>>> post_group = zip(train_data.data, train_data.target)
>>> all = [(len(post[0]), post[0], train_data.target_names[post[1]])
           for post in post_group]
>>> graphics = sorted([post for post in all if
                       post[2]=='comp.graphics'])
>>> print(graphics[5])
(245, 'From: SITUNAYA@IBM3090.BHAM.AC.UK\nSubject: test....(sorry)\nOrganization: The University of Birmingham, United Kingdom\nLines: 1\nNNTP-Posting-Host: ibm3090.bham.ac.uk<…snip…>', 'comp.graphics')

For this post, there is no real indication that it belongs to comp.graphics, considering only the wording that is left after the preprocessing step:

>>> noise_post = graphics[5][1]
>>> analyzer = vectorizer.build_analyzer()
>>> print(list(analyzer(noise_post)))
['situnaya', 'ibm3090', 'bham', 'ac', 'uk', 'subject', 'test', 'sorri', 'organ', 'univers', 'birmingham', 'unit', 'kingdom', 'line', 'nntp', 'post', 'host', 'ibm3090', 'bham', 'ac', 'uk']

This is only after tokenization, lowercasing, and stop word removal.
If we also subtract those words that will later be filtered out via min_df and max_df, which happens in fit_transform, it gets even worse:

>>> useful = set(analyzer(noise_post)).intersection(vectorizer.get_feature_names())
>>> print(sorted(useful))
['ac', 'birmingham', 'host', 'kingdom', 'nntp', 'sorri', 'test', 'uk', 'unit', 'univers']

What is more, most of the words occur frequently in other posts as well, as we can check with the IDF scores.
Remember that the higher the TF-IDF value, the more discriminative a term is for a given post. As IDF is a multiplicative factor here, a low IDF value signals that the term is not of great value in general:

>>> for term in sorted(useful):
...     print('IDF(%s)=%.2f' % (term, vectorizer._tfidf.idf_[vectorizer.vocabulary_[term]]))
IDF(ac)=3.51
IDF(birmingham)=6.77
IDF(host)=1.74
IDF(kingdom)=6.68
IDF(nntp)=1.77
IDF(sorri)=4.14
IDF(test)=3.83
IDF(uk)=3.70
IDF(unit)=4.42
IDF(univers)=1.91

So, the terms with the highest discriminative power, birmingham and kingdom, are clearly not that related to computer graphics, and the same is the case for the terms with lower IDF scores.
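As a quick reminder of why common terms receive low scores: in one standard formulation (scikit-learn's smoothed variant differs slightly in its constants), the scores are

$$\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)} + 1, \qquad \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t),$$

where N is the total number of posts and df(t) is the number of posts containing the term t. A term such as univers, which occurs in a sizable fraction of all posts, therefore ends up with an IDF close to 1 and contributes little discriminative power.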
Understandably, posts from different newsgroups will be clustered together.

For our goal, however, this is no big deal, as we are only interested in cutting down the number of posts that we have to compare a new post to. After all, the particular newsgroup our training data came from is of no special interest.

Tweaking the parameters

So what about all the other parameters? Can we tweak them to get better results? Sure. We can, of course, tweak the number of clusters, or play with the vectorizer's max_features parameter (you should try that!).
Also, we can play with different cluster center initializations. Then there are more exciting alternatives to K-means itself. There are, for example, clustering approaches that even let you use different similarity measurements, such as Cosine similarity, Pearson, or Jaccard. An exciting field for you to play in.

But before you go there, you will have to define what you actually mean by "better". SciKit has a complete package dedicated only to this definition. The package is called sklearn.metrics and contains a full range of different metrics to measure clustering quality. Maybe that should be the first place to go now, right into the sources of the metrics package.
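To make this concrete, here is a minimal sketch of how such an experiment could look, combining the tuning knobs mentioned above with two metrics from sklearn.metrics. It uses the plain TfidfVectorizer for brevity instead of the stemmed variant built earlier in the chapter, and the values of max_features, n_clusters, sample_size, and random_state are placeholders to experiment with, not recommendations:

>>> from sklearn.cluster import KMeans
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn import metrics
>>> # cap the vocabulary size - one of the knobs worth playing with
>>> vect = TfidfVectorizer(max_features=5000, stop_words='english',
...                        decode_error='ignore')
>>> X = vect.fit_transform(train_data.data)
>>> km = KMeans(n_clusters=50, random_state=3)
>>> labels = km.fit_predict(X)
>>> # with the newsgroup labels available as ground truth
>>> print(metrics.adjusted_rand_score(train_data.target, labels))
>>> # without ground truth: how cohesive are the clusters themselves?
>>> print(metrics.silhouette_score(X, labels, sample_size=1000))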
Summary

That was a tough ride, from preprocessing through clustering to a solution that can convert noisy text into a meaningful, concise vector representation that we can cluster. If we look at the effort it took before we were finally able to cluster, it was more than half of the overall task.
But on the way, we learned quite a bit about text processing and how far simple counting can get you with noisy real-world data. The ride has been made much smoother, though, because of SciKit and its powerful packages. And there is more to explore. In this chapter, we only scratched the surface of its capabilities. In the next chapters, we will see more of its power.

Topic Modeling

In the previous chapter, we grouped text documents using clustering. This is a very useful tool, but it is not always the best.
Clustering results in each text belonging to exactly one cluster. This book is about machine learning and Python. Should it be grouped with other Python-related works or with machine learning-related works? In a physical bookstore, we would need a single place to stock the book. In an Internet store, however, the answer is that this book is about both machine learning and Python, and it should be listed in both sections. This does not mean that the book will be listed in all the sections, of course. We will not list this book with other baking books.

In this chapter, we will learn methods that do not cluster documents into completely separate groups but allow each document to refer to several topics.
These topics will be identified automatically from a collection of text documents. These documents may be whole books or shorter pieces of text such as a blog post, a news story, or an e-mail.

We would also like to be able to infer the fact that these documents may have topics that are central to them, while referring to other topics only in passing.
This book mentions plotting every so often, but it is not a central topic as machine learning is. This means that documents have topics that are central to them and others that are more peripheral. The subfield of machine learning that deals with these problems is called topic modeling and is the subject of this chapter.

Latent Dirichlet allocation

LDA and LDA: unfortunately, there are two methods in machine learning with the initials LDA: latent Dirichlet allocation, which is a topic modeling method, and linear discriminant analysis, which is a classification method. They are completely unrelated, except for the fact that the initials LDA can refer to either. In certain situations, this can be confusing.
The scikit-learn tool has a submodule, sklearn.lda, which implements linear discriminant analysis. At the moment, scikit-learn does not implement latent Dirichlet allocation.

The topic model we will look at is latent Dirichlet allocation (LDA). The mathematical ideas behind LDA are fairly complex, and we will not go into the details here. For those who are interested, and adventurous enough, Wikipedia will provide all the equations behind these algorithms: http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation.

However, we can understand the ideas behind LDA intuitively at a high level. LDA belongs to a class of models that are called generative models, as they have a sort of fable which explains how the data was generated.
This generative story is a simplification of reality, of course, to make machine learning easier. In the LDA fable, we first create topics by assigning probability weights to words. Each topic will assign different weights to different words. For example, a Python topic will assign high probability to the word "variable" and a low probability to the word "inebriated". When we wish to generate a new document, we first choose the topics it will use and then mix words from these topics.

For example, let's say we have only three topics that books discuss:

• Machine learning
• Python
• Baking

For each topic, we have a list of words associated with it. This book will be a mixture of the first two topics, perhaps 50 percent each.
The mixture does not need to be an equal split; it can also be a 70/30 split. When we generate the actual text, we generate it word by word; first we decide which topic this word will come from. This is a random decision based on the topic weights. Once a topic is chosen, we generate a word from that topic's list of words. To be precise, we choose a word in English with the probability given by the topic.

In this model, the order of words does not matter. This is a bag of words model, as we have already seen in the previous chapter. It is a crude simplification of language, but it often works well enough, because just knowing which words were used in a document and their frequencies is enough to make machine learning decisions.
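To make the fable concrete, here is a minimal sketch of that generative process in plain Python. The topics, words, and probability weights below are made up purely for illustration; a real LDA model has to learn them from data:

>>> import random
>>> # made-up topics: each maps words to probability weights
>>> topics = {
...     'python': {'variable': 0.5, 'function': 0.4, 'inebriated': 0.1},
...     'machine learning': {'model': 0.5, 'training': 0.3, 'error': 0.2},
... }
>>> mixture = {'python': 0.5, 'machine learning': 0.5}  # a 50/50 document
>>> def draw(weights):
...     # pick a key at random, proportionally to its weight
...     return random.choices(list(weights), list(weights.values()))[0]
...
>>> # for each word: first pick a topic, then pick a word from that topic
>>> doc = [draw(topics[draw(mixture)]) for _ in range(10)]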
In the real world, we do not know what the topics are. Our task is to take a collection of text and to reverse engineer this fable in order to discover what topics are out there and, simultaneously, figure out which topics each document uses.

Building a topic model

Unfortunately, scikit-learn does not support latent Dirichlet allocation. Therefore, we are going to use the gensim package in Python. Gensim is developed by Radim Řehůřek, a machine learning researcher and consultant in the United Kingdom. We must start by installing it. We can achieve this by running the following command:

pip install gensim

As input data, we are going to use a collection of news reports from the Associated Press (AP).
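As a preview of the core gensim calls we will rely on (a sketch only; the file paths and the number of topics are assumptions at this point, and loading the AP data is covered next), building an LDA model boils down to loading a bag-of-words corpus and fitting the model on it:

>>> from gensim import corpora, models
>>> # load a corpus stored in Blei's LDA-C format (placeholder paths)
>>> corpus = corpora.BleiCorpus('./data/ap/ap.dat', './data/ap/vocab.txt')
>>> # fit an LDA model with an assumed number of topics
>>> model = models.ldamodel.LdaModel(corpus,
...                                  num_topics=100,
...                                  id2word=corpus.id2word)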