The class provides several hooks with which we can customize the stage's preprocessing and tokenization. The preprocessor and tokenizer can be set as parameters in the constructor. We do not want to place the stemmer into any of them, because we would then have to do the tokenization and normalization by ourselves. Instead, we overwrite the build_analyzer method:

>>> import nltk.stem
>>> english_stemmer = nltk.stem.SnowballStemmer('english')
>>> class StemmedCountVectorizer(CountVectorizer):
...     def build_analyzer(self):
...         analyzer = super(StemmedCountVectorizer, self).build_analyzer()
...         return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
>>> vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')

This will do the following process for each post:

1. The first step is lower casing the raw post in the preprocessing step (done in the parent class).
2. Extracting all individual words in the tokenization step (done in the parent class).
3. This concludes with converting each word into its stemmed version.

As a result, we now have one feature less, because "images" and "imaging" collapsed to one. The set of feature names is now as follows:

[u'actual', u'capabl', u'contain', u'data', u'databas', u'imag', u'interest', u'learn', u'machin', u'perman', u'post', u'provid', u'save', u'storag', u'store', u'stuff', u'toy']

Running our new stemmed vectorizer over our posts, we see that collapsing "imaging" and "images" reveals that Post 2 is actually the most similar post to our new post, as it contains the concept "imag" twice:

=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases provide storage capabilities.
=== Post 2 with dist=0.63: Most imaging databases save images permanently.
=== Post 3 with dist=0.77: Imaging databases store data.
=== Post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 2 with dist=0.63
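If you want to convince yourself of what the stemmer does behind the scenes, you can feed it the two word forms directly (a quick check, reusing the english_stemmer instance created above):

>>> print(english_stemmer.stem("imaging"))
imag
>>> print(english_stemmer.stem("images"))
imag

Both forms collapse to the same stem, which is why they now end up in the same feature.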
Stop words on steroids

Now that we have a reasonable way to extract a compact vector from a noisy textual post, let's step back for a while to think about what the feature values actually mean. The feature values simply count occurrences of terms in a post. We silently assumed that higher values for a term also mean that the term is of greater importance to the given post.
But what about, for instance, the word "subject", which naturally occurs in each and every single post? Alright, we can tell CountVectorizer to remove it as well by means of its max_df parameter. We can, for instance, set it to 0.9 so that all words that occur in more than 90 percent of all posts will always be ignored.
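In code, this cut-off is just one more constructor parameter (a quick sketch reusing our StemmedCountVectorizer; the value of 0.9 is only an example threshold):

>>> vectorizer = StemmedCountVectorizer(min_df=1, max_df=0.9,
...                                     stop_words='english')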
But, what about words that appear in 89 percent of all posts? How low will we be willing to set max_df? The problem is that however we set it, some terms will always be more discriminative than others. This can only be solved by counting term frequencies for every post and, in addition, discounting those that appear in many posts. In other words, we want a high value for a given term in a given document if that term occurs often in that particular post and very seldom anywhere else.

This is exactly what term frequency – inverse document frequency (TF-IDF) does. TF stands for the counting part, while IDF factors in the discounting. A naïve implementation might look like this:

>>> import scipy as sp
>>> def tfidf(term, doc, corpus):
...     tf = doc.count(term) / len(doc)
...     num_docs_with_term = len([d for d in corpus if term in d])
...     idf = sp.log(len(corpus) / num_docs_with_term)
...     return tf * idf

You see that we did not simply count the terms, but also normalized the counts by the document length. This way, longer documents do not have an unfair advantage over shorter ones.

For the following document set, D, consisting of three already tokenized documents, we can see how the terms are treated differently, although all appear equally often per document:

>>> a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"]
>>> D = [a, abb, abc]
>>> print(tfidf("a", a, D))
0.0
>>> print(tfidf("a", abb, D))
0.0
>>> print(tfidf("a", abc, D))
0.0
>>> print(tfidf("b", abb, D))
0.270310072072
>>> print(tfidf("a", abc, D))
0.0
>>> print(tfidf("b", abc, D))
0.135155036036
>>> print(tfidf("c", abc, D))
0.366204096223

We see that a carries no meaning for any document since it is contained everywhere. The b term is more important for the document abb than for abc, as it occurs there twice; for example, tfidf("b", abb, D) = 2/3 * log(3/2) ≈ 0.27, using the natural logarithm.

In reality, there are more corner cases to handle than the preceding example does. Thanks to SciKit, we don't have to think of them, as they are already nicely packaged in TfidfVectorizer, which inherits from CountVectorizer.
Sure enough, we don't want to miss our stemmer:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> class StemmedTfidfVectorizer(TfidfVectorizer):
...     def build_analyzer(self):
...         analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
...         return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
>>> vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')

The resulting document vectors will not contain counts any more. Instead, they will contain the individual TF-IDF values per term.
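As a quick check, we can run the new vectorizer over the toy posts again and inspect one of the resulting vectors (a minimal sketch, assuming posts still holds the five example posts used earlier in this chapter):

>>> X_train = vectorizer.fit_transform(posts)
>>> print(vectorizer.get_feature_names())  # the same stemmed features as before
>>> print(X_train.getrow(3).toarray())     # TF-IDF weights instead of raw counts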
Our achievements and goals

Our current text pre-processing phase includes the following steps:

1. Firstly, tokenizing the text.
2. This is followed by throwing away words that occur way too often to be of any help in detecting relevant posts.
3. Throwing away words that occur so seldom that there is only a small chance that they occur in future posts.
4. Counting the remaining words.
5. Finally, calculating TF-IDF values from the counts, considering the whole text corpus.

Again, we can congratulate ourselves. With this process, we are able to convert a bunch of noisy text into a concise representation of feature values.

But, as simple and powerful as the bag of words approach with its extensions is, it has some drawbacks, which we should be aware of:

• It does not cover word relations: With the aforementioned vectorization approach, the text "Car hits wall" and "Wall hits car" will both have the same feature vector.
• It does not capture negations correctly: For instance, the text "I will eat ice cream" and "I will not eat ice cream" will look very similar by means of their feature vectors, although they convey quite the opposite meaning. This problem, however, can easily be remedied by not only counting individual words, also called "unigrams", but also considering bigrams (pairs of words) or trigrams (three words in a row), as the short sketch after this list shows.
• It totally fails with misspelled words: Although it is clear to us human readers that "database" and "databas" convey the same meaning, our approach will treat them as totally different words.
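The bigram remedy mentioned in the second point is, again, only a vectorizer parameter away. Here is a small sketch using a plain CountVectorizer on the two example sentences (ngram_range is the relevant scikit-learn parameter; the output is what this toy example produces):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
>>> X = bigram_vectorizer.fit_transform(["Car hits wall", "Wall hits car"])
>>> # The vocabulary now also contains 'car hits', 'hits wall',
>>> # 'hits car', and 'wall hits', so the two vectors are no longer identical:
>>> print(X.toarray())
[[1 1 1 0 1 1 0]
 [1 0 1 1 0 1 1]]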
For brevity's sake, let's nevertheless stick with the current approach, which we can now use to efficiently build clusters.

Clustering

Finally, we have our vectors, which we believe capture the posts to a sufficient degree. Not surprisingly, there are many ways to group them together. Most clustering algorithms fall into one of two approaches: flat and hierarchical clustering.

Flat clustering divides the posts into a set of clusters without relating the clusters to each other. The goal is simply to come up with a partitioning such that all posts in one cluster are most similar to each other while being dissimilar from the posts in all other clusters. Many flat clustering algorithms require the number of clusters to be specified up front.

In hierarchical clustering, the number of clusters does not have to be specified. Instead, hierarchical clustering creates a hierarchy of clusters. While similar posts are grouped into one cluster, similar clusters are again grouped into one uber-cluster. This is done recursively, until only one cluster is left that contains everything.
In this hierarchy, one can then choose the desired number of clusters after the fact. However, this comes at the cost of lower efficiency.

SciKit provides a wide range of clustering approaches in the sklearn.cluster package. You can get a quick overview of the advantages and drawbacks of each of them at http://scikit-learn.org/dev/modules/clustering.html.

In the following sections, we will use the flat clustering method K-means and play a bit with the desired number of clusters.

K-means

K-means is the most widely used flat clustering algorithm. After initializing it with the desired number of clusters, num_clusters, it maintains that number of so-called cluster centroids. Initially, it will pick any num_clusters posts and set the centroids to their feature vectors.
Then it will go through all other posts and assign them the nearest centroid as their current cluster. Following this, it will move each centroid into the middle of all the vectors of that particular class. This, of course, changes the cluster assignment: some posts are now nearer to another cluster, so it will update the assignments for those changed posts. This is repeated as long as the centroids still move considerably.
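Before we look at the details, here is roughly what this looks like in code (a minimal sketch only; X_train is assumed to hold the TF-IDF vectors of the posts we want to group, and the choice of num_clusters is arbitrary here, as we will play with it later):

>>> from sklearn.cluster import KMeans
>>> num_clusters = 3  # arbitrary guess for this sketch
>>> km = KMeans(n_clusters=num_clusters, random_state=3)
>>> km.fit(X_train)        # run the iterative assign/update procedure described above
>>> km.labels_             # the cluster index assigned to each post
>>> km.cluster_centers_    # the final centroid vectors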