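This excerpt picks up in the middle of a worked example. For the snippets below to run on their own, a setup along the following lines is assumed; the five toy posts are reconstructed here from the distance printouts further down, so treat this as a sketch rather than the chapter's original code (which reads the posts from files):

from sklearn.feature_extraction.text import CountVectorizer

# Toy posts, reconstructed from the distance printouts shown later in this excerpt.
posts = [
    "This is a toy post about machine learning. Actually, it contains "
    "not much interesting stuff.",
    "Imaging databases provide storage capabilities.",
    "Most imaging databases save images permanently.",
    "Imaging databases store data.",
    "Imaging databases store data. Imaging databases store data. "
    "Imaging databases store data.",
]

vectorizer = CountVectorizer(min_df=1)
X_train = vectorizer.fit_transform(posts)    # sparse matrix of word counts
num_samples, num_features = X_train.shape    # here: 5 posts, 25 distinct words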
The following words that have been tokenized will be counted:

>>> print(vectorizer.get_feature_names())
[u'about', u'actually', u'capabilities', u'contains', u'data',
 u'databases', u'images', u'imaging', u'interesting', u'is', u'it',
 u'learning', u'machine', u'most', u'much', u'not', u'permanently',
 u'post', u'provide', u'save', u'storage', u'store', u'stuff',
 u'this', u'toy']

Now we can vectorize our new post:

>>> new_post = "imaging databases"
>>> new_post_vec = vectorizer.transform([new_post])

Note that the count vectors returned by the transform method are sparse. That is, each vector does not store one count value for each word, as most of those counts will be zero (the post does not contain the word). Instead, it uses the more memory-efficient implementation coo_matrix (for "COOrdinate"). Our new post, for instance, actually contains only two elements:

>>> print(new_post_vec)
  (0, 7)    1
  (0, 5)    1

Via its toarray() member, we can once again access the full ndarray:

>>> print(new_post_vec.toarray())
[[0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

We need to use the full array if we want to use it as a vector for similarity calculations. For the similarity measurement (the naïve one), we calculate the Euclidean distance between the count vectors of the new post and all the old posts:

>>> import scipy as sp
>>> def dist_raw(v1, v2):
...     delta = v1 - v2
...     return sp.linalg.norm(delta.toarray())

The norm() function calculates the Euclidean norm (shortest distance). This is just one obvious first pick, and there are many more interesting ways to calculate the distance. Just take a look at the paper Distance Coefficients between Two Lists or Sets in The Python Papers Source Codes, in which Maurice Ling nicely presents 35 different ones.
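As one illustration of such an alternative (not used further in this chapter), a cosine-based distance could be computed with SciPy's scipy.spatial.distance.cosine; the helper name dist_cosine is introduced here purely for illustration:

from scipy.spatial import distance

def dist_cosine(v1, v2):
    # cosine() expects dense 1-D arrays, so flatten the sparse row vectors first
    return distance.cosine(v1.toarray().ravel(), v2.toarray().ravel())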
With dist_raw, we just need to iterate over all the posts and remember the nearest one:

>>> import sys
>>> best_doc = None
>>> best_dist = sys.maxint
>>> best_i = None
>>> for i, post in enumerate(posts):
...     if post == new_post:
...         continue
...     post_vec = X_train.getrow(i)
...     d = dist_raw(post_vec, new_post_vec)
...     print("=== Post %i with dist=%.2f: %s" % (i, d, post))
...     if d < best_dist:
...         best_dist = d
...         best_i = i
>>> print("Best post is %i with dist=%.2f" % (best_i, best_dist))
=== Post 0 with dist=4.00: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=1.73: Imaging databases provide storage capabilities.
=== Post 2 with dist=2.00: Most imaging databases save images permanently.
=== Post 3 with dist=1.41: Imaging databases store data.
=== Post 4 with dist=5.10: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=1.41

Congratulations, we have our first similarity measurement. Post 0 is most dissimilar from our new post. Quite understandably, it does not have a single word in common with the new post. We can also understand that Post 1 is very similar to the new post but not the winner, as it contains one word more than Post 3, a word that is not contained in the new post.

Looking at Post 3 and Post 4, however, the picture is not so clear any more.
Post 4 is the same as Post 3, duplicated three times. So, it should be just as similar to the new post as Post 3.

Printing the corresponding feature vectors explains why:

>>> print(X_train.getrow(3).toarray())
[[0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]]
>>> print(X_train.getrow(4).toarray())
[[0 0 0 0 3 3 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0]]

Obviously, using only the counts of the raw words is too simple. We will have to normalize them to get vectors of unit length.

Normalizing word count vectors

We will have to extend dist_raw to calculate the vector distance not on the raw vectors but on the normalized ones instead:

>>> def dist_norm(v1, v2):
...     v1_normalized = v1 / sp.linalg.norm(v1.toarray())
...     v2_normalized = v2 / sp.linalg.norm(v2.toarray())
...     delta = v1_normalized - v2_normalized
...     return sp.linalg.norm(delta.toarray())
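The measurements below are obtained by rerunning the earlier search loop with dist_norm in place of dist_raw. Wrapping that loop in a small helper makes this convenient; best_post is a name introduced here for illustration, not part of the chapter's code:

def best_post(dist_fn):
    # Re-run the nearest-post search with an arbitrary distance function.
    best_dist, best_i = float("inf"), None
    for i, post in enumerate(posts):
        if post == new_post:
            continue
        d = dist_fn(X_train.getrow(i), new_post_vec)
        print("=== Post %i with dist=%.2f: %s" % (i, d, post))
        if d < best_dist:
            best_dist, best_i = d, i
    print("Best post is %i with dist=%.2f" % (best_i, best_dist))

best_post(dist_norm)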
This leads to the following similarity measurement:

=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases provide storage capabilities.
=== Post 2 with dist=0.92: Most imaging databases save images permanently.
=== Post 3 with dist=0.77: Imaging databases store data.
=== Post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=0.77

This looks a bit better now. Post 3 and Post 4 are calculated as being equally similar. One could argue whether that much repetition would be a delight to the reader, but from the point of view of counting the words in the posts, this seems right.
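A quick check (a sketch reusing the chapter's X_train) shows why the two posts are now tied: after dividing each count vector by its Euclidean norm, Post 3 and Post 4 collapse to the same unit vector:

from scipy.linalg import norm

v3 = X_train.getrow(3).toarray()
v4 = X_train.getrow(4).toarray()

# The raw counts differ by a factor of three, but after normalization
# both rows collapse to the same unit vector.
print(v3 / norm(v3))
print(v4 / norm(v4))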
Removing less important words

Let's have another look at Post 2. Of its words that are not in the new post, we have "most", "save", "images", and "permanently". They actually differ quite a bit in their overall importance to the post. Words such as "most" appear very often in all sorts of different contexts; they do not carry much information and thus should not be weighed as heavily as words such as "images", which occur rarely and in few contexts. The best option would be to remove all the words that are so frequent that they do not help to distinguish between different texts. These words are called stop words.

As this is such a common step in text processing, there is a simple parameter in CountVectorizer to achieve it:

>>> vectorizer = CountVectorizer(min_df=1, stop_words='english')

If you have a clear picture of what kind of stop words you would want to remove, you can also pass a list of them.
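For instance, a hand-picked list can be passed instead of the built-in set (the words below are chosen purely for illustration; the chapter itself keeps using stop_words='english'):

from sklearn.feature_extraction.text import CountVectorizer

# stop_words also accepts an explicit list of words to ignore
custom_vectorizer = CountVectorizer(min_df=1,
                                    stop_words=["is", "this", "a", "about"])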
Setting stop_words to "english" will use a set of 318 English stop words. To find out which ones, you can use get_stop_words():

>>> sorted(vectorizer.get_stop_words())[0:20]
['a', 'about', 'above', 'across', 'after', 'afterwards', 'again',
 'against', 'all', 'almost', 'alone', 'along', 'already', 'also',
 'although', 'always', 'am', 'among', 'amongst', 'amoungst']

The new word list is seven words lighter:

[u'actually', u'capabilities', u'contains', u'data', u'databases',
 u'images', u'imaging', u'interesting', u'learning', u'machine',
 u'permanently', u'post', u'provide', u'save', u'storage', u'store',
 u'stuff', u'toy']

Without stop words, we arrive at the following similarity measurement:

=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases provide storage capabilities.
=== Post 2 with dist=0.86: Most imaging databases save images permanently.
=== Post 3 with dist=0.77: Imaging databases store data.
=== Post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=0.77

Post 2 is now on par with Post 1. Overall, however, not much has changed, since our posts are kept short for demonstration purposes; stop word removal will become vital when we look at real-world data.

Stemming

One thing is still missing. We count similar words in different variants as different words. Post 2, for instance, contains "imaging" and "images".
It would make sense to count them together. After all, it is the same concept they are referring to.

We need a function that reduces words to their specific word stem. SciKit does not contain a stemmer by default. With the Natural Language Toolkit (NLTK), we can download a free software toolkit that provides a stemmer we can easily plug into CountVectorizer.

Installing and using NLTK

How to install NLTK on your operating system is described in detail at http://nltk.org/install.html. Unfortunately, it is not yet officially supported for Python 3, which means that pip install will not work either. We can, however, download the package from http://www.nltk.org/nltk3-alpha/ and install it manually after uncompressing it, using Python's setup.py install.

To check whether your installation was successful, open a Python interpreter and type:

>>> import nltk

You will find a very nice tutorial on NLTK in the book Python 3 Text Processing with NLTK 3 Cookbook, Jacob Perkins, Packt Publishing.
To play a little bit with a stemmer, you can visit the web page http://text-processing.com/demo/stem/.

NLTK comes with different stemmers. This is necessary, because every language has a different set of rules for stemming. For English, we can take SnowballStemmer:

>>> import nltk.stem
>>> s = nltk.stem.SnowballStemmer('english')
>>> s.stem("graphics")
u'graphic'
>>> s.stem("imaging")
u'imag'
>>> s.stem("image")
u'imag'
>>> s.stem("imagination")
u'imagin'
>>> s.stem("imagine")
u'imagin'

Note that stemming does not necessarily have to result in valid English words.

It also works with verbs:

>>> s.stem("buys")
u'buy'
>>> s.stem("buying")
u'buy'

This means it works most of the time, though not always:

>>> s.stem("bought")
u'bought'

Extending the vectorizer with NLTK's stemmer

We need to stem the posts before we feed them into CountVectorizer.
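One way to plug the stemmer in (a minimal sketch, assuming the SnowballStemmer from above; the chapter continues from here with its own version) is to subclass CountVectorizer and override its build_analyzer() method so that every token is stemmed after the standard tokenization and stop-word handling:

import nltk.stem
from sklearn.feature_extraction.text import CountVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the parent's preprocessing, tokenization, and stop-word removal,
        # then stem each resulting token.
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')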