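This excerpt picks up in the middle of a worked example. For the snippets below to run on their own, a setup along the following lines is assumed; the five toy posts are reconstructed here from the distance printouts further down, so treat this as a sketch rather than the chapter's original code (which reads the posts from files):

from sklearn.feature_extraction.text import CountVectorizer

# Toy posts, reconstructed from the distance printouts shown later in this excerpt.
posts = [
    "This is a toy post about machine learning. Actually, it contains "
    "not much interesting stuff.",
    "Imaging databases provide storage capabilities.",
    "Most imaging databases save images permanently.",
    "Imaging databases store data.",
    "Imaging databases store data. Imaging databases store data. "
    "Imaging databases store data.",
]

vectorizer = CountVectorizer(min_df=1)
X_train = vectorizer.fit_transform(posts)    # sparse matrix of word counts
num_samples, num_features = X_train.shape    # here: 5 posts, 25 distinct words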
The following words that have been tokenized will be counted:

>>> print(vectorizer.get_feature_names())
[u'about', u'actually', u'capabilities', u'contains', u'data',
 u'databases', u'images', u'imaging', u'interesting', u'is', u'it',
 u'learning', u'machine', u'most', u'much', u'not', u'permanently',
 u'post', u'provide', u'save', u'storage', u'store', u'stuff',
 u'this', u'toy']

Now we can vectorize our new post:

>>> new_post = "imaging databases"
>>> new_post_vec = vectorizer.transform([new_post])

Note that the count vectors returned by the transform method are sparse. That is, each vector does not store one count value for each word, as most of those counts will be zero (the post does not contain the word). Instead, it uses the more memory-efficient implementation coo_matrix (for "COOrdinate"). Our new post, for instance, actually contains only two elements:

>>> print(new_post_vec)
  (0, 7)    1
  (0, 5)    1

Via its toarray() member, we can once again access the full ndarray:

>>> print(new_post_vec.toarray())
[[0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

We need to use the full array if we want to use it as a vector for similarity calculations. For the similarity measurement (the naïve one), we calculate the Euclidean distance between the count vectors of the new post and all the old posts:

>>> import scipy as sp
>>> def dist_raw(v1, v2):
...     delta = v1 - v2
...     return sp.linalg.norm(delta.toarray())

The norm() function calculates the Euclidean norm (shortest distance). This is just one obvious first pick, and there are many more interesting ways to calculate the distance. Just take a look at the paper Distance Coefficients between Two Lists or Sets in The Python Papers Source Codes, in which Maurice Ling nicely presents 35 different ones.
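As one illustration of such an alternative (not used further in this chapter), a cosine-based distance could be computed with SciPy's scipy.spatial.distance.cosine; the helper name dist_cosine is introduced here purely for illustration:

from scipy.spatial import distance

def dist_cosine(v1, v2):
    # cosine() expects dense 1-D arrays, so flatten the sparse row vectors first
    return distance.cosine(v1.toarray().ravel(), v2.toarray().ravel())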
With dist_raw, we just need to iterate over all the posts and remember the nearest one:

>>> import sys
>>> best_doc = None
>>> best_dist = sys.maxint
>>> best_i = None
>>> for i, post in enumerate(posts):
...     if post == new_post:
...         continue
...     post_vec = X_train.getrow(i)
...     d = dist_raw(post_vec, new_post_vec)
...     print("=== Post %i with dist=%.2f: %s" % (i, d, post))
...     if d < best_dist:
...         best_dist = d
...         best_i = i
>>> print("Best post is %i with dist=%.2f" % (best_i, best_dist))
=== Post 0 with dist=4.00: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=1.73: Imaging databases provide storage capabilities.
=== Post 2 with dist=2.00: Most imaging databases save images permanently.
=== Post 3 with dist=1.41: Imaging databases store data.
=== Post 4 with dist=5.10: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=1.41

Congratulations, we have our first similarity measurement. Post 0 is most dissimilar from our new post. Quite understandably, it does not have a single word in common with the new post. We can also understand that Post 1 is very similar to the new post but not the winner, as it contains one word more than Post 3, a word that is not contained in the new post.

Looking at Post 3 and Post 4, however, the picture is not so clear any more.
Post 4 is the same as Post 3, duplicated three times. So, it should be just as similar to the new post as Post 3.

Printing the corresponding feature vectors explains why:

>>> print(X_train.getrow(3).toarray())
[[0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]]
>>> print(X_train.getrow(4).toarray())
[[0 0 0 0 3 3 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0]]

Obviously, using only the counts of the raw words is too simple. We will have to normalize them to get vectors of unit length.

Normalizing word count vectors

We will have to extend dist_raw to calculate the vector distance not on the raw vectors but on the normalized ones instead:

>>> def dist_norm(v1, v2):
...     v1_normalized = v1 / sp.linalg.norm(v1.toarray())
...     v2_normalized = v2 / sp.linalg.norm(v2.toarray())
...     delta = v1_normalized - v2_normalized
...     return sp.linalg.norm(delta.toarray())
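The measurements below are obtained by rerunning the earlier search loop with dist_norm in place of dist_raw. Wrapping that loop in a small helper makes this convenient; best_post is a name introduced here for illustration, not part of the chapter's code:

def best_post(dist_fn):
    # Re-run the nearest-post search with an arbitrary distance function.
    best_dist, best_i = float("inf"), None
    for i, post in enumerate(posts):
        if post == new_post:
            continue
        d = dist_fn(X_train.getrow(i), new_post_vec)
        print("=== Post %i with dist=%.2f: %s" % (i, d, post))
        if d < best_dist:
            best_dist, best_i = d, i
    print("Best post is %i with dist=%.2f" % (best_i, best_dist))

best_post(dist_norm)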
This leads to the following similarity measurement:

=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases provide storage capabilities.
=== Post 2 with dist=0.92: Most imaging databases save images permanently.
=== Post 3 with dist=0.77: Imaging databases store data.
=== Post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=0.77

This looks a bit better now. Post 3 and Post 4 are calculated as being equally similar. One could argue whether that much repetition would be a delight to the reader, but from the point of view of counting the words in the posts, this seems right.
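A quick check (a sketch reusing the chapter's X_train) shows why the two posts are now tied: after dividing each count vector by its Euclidean norm, Post 3 and Post 4 collapse to the same unit vector:

from scipy.linalg import norm

v3 = X_train.getrow(3).toarray()
v4 = X_train.getrow(4).toarray()

# The raw counts differ by a factor of three, but after normalization
# both rows collapse to the same unit vector.
print(v3 / norm(v3))
print(v4 / norm(v4))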
Removing less important words

Let's have another look at Post 2. Of its words that are not in the new post, we have "most", "save", "images", and "permanently". They actually differ quite a bit in their overall importance to the post. Words such as "most" appear very often in all sorts of different contexts; they do not carry much information and thus should not be weighed as heavily as words such as "images", which occur rarely and in few contexts. The best option would be to remove all the words that are so frequent that they do not help to distinguish between different texts. These words are called stop words.

As this is such a common step in text processing, there is a simple parameter in CountVectorizer to achieve it:

>>> vectorizer = CountVectorizer(min_df=1, stop_words='english')

If you have a clear picture of what kind of stop words you would want to remove, you can also pass a list of them.
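For instance, a hand-picked list can be passed instead of the built-in set (the words below are chosen purely for illustration; the chapter itself keeps using stop_words='english'):

from sklearn.feature_extraction.text import CountVectorizer

# stop_words also accepts an explicit list of words to ignore
custom_vectorizer = CountVectorizer(min_df=1,
                                    stop_words=["is", "this", "a", "about"])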
Setting stop_words to "english" will use a set of 318 English stop words. To find out which ones, you can use get_stop_words():

>>> sorted(vectorizer.get_stop_words())[0:20]
['a', 'about', 'above', 'across', 'after', 'afterwards', 'again',
 'against', 'all', 'almost', 'alone', 'along', 'already', 'also',
 'although', 'always', 'am', 'among', 'amongst', 'amoungst']

The new word list is seven words lighter:

[u'actually', u'capabilities', u'contains', u'data', u'databases',
 u'images', u'imaging', u'interesting', u'learning', u'machine',
 u'permanently', u'post', u'provide', u'save', u'storage', u'store',
 u'stuff', u'toy']

Without stop words, we arrive at the following similarity measurement:

=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases provide storage capabilities.
=== Post 2 with dist=0.86: Most imaging databases save images permanently.
=== Post 3 with dist=0.77: Imaging databases store data.
=== Post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=0.77

Post 2 is now on par with Post 1. Overall, however, not much has changed, since our posts are kept short for demonstration purposes; stop word removal will become vital when we look at real-world data.

Stemming

One thing is still missing. We count similar words in different variants as different words. Post 2, for instance, contains "imaging" and "images".
It would make sense to count them together. After all, it is the same concept they are referring to.

We need a function that reduces words to their specific word stem. SciKit does not contain a stemmer by default. With the Natural Language Toolkit (NLTK), we can download a free software toolkit that provides a stemmer we can easily plug into CountVectorizer.

Installing and using NLTK

How to install NLTK on your operating system is described in detail at http://nltk.org/install.html. Unfortunately, it is not yet officially supported for Python 3, which means that pip install will not work either. We can, however, download the package from http://www.nltk.org/nltk3-alpha/ and install it manually after uncompressing it, using Python's setup.py install.

To check whether your installation was successful, open a Python interpreter and type:

>>> import nltk

You will find a very nice tutorial on NLTK in the book Python 3 Text Processing with NLTK 3 Cookbook, Jacob Perkins, Packt Publishing.
To play a little bit with a stemmer, you can visit the web page http://text-processing.com/demo/stem/.

NLTK comes with different stemmers. This is necessary, because every language has a different set of rules for stemming. For English, we can take SnowballStemmer:

>>> import nltk.stem
>>> s = nltk.stem.SnowballStemmer('english')
>>> s.stem("graphics")
u'graphic'
>>> s.stem("imaging")
u'imag'
>>> s.stem("image")
u'imag'
>>> s.stem("imagination")
u'imagin'
>>> s.stem("imagine")
u'imagin'

Note that stemming does not necessarily have to result in valid English words.

It also works with verbs:

>>> s.stem("buys")
u'buy'
>>> s.stem("buying")
u'buy'

This means it works most of the time, though not always:

>>> s.stem("bought")
u'bought'

Extending the vectorizer with NLTK's stemmer

We need to stem the posts before we feed them into CountVectorizer.
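One way to plug the stemmer in (a minimal sketch, assuming the SnowballStemmer from above; the chapter continues from here with its own version) is to subclass CountVectorizer and override its build_analyzer() method so that every token is stemmed after the standard tokenization and stop-word handling:

import nltk.stem
from sklearn.feature_extraction.text import CountVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the parent's preprocessing, tokenization, and stop-word removal,
        # then stem each resulting token.
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')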