An example will give us a quick impression of the noise that we have to expect. For the sake of simplicity, we will focus on one of the shorter posts:

>>> post_group = zip(train_data.data, train_data.target)
>>> all = [(len(post[0]), post[0], train_data.target_names[post[1]])
           for post in post_group]
>>> graphics = sorted([post for post in all if
                       post[2]=='comp.graphics'])
>>> print(graphics[5])
(245, 'From: SITUNAYA@IBM3090.BHAM.AC.UK\nSubject: test....(sorry)\nOrganization: The University of Birmingham, United Kingdom\nLines: 1\nNNTP-Posting-Host: ibm3090.bham.ac.uk<…snip…>', 'comp.graphics')

For this post, there is no real indication that it belongs to comp.graphics, considering only the wording that is left after the preprocessing step:

>>> noise_post = graphics[5][1]
>>> analyzer = vectorizer.build_analyzer()
>>> print(list(analyzer(noise_post)))
['situnaya', 'ibm3090', 'bham', 'ac', 'uk', 'subject', 'test', 'sorri', 'organ', 'univers', 'birmingham', 'unit', 'kingdom', 'line', 'nntp', 'post', 'host', 'ibm3090', 'bham', 'ac', 'uk']

This is only after tokenization, lowercasing, and stop word removal.
If we also subtract those words that will later be filtered out via min_df and max_df, which happens in fit_transform, it gets even worse:

>>> useful = set(analyzer(noise_post)).intersection(vectorizer.get_feature_names())
>>> print(sorted(useful))
['ac', 'birmingham', 'host', 'kingdom', 'nntp', 'sorri', 'test', 'uk', 'unit', 'univers']

What is more, most of the words occur frequently in other posts as well, as we can check with the IDF scores.
Remember that the higher the TF-IDF value, the more discriminative a term is for a given post. As IDF is a multiplicative factor here, a low IDF value signals that the term is not of great value in general:

>>> for term in sorted(useful):
...     print('IDF(%s)=%.2f' % (term, vectorizer._tfidf.idf_[vectorizer.vocabulary_[term]]))
IDF(ac)=3.51
IDF(birmingham)=6.77
IDF(host)=1.74
IDF(kingdom)=6.68
IDF(nntp)=1.77
IDF(sorri)=4.14
IDF(test)=3.83
IDF(uk)=3.70
IDF(unit)=4.42
IDF(univers)=1.91

So, the terms with the highest discriminative power, birmingham and kingdom, are clearly not that related to computer graphics, and the same is the case for the terms with lower IDF scores.
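As a quick reminder of why common terms receive low scores: in one standard formulation (scikit-learn's smoothed variant differs slightly in its constants), the scores are

$$\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)} + 1, \qquad \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t),$$

where N is the total number of posts and df(t) is the number of posts containing the term t. A term such as univers, which occurs in a sizable fraction of all posts, therefore ends up with an IDF close to 1 and contributes little discriminative power.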
Understandably, posts from different newsgroups will be clustered together.

For our goal, however, this is no big deal, as we are only interested in cutting down the number of posts that we have to compare a new post to. After all, the particular newsgroup our training data came from is of no special interest.

Tweaking the parameters

So what about all the other parameters? Can we tweak them to get better results? Sure. We can, of course, tweak the number of clusters, or play with the vectorizer's max_features parameter (you should try that!).
Also, we can play with different cluster center initializations. Then there are more exciting alternatives to K-means itself. There are, for example, clustering approaches that even let you use different similarity measurements, such as Cosine similarity, Pearson, or Jaccard. An exciting field for you to play in.

But before you go there, you will have to define what you actually mean by "better". SciKit has a complete package dedicated only to this definition. The package is called sklearn.metrics and contains a full range of different metrics to measure clustering quality. Maybe that should be the first place to go now, right into the sources of the metrics package.
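To make this concrete, here is a minimal sketch of how such an experiment could look, combining the tuning knobs mentioned above with two metrics from sklearn.metrics. It uses the plain TfidfVectorizer for brevity instead of the stemmed variant built earlier in the chapter, and the values of max_features, n_clusters, sample_size, and random_state are placeholders to experiment with, not recommendations:

>>> from sklearn.cluster import KMeans
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn import metrics
>>> # cap the vocabulary size - one of the knobs worth playing with
>>> vect = TfidfVectorizer(max_features=5000, stop_words='english',
...                        decode_error='ignore')
>>> X = vect.fit_transform(train_data.data)
>>> km = KMeans(n_clusters=50, random_state=3)
>>> labels = km.fit_predict(X)
>>> # with the newsgroup labels available as ground truth
>>> print(metrics.adjusted_rand_score(train_data.target, labels))
>>> # without ground truth: how cohesive are the clusters themselves?
>>> print(metrics.silhouette_score(X, labels, sample_size=1000))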
Summary

That was a tough ride, from preprocessing through clustering to a solution that can convert noisy text into a meaningful, concise vector representation that we can cluster. If we look at the effort it took before we were finally able to cluster, it was more than half of the overall task.
But on the way, we learned quite a bit about text processing and how far simple counting can get you with noisy real-world data. The ride has been made much smoother, though, because of SciKit and its powerful packages. And there is more to explore. In this chapter, we only scratched the surface of its capabilities. In the next chapters, we will see more of its power.

Topic Modeling

In the previous chapter, we grouped text documents using clustering. This is a very useful tool, but it is not always the best.
Clustering results in each text belonging to exactly one cluster. This book is about machine learning and Python. Should it be grouped with other Python-related works or with machine learning-related works? In a physical bookstore, we would need a single place to stock the book. In an Internet store, however, the answer is that this book is about both machine learning and Python, and it should be listed in both sections. This does not mean that the book will be listed in all the sections, of course. We will not list this book with other baking books.

In this chapter, we will learn methods that do not cluster documents into completely separate groups but allow each document to refer to several topics.
These topics will be identified automatically from a collection of text documents. These documents may be whole books or shorter pieces of text such as a blog post, a news story, or an e-mail.

We would also like to be able to infer the fact that these documents may have topics that are central to them, while referring to other topics only in passing.
This book mentions plotting every so often, but it is not a central topic as machine learning is. This means that documents have topics that are central to them and others that are more peripheral. The subfield of machine learning that deals with these problems is called topic modeling and is the subject of this chapter.

Latent Dirichlet allocation

LDA and LDA: unfortunately, there are two methods in machine learning with the initials LDA: latent Dirichlet allocation, which is a topic modeling method, and linear discriminant analysis, which is a classification method. They are completely unrelated, except for the fact that the initials LDA can refer to either. In certain situations, this can be confusing.
The scikit-learn tool has a submodule, sklearn.lda, which implements linear discriminant analysis. At the moment, scikit-learn does not implement latent Dirichlet allocation.

The topic model we will look at is latent Dirichlet allocation (LDA). The mathematical ideas behind LDA are fairly complex, and we will not go into the details here. For those who are interested, and adventurous enough, Wikipedia will provide all the equations behind these algorithms: http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation.

However, we can understand the ideas behind LDA intuitively at a high level. LDA belongs to a class of models that are called generative models, as they have a sort of fable which explains how the data was generated.
This generative story is a simplification of reality, of course, to make machine learning easier. In the LDA fable, we first create topics by assigning probability weights to words. Each topic will assign different weights to different words. For example, a Python topic will assign high probability to the word "variable" and a low probability to the word "inebriated". When we wish to generate a new document, we first choose the topics it will use and then mix words from these topics.

For example, let's say we have only three topics that books discuss:

• Machine learning
• Python
• Baking

For each topic, we have a list of words associated with it. This book will be a mixture of the first two topics, perhaps 50 percent each.
The mixture does not need to be an equal split; it can also be a 70/30 split. When we generate the actual text, we generate it word by word; first we decide which topic this word will come from. This is a random decision based on the topic weights. Once a topic is chosen, we generate a word from that topic's list of words. To be precise, we choose a word in English with the probability given by the topic.

In this model, the order of words does not matter. This is a bag of words model, as we have already seen in the previous chapter. It is a crude simplification of language, but it often works well enough, because just knowing which words were used in a document and their frequencies is enough to make machine learning decisions.
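To make the fable concrete, here is a minimal sketch of that generative process in plain Python. The topics, words, and probability weights below are made up purely for illustration; a real LDA model has to learn them from data:

>>> import random
>>> # made-up topics: each maps words to probability weights
>>> topics = {
...     'python': {'variable': 0.5, 'function': 0.4, 'inebriated': 0.1},
...     'machine learning': {'model': 0.5, 'training': 0.3, 'error': 0.2},
... }
>>> mixture = {'python': 0.5, 'machine learning': 0.5}  # a 50/50 document
>>> def draw(weights):
...     # pick a key at random, proportionally to its weight
...     return random.choices(list(weights), list(weights.values()))[0]
...
>>> # for each word: first pick a topic, then pick a word from that topic
>>> doc = [draw(topics[draw(mixture)]) for _ in range(10)]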
In the real world, we do not know what the topics are. Our task is to take a collection of text and to reverse engineer this fable in order to discover what topics are out there and, simultaneously, figure out which topics each document uses.

Building a topic model

Unfortunately, scikit-learn does not support latent Dirichlet allocation. Therefore, we are going to use the gensim package in Python. Gensim is developed by Radim Řehůřek, a machine learning researcher and consultant in the United Kingdom. We must start by installing it. We can achieve this by running the following command:

pip install gensim

As input data, we are going to use a collection of news reports from the Associated Press (AP).
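As a preview of the core gensim calls we will rely on (a sketch only; the file paths and the number of topics are assumptions at this point, and loading the AP data is covered next), building an LDA model boils down to loading a bag-of-words corpus and fitting the model on it:

>>> from gensim import corpora, models
>>> # load a corpus stored in Blei's LDA-C format (placeholder paths)
>>> corpus = corpora.BleiCorpus('./data/ap/ap.dat', './data/ap/vocab.txt')
>>> # fit an LDA model with an assumed number of topics
>>> model = models.ldamodel.LdaModel(corpus,
...                                  num_topics=100,
...                                  id2word=corpus.id2word)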