A different number of topics, or different values for parameters such as alpha, will result in systems whose end results are almost identical. On the other hand, if you are going to explore the topics directly, or build a visualization tool that exposes them, you should probably try a few values and see which gives you the most useful or most appealing results.

Alternatively, there are a few methods that will automatically determine the number of topics for you, depending on the dataset. One popular model is called the hierarchical Dirichlet process.
Again, the full mathematical model behind it is complex and beyond the scope of this book. However, the fable we can tell is that instead of having the topics fixed first, as in the LDA generative story, the topics themselves are generated along with the data, one at a time. Whenever the writer starts a new document, they have the option of using the topics that already exist or of creating a completely new one. The more topics that have already been created, the lower the probability of creating a new one instead of reusing an existing one, but the possibility always exists. This means that the more documents we have, the more topics we will end up with.
This is one of those statements that is unintuitive at first, but makes perfect sense upon reflection. We are grouping documents, and the more examples we have, the more finely we can break them up. If we only have a few examples of news articles, then "Sports" will be a topic. However, as we have more, we start to break it up into the individual modalities: "Hockey", "Soccer", and so on. As we have even more data, we can start to tell nuances apart, such as articles about individual teams and even individual players. The same is true for people.
In a group of many different backgrounds, with a few "computer people", you might put them together; in a slightly larger group, you will have separate gatherings for programmers and systems administrators; and in the real world, we even have different gatherings for Python and Ruby programmers.

The hierarchical Dirichlet process (HDP) is available in gensim. Using it is trivial. To adapt the code we wrote for LDA, we just need to replace the call to gensim.models.ldamodel.LdaModel with a call to the HdpModel constructor as follows:

>>> hdp = gensim.models.hdpmodel.HdpModel(mm, id2word)

That's it (except that it takes a bit longer to compute; there are no free lunches). Now, we can use this model in much the same way as we used the LDA model, except that we did not need to specify the number of topics.
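If you want to see the two models side by side, the following is a minimal sketch; the corpus and dictionary file names are placeholders standing in for whatever was saved in the earlier LDA example:

from gensim import corpora, models

# Load the corpus and dictionary built earlier in the chapter
# (the file names here are placeholders).
mm = corpora.MmCorpus('wiki_en_tfidf.mm')
id2word = corpora.Dictionary.load_from_text('wiki_en_wordids.txt')

# LDA: the number of topics must be fixed up front.
lda = models.ldamodel.LdaModel(mm, id2word=id2word, num_topics=100)

# HDP: no num_topics parameter; the number of topics is inferred
# from the data itself.
hdp = models.hdpmodel.HdpModel(mm, id2word)

# Both models expose their topics through the same interface.
print(hdp.print_topics())

Note that HdpModel takes the dictionary as its second argument; the rest of the usage mirrors the LDA model.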
Summary

In this chapter, we discussed topic modeling. Topic modeling is more flexible than clustering, as these methods allow each document to be partially present in more than one group. To explore these methods, we used a new package, gensim.

Topic modeling was first developed, and is easier to understand, in the case of text, but in the computer vision chapter we will see how some of these techniques may be applied to images as well. Topic models are very important in modern computer vision research. In fact, unlike the previous chapters, this chapter was very close to the cutting edge of research in machine learning algorithms. The original LDA algorithm was published in a scientific journal in 2003, but the method that gensim uses to be able to handle Wikipedia was only developed in 2010, and the HDP algorithm is from 2011.
The research continues, and you can find many variations and models with wonderful names such as the Indian buffet process (not to be confused with the Chinese restaurant process, which is a different model), or Pachinko allocation (Pachinko being a type of Japanese game, a cross between a slot machine and pinball).

We have now gone through some of the major machine learning modes: classification, clustering, and topic modeling. In the next chapter, we go back to classification, but this time, we will be exploring advanced algorithms and approaches.

Classification – Detecting Poor Answers

Now that we are able to extract useful features from text, we can take on the challenge of building a classifier using real data. Let's come back to our imaginary website in Chapter 3, Clustering – Finding Related Posts, where users can submit questions and get them answered.

A continuous challenge for owners of those Q&A sites is to maintain a decent level of quality in the posted content.
Sites such as StackOverflow make considerable efforts to encourage users: they offer diverse possibilities to score content, and award badges and bonus points so that users spend more energy on carving out the question or crafting a possible answer.

One particularly successful incentive is the ability for the asker to flag one answer to their question as the accepted answer (again, there are incentives for the asker to flag answers as such). This will result in more score points for the author of the flagged answer.

Would it not be very useful to the user to immediately see how good his answer is while he is typing it in? That means, the website would continuously evaluate his work-in-progress answer and provide feedback as to whether the answer shows some signs of being a poor one.
This will encourage the user to put more effort into writing the answer (providing a code example? including an image?), and thus improve the overall system. Let's build such a mechanism in this chapter.

Sketching our roadmap

As we will build a system using real data that is very noisy, this chapter is not for the fainthearted; we will not arrive at the golden solution of a classifier that achieves 100 percent accuracy. Often, even humans disagree about whether an answer was good or not (just look at some of the StackOverflow comments).
Quite the contrary, we will find out that some problems like this one are so hard that we have to adjust our initial goals along the way. We will start with the nearest neighbor approach, find out why it is not very good for the task, switch over to logistic regression, and arrive at a solution that achieves good enough prediction quality, but on a smaller part of the answers. Finally, we will spend some time looking at how to extract the winner to deploy it on the target system.

Learning to classify classy answers

In classification, we want to find the corresponding classes, sometimes also called labels, for given data instances.
To be able to achieve this, we need to answer two questions:

• How should we represent the data instances?
• Which model or structure should our classifier possess?

Tuning the instance

In its simplest form, in our case, the data instance is the text of the answer, and the label would be a binary value indicating whether the asker accepted this text as an answer or not.
Raw text, however, is a very inconvenient representation to process for most machine learning algorithms. They want numbers. It will be our task to extract useful features from the raw text, which the machine learning algorithm can then use to learn the right label for it.
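To make "extracting features" concrete, here is a tiny sketch that turns raw answer text into numbers; the particular features (length of the text, number of code blocks, number of links) are hypothetical choices for illustration, not the exact set we will settle on:

import re

def extract_features(text):
    # Turn raw answer text into a small numeric feature vector.
    num_links = len(re.findall(r'<a ', text))
    num_code_blocks = len(re.findall(r'<pre>', text))
    return [len(text), num_code_blocks, num_links]

# One (text, label) pair; label 1 would mean "accepted answer".
text = "<p>Use <pre>len(s)</pre>, see <a href='...'>the docs</a>.</p>"
print(extract_features(text))  # prints [61, 1, 1]

The machine learning algorithm never sees the text itself, only vectors like this one together with their labels.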
Tuning the classifier

Once we have found or collected enough (text, label) pairs, we can train a classifier. For the underlying structure of the classifier, we have a wide range of possibilities, each of them having advantages and drawbacks. Just to name some of the more prominent choices, there are logistic regression, decision trees, SVMs, and Naïve Bayes. In this chapter, we will contrast the instance-based method from the last chapter, nearest neighbor, with model-based logistic regression.
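As a quick preview of what that contrast looks like in code, here is a minimal scikit-learn sketch with stand-in random data (the real feature matrix will be built from the StackOverflow dump we fetch next):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: 100 instances with 3 features each and a 0/1 label.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)

# Instance-based: "training" just stores the data; prediction looks
# at the k nearest stored instances.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Model-based: training fits one weight per feature; prediction only
# needs those weights.
logreg = LogisticRegression().fit(X, y)

print(knn.predict(X[:5]))
print(logreg.predict(X[:5]))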
Fetching the data

Luckily for us, the team behind StackOverflow provides most of the data behind the StackExchange universe, to which StackOverflow belongs, under a cc-wiki license. At the time of writing this book, the latest data dump can be found at https://archive.org/details/stackexchange. It contains data dumps of all Q&A sites of the StackExchange family. For StackOverflow, you will find multiple files, of which we only need the stackoverflow.com-Posts.7z file, which is 5.2 GB.

After downloading and extracting it, we have around 26 GB of data in XML format, containing all questions and answers as individual row tags within the root tag posts:

<?xml version="1.0" encoding="utf-8"?>
<posts>
...
<row Id="4572748" PostTypeId="2" ParentId="4568987"
  CreationDate="2011-01-01T00:01:03.387" Score="4" ViewCount=""
  Body="&lt;p&gt;IANAL, but &lt;a
  href=&quot;http://support.apple.com/kb/HT2931&quot;
  rel=&quot;nofollow&quot;&gt;this&lt;/a&gt; indicates to me that you
  cannot use the loops in your application:&lt;/p&gt;
  &lt;blockquote&gt; &lt;p&gt;...however, individual audio loops may
  not be commercially or otherwise distributed on a standalone basis,
  nor may they be repackaged in whole or in part as audio samples,
  sound effects or music beds.&quot;&lt;/p&gt;
  &lt;p&gt;So don't worry, you can make commercial music with
  GarageBand, you just can't distribute the loops as
  loops.&lt;/p&gt; &lt;/blockquote&gt;"
  OwnerUserId="203568" LastActivityDate="2011-01-01T00:01:03.387"
  CommentCount="1" />
…
</posts>
Name          Type              Description
----          ----              -----------
Id            Integer           This is a unique identifier.
PostTypeId    Integer           This describes the category of the post. The
                                values interesting to us are the following:
                                • Question
                                • Answer
                                Other values will be ignored.
ParentId      Integer           This is a unique identifier of the question to
                                which this answer belongs (missing for
                                questions).
CreationDate  DateTime          This is the date of submission.
Score         Integer           This is the score of the post.
ViewCount     Integer or empty  This is the number of user views for this post.
Body          String            This is the complete post as encoded HTML text.
OwnerUserId   Id                This is a unique identifier of the poster.
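Because the extracted file is far too big to read into memory in one go, a streaming XML parser is the natural tool. The following is a minimal sketch, assuming the extracted file is called Posts.xml; the file name and the use of xml.etree.ElementTree are assumptions for illustration:

import xml.etree.ElementTree as ET

# Stream over the huge XML file one row tag at a time.
for event, elem in ET.iterparse('Posts.xml'):
    if elem.tag == 'row':
        post_type = int(elem.attrib['PostTypeId'])
        score = int(elem.attrib['Score'])
        if post_type == 2:  # PostTypeId 2 is an answer (see the sample row above)
            question_id = int(elem.attrib['ParentId'])
            # ...extract features from elem.attrib['Body'] here...
        elem.clear()  # free the processed element to keep memory usage flat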