Building machine learning systems with Python (779436), page 14


Clustering – Finding Related Posts

After some iterations, the movements will fall below a threshold and we consider clustering to be converged.

Let's play this through with a toy example of posts containing only two words. Each point in the following chart represents one document:

[Figure: chart of the toy documents]

After running one iteration of K-means, that is, taking any two vectors as starting points, assigning the labels to the rest, and updating the cluster centers to be the center point of all points in that cluster, we get the following clustering:

[Figure: clustering after iteration 1]

Because the cluster centers moved, we have to reassign the cluster labels and recalculate the cluster centers.
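The assign-and-update loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation SciKit uses: the function name kmeans, its parameters (tol, max_iter, seed), and the six toy points are hypothetical, and the stopping rule simply checks whether the largest center movement fell below tol.

```python
import math
import random


def kmeans(points, k, tol=1e-4, max_iter=100, seed=0):
    """Toy K-means on a list of equal-length numeric tuples (hypothetical helper)."""
    rng = random.Random(seed)
    # Take k of the points as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: label each point with the index of its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of the points assigned to it.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members)
                                         for x in zip(*members)))
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[c])
        # Largest distance that any center moved in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            # Movements fell below the threshold: treat as converged.
            break
    return centers, labels


# Made-up toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

On data this cleanly separated, the first three points should end up sharing one label and the last three the other. Which group gets label 0 depends on the random starting points.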

After iteration 2, we get the following clustering:

[Figure: clustering after iteration 2]

The arrows show the movements of the cluster centers. After five iterations in this example, the cluster centers don't move noticeably any more (SciKit's tolerance threshold is 0.0001 by default).

After the clustering has settled, we just need to note down the cluster centers and their identity. Each new document that comes in then has to be vectorized and compared against all cluster centers. The cluster center with the smallest distance to our new post vector determines the cluster we assign to the new post.

Getting test data to evaluate our ideas on

In order to test clustering, let's move away from the toy text examples and find a dataset that resembles the data we are expecting in the future so that we can test our approach.

For our purpose, we need documents about technical topics that are already grouped together so that we can check whether our algorithm works as expected when we apply it later to the posts we hope to receive.

One standard dataset in machine learning is the 20newsgroup dataset, which contains 18,826 posts from 20 different newsgroups. Among the groups' topics are technical ones such as comp.sys.mac.hardware or sci.crypt, as well as more politics- and religion-related ones such as talk.politics.guns or soc.religion.christian.

We will restrict ourselves to the technical groups. If we treat each newsgroup as one cluster, we can nicely test whether our approach of finding related posts works.

The dataset can be downloaded from http://people.csail.mit.edu/jrennie/20Newsgroups.

Much more convenient, however, is to download it from MLComp at http://mlcomp.org/datasets/379 (free registration required). SciKit already contains custom loaders for that dataset and rewards you with very convenient data loading options.

The dataset comes in the form of a ZIP file, dataset-379-20news-18828_WJQIG.zip, which we have to unzip to get the directory 379, which contains the datasets.

We also have to notify SciKit about the path containing that data directory. The directory contains a metadata file and three directories: test, train, and raw. The test and train directories split the whole dataset into 60 percent training and 40 percent testing posts. If you go this route, you either need to set the environment variable MLCOMP_DATASETS_HOME or specify the path directly with the mlcomp_root parameter when loading the dataset.

http://mlcomp.org is a website for comparing machine learning programs on diverse datasets. It serves two purposes: finding the right dataset to tune your machine learning program, and exploring how other people use a particular dataset. For instance, you can see how well other people's algorithms performed on particular datasets and compare against them.

For convenience, the sklearn.datasets module also contains the fetch_20newsgroups function, which automatically downloads the data behind the scenes:

>>> import sklearn.datasets
>>> all_data = sklearn.datasets.fetch_20newsgroups(subset='all')
>>> print(len(all_data.filenames))
18846
>>> print(all_data.target_names)
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles',
 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt',
 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian',
 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc',
 'talk.religion.misc']

We can choose between training and test sets:

>>> train_data = sklearn.datasets.fetch_20newsgroups(subset='train')
>>> print(len(train_data.filenames))
11314
>>> test_data = sklearn.datasets.fetch_20newsgroups(subset='test')
>>> print(len(test_data.filenames))
7532

For simplicity's sake, we will restrict ourselves to only some newsgroups so that the overall experimentation cycle is shorter.

We can achieve this with the categories parameter:

>>> groups = ['comp.graphics', 'comp.os.ms-windows.misc',
...           'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
...           'comp.windows.x', 'sci.space']
>>> train_data = sklearn.datasets.fetch_20newsgroups(subset='train',
...                                                  categories=groups)
>>> print(len(train_data.filenames))
3529
>>> test_data = sklearn.datasets.fetch_20newsgroups(subset='test',
...                                                 categories=groups)
>>> print(len(test_data.filenames))
2349

Clustering posts

You will have already noticed one thing: real data is noisy. The newsgroup dataset is no exception. It even contains invalid characters that will result in a UnicodeDecodeError.

We have to tell the vectorizer to ignore them:

>>> vectorizer = StemmedTfidfVectorizer(min_df=10, max_df=0.5,
...                                     stop_words='english',
...                                     decode_error='ignore')
>>> vectorized = vectorizer.fit_transform(train_data.data)
>>> num_samples, num_features = vectorized.shape
>>> print("#samples: %d, #features: %d" % (num_samples, num_features))
#samples: 3529, #features: 4712

We now have a pool of 3,529 posts and have extracted for each of them a feature vector of 4,712 dimensions.
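To make the vectorizer settings above more concrete, here is a small standard-library sketch of two of the ideas involved: ignoring undecodable bytes, and pruning terms by document frequency. It is only an approximation under stated assumptions. The helper prune_vocabulary and the sample posts are hypothetical, and the thresholds follow our reading of min_df as an absolute document count and max_df as a fraction of all documents. The real vectorizer additionally tokenizes, stems, removes stop words, and computes TF-IDF weights.

```python
from collections import Counter

# A made-up post containing a byte (0xff) that is not valid UTF-8.
raw_post = b"Hard disk fails to boot \xff after formatting"

try:
    raw_post.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    # Strict decoding fails on the invalid byte.
    strict_ok = False

# errors="ignore" silently drops the undecodable byte instead of raising.
clean_post = raw_post.decode("utf-8", errors="ignore")


def prune_vocabulary(docs, min_df, max_df):
    """Hypothetical helper: keep terms that occur in at least min_df documents
    (absolute count) and in at most a max_df fraction of all documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term at most once per document.
        doc_freq.update(set(doc.lower().split()))
    n_docs = len(docs)
    return sorted(term for term, df in doc_freq.items()
                  if df >= min_df and df / n_docs <= max_df)


docs = [
    "the disk does not boot",
    "the disk drive is noisy",
    "the graphics card is new",
    "the space shuttle launch",
]
vocabulary = prune_vocabulary(docs, min_df=2, max_df=0.5)
print(strict_ok, repr(clean_post))
print(vocabulary)
```

In this toy corpus, "the" appears in every document and is dropped as too common, while words that occur in only one document are dropped as too rare. The same kind of pruning is why the feature count stays far below the number of distinct words in the posts.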

These feature vectors are what K-means takes as input. We will fix the number of clusters to 50 for this chapter and hope you are curious enough to try out different values as an exercise.

>>> num_clusters = 50
>>> from sklearn.cluster import KMeans
>>> km = KMeans(n_clusters=num_clusters, init='random', n_init=1,
...             verbose=1, random_state=3)
>>> km.fit(vectorized)

That's it. We provided a random state just so that you can get the same results. In real-world applications, you will not do this. After fitting, we can get the clustering information out of the members of km. For every vectorized post that has been fit, there is a corresponding integer label in km.labels_:

>>> print(km.labels_)
[48 23 31 ..., 6 2 22]
>>> print(km.labels_.shape)
(3529,)

The cluster centers can be accessed via km.cluster_centers_.

In the next section, we will see how we can assign a cluster to a newly arriving post using km.predict.

Solving our initial challenge

We will now put everything together and demonstrate our system for the following new post that we assign to the new_post variable:

"Disk drive problems. Hi, I have a problem with my hard disk.
After 1 year it is working only sporadically now.
I tried to format it, but now it doesn't boot any more.
Any ideas? Thanks."

As you learned earlier, you will first have to vectorize this post before you predict its label:

>>> new_post_vec = vectorizer.transform([new_post])
>>> new_post_label = km.predict(new_post_vec)[0]

Now that we have the clustering, we do not need to compare new_post_vec to all post vectors.

Instead, we can focus only on the posts of the same cluster. Let's fetch their indices in the original data set:

>>> similar_indices = (km.labels_ == new_post_label).nonzero()[0]

The comparison in the parentheses results in a Boolean array, and nonzero converts that array into a smaller array containing the indices of the True elements.

Using similar_indices, we then simply have to build a list of posts together with their similarity scores:

>>> import scipy as sp
>>> import scipy.linalg  # makes sp.linalg available
>>> similar = []
>>> for i in similar_indices:
...     # Euclidean distance between the new post and post i
...     dist = sp.linalg.norm((new_post_vec - vectorized[i]).toarray())
...     similar.append((dist, train_data.data[i]))
>>> similar = sorted(similar)
>>> print(len(similar))
131

We found 131 posts in the cluster of our post. To give the user a quick idea of what kind of similar posts are available, we can now present the most similar post (show_at_1) and two less similar but still related ones, all from the same cluster.

>>> show_at_1 = similar[0]
>>> show_at_2 = similar[int(len(similar)/10)]
>>> show_at_3 = similar[int(len(similar)/2)]

The following table shows the posts together with their similarity values (the score is a distance, so smaller values mean more similar posts):

Position | Similarity | Excerpt from post

1 | 1.038 | BOOT PROBLEM with IDE controller
Hi, I've got a Multi I/O card (IDE controller + serial/parallel interface) and two floppy drives (5 1/4, 3 1/2) and a Quantum ProDrive 80AT connected to it. I was able to format the hard disk, but I could not boot from it. I can boot from drive A: (which disk drive does not matter) but if I remove the disk from drive A and press the reset switch, the LED of drive A: continues to glow, and the hard disk is not accessed at all. I guess this must be a problem of either the Multi I/O card or floppy disk drive settings (jumper configuration?) Does someone have any hint what could be the reason for it. […]

2 | 1.150 | Booting from B drive
I have a 5 1/4" drive as drive A. How can I make the system boot from my 3 1/2" B drive? (Optimally, the computer would be able to boot: from either A or B, checking them in order for a bootable disk. But: if I have to switch cables around and simply switch the drives so that: it can't boot 5 1/4" disks, that's OK. Also, boot_b won't do the trick for me. […] […]

3 | 1.280 | IBM PS/1 vs TEAC FD
Hello, I already tried our national news group without success. I tried to replace a friend's original IBM floppy disk in his PS/1-PC with a normal TEAC drive. I already identified the power supply on pins 3 (5V) and 6 (12V), shorted pin 6 (5.25"/3.5" switch) and inserted pullup resistors (2K2) on pins 8, 26, 28, 30, and 34. The computer doesn't complain about a missing FD, but the FD's light stays on all the time. The drive spins up o.k. when I insert a disk, but I can't access it. The TEAC works fine in a normal PC. Are there any points I missed? […] […]

It is interesting how the posts reflect the similarity measurement score. The first post contains all the salient words from our new post. The second also revolves around booting problems, but is about floppy disks and not hard disks. Finally, the third is neither about hard disks nor about booting problems. Still, of all the posts, we would say that they belong to the same domain as the new post.

Another look at noise

We should not expect a perfect clustering in the sense that posts from the same newsgroup (for example, comp.graphics) are also clustered together.

