Building machine learning systems with Python (file no. 779436), page 35. StudIzba, 2017-12-26


Chapter 9: Classification – Music Genre Classification

In addition, we should also look into which genres we actually confuse with each other. This can be done with the so-called confusion matrix, as shown in the following:

    >>> from sklearn.metrics import confusion_matrix
    >>> cm = confusion_matrix(y_test, y_pred)
    >>> print(cm)
    [[26  1  2  0  0  2]
     [ 4  7  5  0  5  3]
     [ 1  2 14  2  8  3]
     [ 5  4  7  3  7  5]
     [ 0  0 10  2 10 12]
     [ 1  0  4  0 13 12]]

This prints the distribution of labels that the classifier predicted for the test set for every genre.

The diagonal represents the correct classifications. Since we have six genres, we have a six-by-six matrix. The first row in the matrix says that for 31 Classical songs (sum of the first row), the classifier predicted 26 to belong to the genre Classical, 1 to be a Jazz song, 2 to belong to the Country genre, and 2 to be Metal songs.
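The row-wise reading described above can be checked with a few lines of plain Python. This is only a sketch: the two rows are the Classical and Jazz rows quoted in the text, and the helper name `fraction_correct` is made up for illustration.

```python
def fraction_correct(row, true_index):
    """Share of the songs in one true-class row that landed on the diagonal."""
    total = sum(row)
    return row[true_index] / total


# First two rows of the confusion matrix (first column Classical, second Jazz).
classical_row = [26, 1, 2, 0, 0, 2]  # true class: Classical (index 0)
jazz_row = [4, 7, 5, 0, 5, 3]        # true class: Jazz (index 1)

classical_share = fraction_correct(classical_row, 0)
jazz_share = fraction_correct(jazz_row, 1)

print(f"Classical: {classical_share:.2f}")  # 26 of 31 songs
print(f"Jazz: {jazz_share:.2f}")            # 7 of 24 songs
```

Dividing each diagonal entry by its row sum is what makes rows with different numbers of songs comparable.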

In the first row, we see that out of (26+1+2+2)=31 songs, 26 have been correctly classified as Classical and 5 were misclassifications. This is actually not that bad. The second row is more sobering: only 7 out of 24 Jazz songs have been correctly classified, that is, only 29 percent.

Of course, we follow the train/test split setup from the previous chapters, so that we actually have to record the confusion matrices per cross-validation fold. We have to average and normalize later on, so that we have a range between 0 (total failure) and 1 (everything classified correctly).

A graphical visualization is often much easier to read than NumPy arrays. The matshow() function of matplotlib is our friend:

    from matplotlib import pylab

    def plot_confusion_matrix(cm, genre_list, name, title):
        # cm is expected to be normalized to the range [0, 1];
        # name is not used inside this function
        pylab.clf()
        pylab.matshow(cm, fignum=False, cmap='Blues', vmin=0, vmax=1.0)
        # gca() returns the axes matshow() drew on; in recent matplotlib
        # versions a bare axes() call would create a new, empty one
        ax = pylab.gca()
        ax.set_xticks(range(len(genre_list)))
        ax.set_xticklabels(genre_list)
        ax.xaxis.set_ticks_position("bottom")
        ax.set_yticks(range(len(genre_list)))
        ax.set_yticklabels(genre_list)
        pylab.title(title)
        pylab.colorbar()
        pylab.grid(False)
        pylab.xlabel('Predicted class')
        pylab.ylabel('True class')
        pylab.show()

When you create a confusion matrix, be sure to choose a color map (the cmap parameter of matshow()) with an appropriate color ordering so that it is immediately visible what a lighter or darker color means. Especially discouraged for these kinds of graphs are rainbow color maps, such as jet (matplotlib's default before version 2.0) or even the Paired color map.

The final graph looks like the following:

[Figure: confusion matrix of the FFT-based classifier]

For a perfect classifier, we would have expected a diagonal of dark squares from the upper-left corner to the lower-right one, and light colors for the remaining area.
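The averaging and normalization step mentioned earlier in this passage, which the vmin=0, vmax=1.0 range of the plot relies on, is not shown in the text. One possible way to do it is sketched below. The function name `average_and_normalize` and the two small fold matrices are invented for illustration, and row-wise normalization is an assumption rather than necessarily what the book does.

```python
import numpy as np


def average_and_normalize(fold_cms):
    """Average per-fold confusion matrices, then scale each row to sum to 1."""
    mean_cm = np.mean(np.asarray(fold_cms, dtype=float), axis=0)
    row_sums = mean_cm.sum(axis=1, keepdims=True)
    # Guard against empty rows (a class without test songs in any fold).
    row_sums[row_sums == 0] = 1.0
    return mean_cm / row_sums


# Two made-up 2x2 fold matrices (rows: true class, columns: predicted class).
fold_cms = [
    [[8, 2], [5, 5]],
    [[6, 4], [3, 7]],
]

cm_norm = average_and_normalize(fold_cms)
print(cm_norm)
```

After this step every row sums to 1, so the diagonal holds the fraction of each true class that was predicted correctly.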

In the preceding graph, we immediately see that our FFT-based classifier is far away from being perfect. It only predicts Classical songs correctly (dark square). For Rock, for instance, it preferred the label Metal most of the time.

Obviously, using FFT points in the right direction (the Classical genre was not that bad), but is not enough to get a decent classifier. Surely, we can play with the number of FFT components (fixed to 1,000). But before we dive into parameter tuning, we should do our research. There we find that FFT is indeed not a bad feature for genre classification; it is just not refined enough. Shortly, we will see how we can boost our classification performance by using a processed version of it.

Before we do that, however, we will learn another method of measuring classification performance.

An alternative way to measure classifier performance using receiver-operator characteristics

We already learned that measuring accuracy is not enough to truly evaluate a classifier.

Instead, we relied on precision-recall (P/R) curves to get a deeper understanding of how our classifiers perform.

There is a sister of P/R curves, the receiver operating characteristic (ROC) curve, also called receiver-operator characteristics, which measures similar aspects of the classifier's performance, but provides another view of the classification performance. The key difference is that P/R curves are more suitable for tasks where the positive class is much more interesting than the negative one, or where the number of positive examples is much smaller than the number of negative ones. Information retrieval and fraud detection are typical application areas. On the other hand, ROC curves provide a better picture of how well the classifier behaves in general.

To better understand the differences, let us consider the performance of the previously trained classifier in classifying Country songs correctly, as shown in the following graph:

[Figure: P/R curve (left) and ROC curve (right) for the Country genre]

On the left, we see the P/R curve.

For an ideal classifier, we would have the curve going from the top left directly to the top right and then to the bottom right, resulting in an area under curve (AUC) of 1.0.

The right graph depicts the corresponding ROC curve. It plots the True Positive Rate over the False Positive Rate. There, an ideal classifier would have a curve going from the lower left to the top left, and then to the top right. A random classifier would be a straight line from the lower left to the upper right, as shown by the dashed line, having an AUC of 0.5. In a P/R plot, by contrast, a random classifier's precision sits at the fraction of positive examples, so its AUC depends on the class balance. Therefore, we cannot compare an AUC of a P/R curve with that of an ROC curve.

Independent of the curve, when comparing two different classifiers on the same dataset, a classifier whose ROC curve dominates the other's also dominates in P/R space, and vice versa. A higher AUC under one curve therefore usually, though not strictly always, goes along with a higher AUC under the other, so we rarely bother to generate both.
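To make the two reference values for the ROC curve concrete (1.0 for an ideal classifier, 0.5 for an uninformative one), here is a from-scratch sketch in plain Python. It is not how scikit-learn implements roc_curve(); the function names `roc_points` and `area_under` and the toy scores are assumptions made for this illustration.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by lowering the threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Sort by descending score; each distinct score acts as one threshold.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all examples sharing this score (ties move together).
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


y_true = [1, 1, 0, 1, 0, 0]
perfect = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]        # positives outrank all negatives
uninformative = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # same score for every example

auc_perfect = area_under(roc_points(y_true, perfect))
auc_uninformative = area_under(roc_points(y_true, uninformative))
print(auc_perfect, auc_uninformative)
```

The perfect ranking traces the lower-left, top-left, top-right path described above, while the constant scores collapse to the diagonal.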

More on this can be found in the very insightful paper The Relationship Between Precision-Recall and ROC Curves by Davis and Goadrich (ICML, 2006).

The following table summarizes the differences between P/R and ROC curves (TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives):

            x axis                                    y axis
    P/R     Recall = TP / (TP + FN)                   Precision = TP / (TP + FP)
    ROC     False Positive Rate = FP / (FP + TN)      True Positive Rate = TP / (TP + FN)

Looking at the definitions of both curves' x and y axes, we see that the True Positive Rate on the ROC curve's y axis is the same as Recall on the P/R graph's x axis.

The False Positive Rate measures the fraction of actual negative examples that were falsely identified as positive ones, giving 0 in the perfect case (no false positives) and approaching 1 as more negatives are misclassified. Contrast this with precision, which is computed over the predictions instead: it is the fraction of examples classified as positive that really are positive.

Going forward, let us use ROC curves to measure our classifiers' performance to get a better feeling for it.
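The axis definitions can be tried out on a small example. The sketch below uses made-up counts for a heavily imbalanced problem (10 positives, 990 negatives). The function name `rates` is hypothetical.

```python
def rates(tp, fp, tn, fn):
    """Compute the axis quantities of the P/R and ROC curves from raw counts."""
    return {
        "recall_tpr": tp / (tp + fn),  # x axis of P/R, y axis of ROC
        "fpr": fp / (fp + tn),         # x axis of ROC
        "precision": tp / (tp + fp),   # y axis of P/R
    }


# Made-up counts: 10 actual positives and 990 actual negatives.
r = rates(tp=8, fp=40, tn=950, fn=2)

for name, value in r.items():
    print(f"{name}: {value:.3f}")
```

With these counts the False Positive Rate looks harmless (about 4 percent) while precision is poor (about 17 percent), which illustrates why P/R curves are preferred when positive examples are rare.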

The only challenge for our multiclass problem is that both ROC and P/R curves assume a binary classification problem. For our purpose, let us, therefore, create one chart per genre that shows how the classifier performed in a one-versus-rest classification:

    import numpy as np
    from sklearn.metrics import roc_curve

    y_pred = clf.predict(X_test)
    # class probabilities do not depend on the label, so compute them once
    proba = clf.predict_proba(X_test)

    for label in labels:
        # one-versus-rest target: 1 for the current genre, 0 for all others
        y_label_test = np.asarray(y_test == label, dtype=int)
        proba_label = proba[:, label]
        # calculate false and true positive rates as well as the
        # ROC thresholds
        fpr, tpr, roc_thres = roc_curve(y_label_test, proba_label)
        # plot tpr over fpr ...

The outcomes are the following six ROC plots. As we have already found out, our first version of a classifier only performs well on Classical songs.
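What the one-versus-rest step does to the labels and probabilities can be seen on a tiny example. This sketch uses plain lists instead of NumPy arrays. The function name `one_vs_rest` and the numbers are invented for illustration.

```python
def one_vs_rest(y_true, proba, label):
    """Binarize the labels for one class and pick that class's probability column."""
    y_binary = [1 if y == label else 0 for y in y_true]
    label_scores = [row[label] for row in proba]
    return y_binary, label_scores


# Four test songs, three classes; each row of proba sums to 1.
y_test = [0, 2, 1, 2]
proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
]

y_binary, label_scores = one_vs_rest(y_test, proba, label=2)
print(y_binary)      # [0, 1, 0, 1]
print(label_scores)  # [0.1, 0.6, 0.3, 0.4]
```

The two resulting sequences are exactly the kind of binary target and score input that a ROC computation expects.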

Looking at the individual ROC curves, however, tells us that we are really underperforming for most of the other genres. Only Jazz and Country provide some hope. The remaining genres are clearly not usable.

[Figures: the six ROC plots, one per genre]

Improving classification performance with Mel Frequency Cepstral Coefficients

We already learned that FFT is pointing in the right direction, but in itself it will not be enough to finally arrive at a classifier that successfully manages to organize our scrambled directory of songs of diverse music genres into individual genre directories. We need a somewhat more advanced version of it.

At this point, it is always wise to acknowledge that we have to do more research. Other people might have had similar challenges in the past and already have found out new ways that might also help us.

And, indeed, there is even a yearly conference dedicated to music information retrieval, including music genre classification, organized by the International Society for Music Information Retrieval (ISMIR). Apparently, Automatic Music Genre Classification (AMGC) is an established subfield of Music Information Retrieval. Glancing over some of the AMGC papers, we see that there is a bunch of work targeting automatic genre classification that might help us.

One technique that seems to be successfully applied in many of those works is called Mel Frequency Cepstral Coefficients (MFCC). The Mel Frequency Cepstrum (MFC) encodes the power spectrum of a sound, which is the power of each frequency the sound contains. Roughly speaking, it is calculated as the Fourier transform of the logarithm of the signal's spectrum; for the mel variant, the spectrum is first mapped onto the mel scale and a discrete cosine transform is used.
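The "transform of the log spectrum" idea can be sketched with NumPy alone. Note the assumptions: this computes the plain real cepstrum (using the inverse FFT, as is common), not the mel-frequency variant, so it has no mel filter bank and no discrete cosine transform. The function name `real_cepstrum` and the `eps` parameter are made up for this illustration.

```python
import numpy as np


def real_cepstrum(signal, eps=1e-12):
    """Inverse FFT of the log magnitude spectrum of a 1-D signal."""
    spectrum = np.fft.fft(signal)
    # eps avoids taking the logarithm of zero for empty frequency bins.
    log_magnitude = np.log(np.abs(spectrum) + eps)
    return np.fft.ifft(log_magnitude).real


# A unit impulse has a flat magnitude spectrum (all ones), so its log
# spectrum is zero everywhere and the cepstrum vanishes.
impulse = np.array([1.0, 0.0, 0.0, 0.0])
cepstrum = real_cepstrum(impulse)
print(cepstrum)
```

A library routine for MFCCs adds framing, windowing, and the mel mapping on top of this basic idea.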

If that sounds too complicated, simply remember that the name "cepstrum" originates from "spectrum" having the first four characters reversed. MFC has been successfully used in speech and speaker recognition. Let's see whether it also works in our case.

We are in a lucky situation in that someone else already needed exactly this and published an implementation of it as the Talkbox SciKit. We can install it from https://pypi.python.org/pypi/scikits.talkbox. Afterward, we can call the mfcc() function, which calculates the MFC coefficients, as follows:

    >>> import scipy.io.wavfile
    >>> from scikits.talkbox.features import mfcc
    >>> sample_rate, X = scipy.io.wavfile.read(fn)
    >>> ceps, mspec, spec = mfcc(X)
    >>> print(ceps.shape)
    (4135, 13)

The data we would want to feed into our classifier is stored in ceps, which contains 13 coefficients (default value for the nceps parameter of mfcc()) for each of the 4,135 frames for the song with the filename fn.

Taking all of the data would overwhelm our classifier. What we could do, instead, is to average each coefficient over all the frames. Assuming that the start and end of each song are possibly less genre specific than the middle part of it, we also ignore the first and last 10 percent:

    x = np.mean(ceps[int(num_ceps*0.1):int(num_ceps*0.9)], axis=0)

Admittedly, the benchmark dataset we will be using contains only the first 30 seconds of each song, so that we would not need to cut off the last 10 percent. We do it, nevertheless, so that our code works on other datasets as well, which are most likely not truncated.

Similar to our work with FFT, we certainly would also want to cache the once generated MFCC features and read them instead of recreating them each time we train our classifier.

This leads to the following code:

    import glob
    import os

    import numpy as np
    import scipy.io.wavfile
    from scikits.talkbox.features import mfcc

    # GENRE_DIR is the dataset base directory, defined elsewhere

    def write_ceps(ceps, fn):
        base_fn, ext = os.path.splitext(fn)
        data_fn = base_fn + ".ceps"
        # np.save() appends ".npy", so the file on disk ends in ".ceps.npy"
        np.save(data_fn, ceps)
        print("Written to %s" % data_fn)

    def create_ceps(fn):
        sample_rate, X = scipy.io.wavfile.read(fn)
        ceps, mspec, spec = mfcc(X)
        write_ceps(ceps, fn)

    def read_ceps(genre_list, base_dir=GENRE_DIR):
        X, y = [], []
        for label, genre in enumerate(genre_list):
            for fn in glob.glob(os.path.join(base_dir, genre, "*.ceps.npy")):
                ceps = np.load(fn)
                num_ceps = len(ceps)
                # average the middle 80 percent of the frames per coefficient
                X.append(np.mean(
                    ceps[int(num_ceps*0.1):int(num_ceps*0.9)], axis=0))
                y.append(label)
        return np.array(X), np.array(y)

We get the following promising results with a classifier that uses only 13 features per song:

[Figure: results of the MFCC-based classifier]

The classification performances for all genres have improved.
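The per-coefficient averaging used in read_ceps() can be tried without any audio files. The following sketch applies the same slicing to a synthetic array standing in for the output of mfcc(). The function name `mean_middle_frames` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def mean_middle_frames(ceps):
    """Average each coefficient over the middle 80 percent of the frames."""
    num_ceps = len(ceps)
    start = int(num_ceps * 0.1)  # skip the first 10 percent of frames
    end = int(num_ceps * 0.9)    # skip the last 10 percent of frames
    return np.mean(ceps[start:end], axis=0)


# Synthetic stand-in for MFCC output: 100 frames with 13 coefficients each.
ceps = np.arange(100 * 13, dtype=float).reshape(100, 13)

x = mean_middle_frames(ceps)
print(x.shape)  # (13,)
```

Whatever the number of frames in a song, the result is a fixed-length vector with one value per coefficient, which is what lets songs of different lengths share one feature space.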
