Our task is to sort them according to the music genre into different folders such as jazz, classical, country, pop, rock, and metal.

Fetching the music data

We will use the GTZAN dataset, which is frequently used to benchmark music genre classification tasks.
It is organized into 10 distinct genres, of which we will use only 6 for the sake of simplicity: Classical, Jazz, Country, Pop, Rock, and Metal. The dataset contains the first 30 seconds of 100 songs per genre. We can download the dataset from http://opihi.cs.uvic.ca/sound/genres.tar.gz. The tracks are recorded at 22,050 Hz (22,050 readings per second) mono in the WAV format.
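If you prefer to fetch and unpack the archive from within Python, the following sketch does it with standard library modules only (urllib.request is Python 3's module; on Python 2, which this book otherwise uses, urllib.urlretrieve plays the same role). We also assume that the archive unpacks into a top-level genres folder:

import os
import tarfile
import urllib.request

GENRE_DIR = "genres"  # assumed top-level folder inside the archive

if not os.path.exists(GENRE_DIR):
    # download the archive once; it is a sizable file, so be patient
    url = "http://opihi.cs.uvic.ca/sound/genres.tar.gz"
    archive_fn, _ = urllib.request.urlretrieve(url, "genres.tar.gz")
    with tarfile.open(archive_fn, "r:gz") as tar:
        tar.extractall(".")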
Converting into a WAV format

Sure enough, if we want to test our classifier later on our private MP3 collection, we will not be able to extract much meaning. This is because MP3 is a lossy music compression format that cuts out parts that the human ear cannot perceive. This is nice for storing, because with MP3 you can fit 10 times as many songs on your device. For our endeavor, however, it is not so nice. For classification, we will have an easier time with WAV files, because they can be read directly by the scipy.io.wavfile package. We would, therefore, have to convert our MP3 files if we want to use them with our classifier.

In case you don't have a conversion tool nearby, you might want to check out SoX at http://sox.sourceforge.net. It claims to be the Swiss Army Knife of sound processing, and we agree with this bold claim.
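As a sketch of how a whole folder of MP3s could be converted in one go, assuming the sox binary is on the PATH and was built with MP3 support (the mp3s directory is a placeholder for your own collection):

import glob
import os
import subprocess

for mp3_fn in glob.glob("mp3s/*.mp3"):
    wav_fn = os.path.splitext(mp3_fn)[0] + ".wav"
    # SoX infers the input and output formats from the file extensions
    subprocess.check_call(["sox", mp3_fn, wav_fn])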
One advantage of having all our music files in the WAV format is that they are directly readable by the SciPy toolkit:

>>> sample_rate, X = scipy.io.wavfile.read(wave_filename)

X now contains the samples and sample_rate is the rate at which they were taken. Let us use that information to peek into some music files to get a first impression of what the data looks like.
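For instance, dividing the number of samples by the sample rate gives us the length of the track in seconds; a quick sanity check, with wave_filename standing for any of the GTZAN files:

>>> sample_rate, X = scipy.io.wavfile.read(wave_filename)
>>> print len(X) / float(sample_rate)   # samples divided by samples per second
30.0133333333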
Looking at music

A very convenient way to get a quick impression of what the songs of the diverse genres "look" like is to draw a spectrogram for a set of songs of a genre. A spectrogram is a visual representation of the frequencies that occur in a song. It shows the intensity of the frequencies on the y axis over the specified time intervals on the x axis; that is, the darker the color, the stronger the frequency is in that particular time window of the song.

Matplotlib provides the convenient function specgram() that performs most of the under-the-hood calculation and plotting for us:

>>> import scipy.io.wavfile
>>> from matplotlib.pyplot import specgram
>>> sample_rate, X = scipy.io.wavfile.read(wave_filename)
>>> print sample_rate, X.shape
22050, (661794,)
>>> specgram(X, Fs=sample_rate, xextent=(0,30))

The WAV file we just read in was sampled at a rate of 22,050 Hz and contains 661,794 samples. If we now plot the spectrogram of these first 30 seconds for diverse WAV files, we can see that there are commonalities between songs of the same genre, as shown in the following image. Just glancing at the image, we immediately see the difference in the spectrum between, for example, metal and classical songs.
While metal songs have high intensity over most of the frequency spectrum all the time (they're energetic!), classical songs show a more diverse pattern over time.
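To produce such an overview ourselves, we can plot one representative file per genre. The following is a minimal sketch that assumes the GTZAN layout genres/<genre>/<genre>.00000.wav; the file naming is the dataset's, everything else is our choice:

import os
import scipy.io.wavfile
import matplotlib.pyplot as plt

GENRE_DIR = "genres"
genre_list = ["classical", "jazz", "country", "pop", "rock", "metal"]

fig, axes = plt.subplots(len(genre_list), 1, sharex=True, figsize=(8, 12))
for ax, genre in zip(axes, genre_list):
    # take the first track of each genre as a representative
    fn = os.path.join(GENRE_DIR, genre, genre + ".00000.wav")
    sample_rate, X = scipy.io.wavfile.read(fn)
    ax.specgram(X, Fs=sample_rate, xextent=(0, 30))
    ax.set_ylabel(genre)
axes[-1].set_xlabel("time [s]")
plt.show()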
It should be possible to train a classifier that discriminates at least between Metal and Classical songs with high enough accuracy. Other genre pairs, like Country and Rock, could pose a bigger challenge, though. This looks like a real challenge to us, since we need to discriminate not only between two classes, but between six. We need to be able to discriminate between all of them reasonably well.

Decomposing music into sine wave components

Our plan is to extract individual frequency intensities from the raw sample readings (stored in X earlier) and feed them into a classifier. These frequency intensities can be extracted by applying the so-called fast Fourier transform (FFT). As the theory behind the FFT is outside the scope of this chapter, let us just look at an example to get an intuition of what it accomplishes. Later on, we will treat it as a black box feature extractor.
For example, let us generate two WAV files, sine_a.wav and sine_b.wav, that contain the sound of 400 Hz and 3,000 Hz sine waves respectively. The aforementioned "Swiss Army Knife", SoX, is one way to achieve this:

$ sox --null -r 22050 sine_a.wav synth 0.2 sine 400
$ sox --null -r 22050 sine_b.wav synth 0.2 sine 3000

In the following charts, we have plotted their first 0.008 seconds together with the FFT of each sine wave below it. Not surprisingly, we see a spike at 400 Hz and at 3,000 Hz below the corresponding sine waves.

Now, let us mix them both, giving the 400 Hz sound half the volume of the 3,000 Hz one:

$ sox --combine mix --volume 1 sine_b.wav --volume 0.5 sine_a.wav sine_mix.wav

We see two spikes in the FFT plot of the combined sound, of which the 3,000 Hz spike is almost double the size of the 400 Hz one.
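For the curious, the same toy experiment can be run entirely in NumPy, without SoX or WAV files in between; a small sketch (np.fft.fft is the function that scipy.fft() maps to in the SciPy version used by this book):

import numpy as np

sample_rate = 22050
n_samples = int(0.2 * sample_rate)        # 0.2 seconds, as in the SoX calls
t = np.arange(n_samples) / float(sample_rate)
sine_a = np.sin(2 * np.pi * 400 * t)      # 400 Hz
sine_b = np.sin(2 * np.pi * 3000 * t)     # 3,000 Hz
mix = 0.5 * sine_a + 1.0 * sine_b         # 400 Hz at half the volume

fft_mix = np.abs(np.fft.fft(mix))
freqs = np.fft.fftfreq(len(mix), 1.0 / sample_rate)
# the four strongest components are +-400 Hz and +-3,000 Hz,
# since the FFT of a real-valued signal is symmetric around 0 Hz
print(np.unique(np.abs(freqs[np.argsort(fft_mix)[-4:]])))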
For real music, we quickly see that the FFT does not look as beautiful as in the preceding toy example.

Using FFT to build our first classifier

Nevertheless, we can now create some kind of musical fingerprint of a song using the FFT. If we do that for a couple of songs and manually assign their corresponding genres as labels, we have the training data that we can feed into our first classifier.

Increasing experimentation agility

Before we dive into the classifier training, let us first spend some thoughts on experimentation agility. Although we have the word "fast" in FFT, it is much slower than the creation of the features in our text-based chapters. And because we are still in an experimentation phase, we might want to think about how we could speed up the whole feature creation process.

Of course, the creation of the FFT per file will be the same each time we run the classifier. We could, therefore, cache it and read the cached FFT representation instead of the complete WAV file.
We do this with the create_fft() function, which, in turn, uses scipy.fft() to create the FFT. For the sake of simplicity (and speed!), let us fix the number of FFT components to the first 1,000 in this example. With our current knowledge, we do not know whether these are the most important ones with regard to music genre classification, only that they show the highest intensities in the preceding FFT example. If we later wanted to use more or fewer FFT components, we would of course have to recreate the FFT representations for each sound file.

import os
import scipy
import scipy.io.wavfile

def create_fft(fn):
    sample_rate, X = scipy.io.wavfile.read(fn)
    # keep the magnitudes of the first 1,000 FFT components as features
    fft_features = abs(scipy.fft(X)[:1000])
    base_fn, ext = os.path.splitext(fn)
    data_fn = base_fn + ".fft"
    scipy.save(data_fn, fft_features)

We save the data using NumPy's save() function, which always appends .npy to the filename. We only have to do this once for every WAV file needed for training or predicting.
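With create_fft() in place, caching the whole dataset is a single loop; a short sketch, again assuming the genres/<genre>/*.wav layout:

import glob
import os

GENRE_DIR = "genres"
genre_list = ["classical", "jazz", "country", "pop", "rock", "metal"]

for genre in genre_list:
    for wav_fn in sorted(glob.glob(os.path.join(GENRE_DIR, genre, "*.wav"))):
        create_fft(wav_fn)  # writes, e.g., classical.00000.fft.npy next to the WAV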
The corresponding FFT reading function is read_fft():

import glob
import numpy as np

def read_fft(genre_list, base_dir=GENRE_DIR):
    X = []
    y = []
    for label, genre in enumerate(genre_list):
        genre_dir = os.path.join(base_dir, genre, "*.fft.npy")
        file_list = glob.glob(genre_dir)
        for fn in file_list:
            fft_features = scipy.load(fn)
            X.append(fft_features[:1000])
            y.append(label)
    return np.array(X), np.array(y)

In our scrambled music directory, we expect the following music genres:

genre_list = ["classical", "jazz", "country", "pop", "rock", "metal"]

Training the classifier

Let us use the logistic regression classifier, which has already served us well in Chapter 6, Classification II - Sentiment Analysis. The added difficulty is that we are now faced with a multiclass classification problem, whereas up to now we had to discriminate only between two classes.
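Putting the pieces together, a first training round could look like the following sketch. train_test_split lives in sklearn.model_selection in current scikit-learn releases (older versions shipped it in sklearn.cross_validation), and the split ratio is our choice:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = read_fft(genre_list)
# hold out a quarter of the songs to estimate how well we generalize
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression()  # scikit-learn extends it to multiple classes automatically
clf.fit(X_train, y_train)
print("Accuracy: %.3f" % clf.score(X_test, y_test))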
One aspect that may be surprising when first switching from binary to multiclass classification is the evaluation of accuracy rates. In binary classification problems, we have learned that an accuracy of 50 percent is the worst case, as it could have been achieved by mere random guessing. In multiclass settings, 50 percent can already be very good. With our six genres, for instance, random guessing would result in only 16.7 percent accuracy (assuming equal class sizes).

Using a confusion matrix to measure accuracy in multiclass problems

With multiclass problems, we should not only be interested in how well we manage to correctly classify the genres.
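We should also look at which genres we confuse with each other, which is exactly what a confusion matrix captures. scikit-learn already ships a helper for this; a minimal sketch, reusing clf, X_test, and y_test from the training sketch above:

from sklearn.metrics import confusion_matrix

y_pred = clf.predict(X_test)
# rows are the true genres (in genre_list order), columns the predicted ones;
# a perfect classifier would have nonzero entries only on the diagonal
print(confusion_matrix(y_test, y_pred))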