Thus, a word's waveform will not be identical every time it is spoken. However, by using clustering on these waveforms, we can hope to recover most of the structure so that all the instances of a given word are in the same cluster. Even if the process is not perfect (and it will not be), we can still talk of grouping the waveforms into words.

We perform the same operation with image data: we cluster together similar looking regions from all images and call these visual words.

The number of words used does not usually have a big impact on the final performance of the algorithm.
Naturally, if the number is extremely small (10 or 20, when you have a few thousand images), then the overall system will not perform well. Similarly, if you have too many words (many more than the number of images, for example), the system will also not perform well. However, in between these two extremes, there is often a very large plateau, where you can choose the number of words without a big impact on the result.
As a rule of thumb, using a value such as 256, 512, or 1,024 if you have very many images should give you a good result.

We are going to start by computing the features as follows:

>>> alldescriptors = []
>>> for im in images:
...     im = mh.imread(im, as_grey=True)
...     im = im.astype(np.uint8)
...     alldescriptors.append(surf.dense(im, spacing=16))
>>> # get all descriptors into a single array
>>> concatenated = np.concatenate(alldescriptors)
>>> print('Number of descriptors: {}'.format(
...     len(concatenated)))
Number of descriptors: 2489031

This results in over 2 million local descriptors.
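Note that these listings rely on names set up earlier in the chapter: images is the list of image filenames, and features, labels, clf, and cv were built in the previous sections. The imports they assume are roughly the following (a sketch, not a verbatim excerpt from the chapter):

>>> import numpy as np
>>> import mahotas as mh
>>> from mahotas.features import surf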
Now, we use k-means clustering to obtain the centroids. We could use all the descriptors, but we are going to use a smaller sample for extra speed, as shown in the following:

>>> # use only every 64th vector
>>> concatenated = concatenated[::64]
>>> from sklearn.cluster import KMeans
>>> k = 256
>>> km = KMeans(k)
>>> km.fit(concatenated)

After this is done (which will take a while), the km object contains information about the centroids.
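If the clustering step is too slow, scikit-learn's MiniBatchKMeans is a commonly used, faster approximate alternative with the same fit/predict interface; a minimal sketch, assuming concatenated and k as above (not the book's listing):

>>> # faster, approximate drop-in for KMeans
>>> from sklearn.cluster import MiniBatchKMeans
>>> km = MiniBatchKMeans(n_clusters=k)
>>> km.fit(concatenated)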
We now go back to the descriptors and build feature vectors as follows:

>>> sfeatures = []
>>> for d in alldescriptors:
...     c = km.predict(d)
...     sfeatures.append(
...         np.array([np.sum(c == ci) for ci in range(k)]))
>>> # build single array and convert to float
>>> sfeatures = np.array(sfeatures, dtype=float)

The end result of this loop is that sfeatures[fi, fj] is the number of times that the image fi contains the element fj.
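As an aside, the same per-image counts can also be obtained with np.histogram; a minimal sketch of that variant, assuming km, k, and alldescriptors as above (our sketch, not the book's listing):

>>> sfeatures = []
>>> for d in alldescriptors:
...     c = km.predict(d)
...     # one bin per cluster: integer bin edges 0, 1, ..., k
...     counts, _ = np.histogram(c, bins=np.arange(k + 1))
...     sfeatures.append(counts)
>>> sfeatures = np.array(sfeatures, dtype=float)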
np.histogram is faster than the explicit loop, but getting its arguments just right is a little tricky. We convert the result to floating point as we do not want integer arithmetic (with its rounding semantics).

The result is that each image is now represented by a single array of features, all of the same size (the number of clusters, in our case 256). Therefore, we can use our standard classification methods as follows:

>>> scores = cross_validation.cross_val_score(
...     clf, sfeatures, labels, cv=cv)
>>> print('Accuracy: {:.1%}'.format(scores.mean()))
Accuracy: 62.6%

This is worse than before! Have we gained nothing?

In fact, we have, as we can combine all the features together to obtain 76.1 percent accuracy, as follows:

>>> combined = np.hstack([features, sfeatures])
>>> scores = cross_validation.cross_val_score(
...     clf, combined, labels, cv=cv)
>>> print('Accuracy: {:.1%}'.format(scores.mean()))
Accuracy: 76.1%

This is the best result we have, better than any single feature set.
This is due to the fact that the local SURF features are different enough to add new information to the global image features we had before and improve the combined result.

Summary

We learned the classical feature-based approach to handling images in a machine learning context: by converting from a million pixels to a few numeric features, we are able to directly use a logistic regression classifier.
All of the technologies that we learned in the other chapters suddenly become directly applicable to image problems. We saw one example in the use of image features to find similar images in a dataset.

We also learned how to use local features, in a bag of words model, for classification. This is a very modern approach to computer vision and achieves good results while being robust to many irrelevant aspects of the image, such as illumination, and even uneven illumination in the same image. We also used clustering as a useful intermediate step in classification rather than as an end in itself.

We focused on mahotas, which is one of the major computer vision libraries in Python. There are others that are equally well maintained.
Skimage (scikit-image) is similar in spirit, but has a different set of features. OpenCV is a very good C++ library with a Python interface. All of these can work with NumPy arrays and you can mix and match functions from different libraries to build complex computer vision pipelines.

In the next chapter, you will learn a different form of machine learning: dimensionality reduction.
As we saw in several earlier chapters, including when using images in this chapter, it is very easy to computationally generate many features. However, often we want to have a reduced number of features for speed and visualization, or to improve our results. In the next chapter, we will see how to achieve this.

Dimensionality Reduction

Garbage in, garbage out: throughout the book, we saw that this pattern also holds true when applying machine learning methods to training data. Looking back, we realize that the most interesting machine learning challenges always involved some sort of feature engineering, where we tried to use our insight into the problem to carefully craft additional features that the machine learner hopefully picks up.

In this chapter, we will go in the opposite direction with dimensionality reduction, which involves cutting away features that are irrelevant or redundant. Removing features might seem counter-intuitive at first thought, as more information should always be better than less information.
Also, even if we had redundant features in our dataset, would not the learning algorithm be able to quickly figure it out and set their weights to 0? The following are several good reasons that still hold in practice for trimming down the dimensions as much as possible:

• Superfluous features can irritate or mislead the learner. This is not the case with all machine learning methods (for example, Support Vector Machines love high dimensional spaces). However, most of the models feel safer with fewer dimensions.
• Another argument against high dimensional feature spaces is that more features mean more parameters to tune and a higher risk of overfitting.
• The data we retrieved to solve our task might just have artificially high dimensionality, whereas the real dimension might be small.
• Fewer dimensions = faster training = more parameter variations to try out in the same time frame = better end result.
• Visualization: if we want to visualize the data, we are restricted to two or three dimensions.

So, here we will show how to get rid of the garbage within our data while keeping the real valuable part of it.

Sketching our roadmap

Dimensionality reduction can be roughly grouped into feature selection and feature extraction methods.
We already employed some kind of feature selection in almost every chapter when we invented, analyzed, and then probably dropped some features. In this chapter, we will present some ways that use statistical methods, namely correlation and mutual information, to be able to do so in vast feature spaces. Feature extraction tries to transform the original feature space into a lower-dimensional feature space. This is especially useful when we cannot get rid of features using selection methods, but still have too many features for our learner. We will demonstrate this using principal component analysis (PCA), linear discriminant analysis (LDA), and multidimensional scaling (MDS).

Selecting features

If we want to be nice to our machine learning algorithm, we provide it with features that are not dependent on each other, yet highly dependent on the value to be predicted. This means that each feature adds salient information.
Removing any of the features will lead to a drop in performance. If we have only a handful of features, we could draw a matrix of scatter plots (one scatter plot for every feature pair combination; see the sketch after this paragraph). Relationships between the features could then be easily spotted. For every feature pair showing an obvious dependence, we would then think about whether we should remove one of them or instead design a newer, cleaner feature out of both.

Most of the time, however, we have more than a handful of features to choose from. Just think of the classification task where we had a bag of words to classify the quality of an answer, which would require a 1,000 by 1,000 scatter plot.
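Here is a minimal sketch of such a scatter plot matrix with pandas (X and feature_names are placeholder names for a small feature array and its column labels; scatter_matrix lives in pandas.plotting in recent pandas versions):

>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> from pandas.plotting import scatter_matrix
>>> # X: samples-by-features array with only a handful of columns
>>> df = pd.DataFrame(X, columns=feature_names)
>>> scatter_matrix(df, figsize=(8, 8), diagonal='kde')
>>> plt.show()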
With that many features, we need a more automated way to detect overlapping features and to resolve them. We will present two general ways to do so in the following subsections, namely filters and wrappers.

Detecting redundant features using filters

Filters try to clean up the feature forest independently of any machine learning method used later. They rely on statistical methods to find which of the features are redundant or irrelevant.
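To make the idea concrete, the following is a minimal sketch of such a filter using plain NumPy correlations (the function name and threshold are our own illustrative choices, not the book's code):

>>> import numpy as np
>>> def correlated_pairs(X, threshold=0.9):
...     # absolute Pearson correlations between all pairs of feature columns
...     corr = np.abs(np.corrcoef(X, rowvar=False))
...     n = corr.shape[0]
...     # report pairs above the threshold as candidates for removal
...     return [(i, j) for i in range(n)
...             for j in range(i + 1, n) if corr[i, j] > threshold]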