Take a look at the following plot: Canadian examples are shown as diamonds, Kama seeds as circles, and Rosa seeds as triangles. Their respective areas are shown as white, black, and grey. You might be wondering why the regions are so horizontal, almost weirdly so. The problem is that the x axis (area) ranges from 10 to 22, while the y axis (compactness) ranges from 0.75 to 1.0. This means that a small change in x is actually much larger than a small change in y. So, when we compute the distance between points, we are, for the most part, only taking the x axis into account. This is also a good example of why it is a good idea to visualize our data and look for red flags or surprises.

If you studied physics (and you remember your lessons), you might have already noticed that we have been summing up lengths, areas, and dimensionless quantities, mixing up our units (which is something you never want to do in a physical system). We need to normalize all of the features to a common scale. There are many solutions to this problem; a simple one is to normalize to z-scores.
The z-score of a value is how far away from the mean it is, in units of standard deviation. It comes down to this operation:

f′ = (f − µ) / σ

In this formula, f is the old feature value, f′ is the normalized feature value, µ is the mean of the feature, and σ is the standard deviation. Both µ and σ are estimated from training data.
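Spelled out in code, this is just a couple of array operations. The following is a minimal sketch with NumPy (not code from the book); the names features and test_features are placeholders for an array of training examples (one row per example) and an array of new examples:

>>> import numpy as np
>>> mu = features.mean(axis=0)       # per-feature mean, estimated on training data
>>> sigma = features.std(axis=0)     # per-feature standard deviation
>>> features_z = (features - mu) / sigma
>>> # New data must be scaled with the training mu and sigma, not its own:
>>> test_z = (test_features - mu) / sigma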
Independent of what the original values were, after z-scoring, a value of zero corresponds to the training mean, positive values are above the mean, and negative values are below it.

The scikit-learn module makes it very easy to use this normalization as a preprocessing step. We are going to use a pipeline of transformations: the first element will do the transformation and the second element will do the classification. We start by importing both the pipeline and the feature scaling classes as follows:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler

Now, we can combine them:

>>> classifier = KNeighborsClassifier(n_neighbors=1)
>>> classifier = Pipeline([('norm', StandardScaler()),
...                        ('knn', classifier)])

The Pipeline constructor takes a list of pairs (str, clf). Each pair corresponds to a step in the pipeline: the first element is a string naming the step, while the second element is the object that performs it. Advanced usage of the object uses these names to refer to different steps.
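As a small illustration of that (not code from the book), the step names let you reach into the pipeline and address the parameters of individual steps with scikit-learn's step__parameter convention:

>>> classifier.named_steps['knn']                # the KNeighborsClassifier inside the pipeline
>>> classifier.get_params()['knn__n_neighbors']  # step parameters are addressed as <step>__<parameter>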
After normalization, every feature is in the same units (technically, every feature is now dimensionless; it has no units) and we can more confidently mix dimensions. In fact, if we now run our nearest neighbor classifier, we obtain 93 percent accuracy, estimated with the same five-fold cross-validation code shown previously!

Look at the decision space again in two dimensions: the boundaries are now different and you can see that both dimensions make a difference for the outcome. In the full dataset, everything is happening in a seven-dimensional space, which is very hard to visualize, but the same principle applies: while a few dimensions are dominant in the original data, after normalization, they are all given the same importance.
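For reference, an estimate like the 93 percent figure above can also be obtained with scikit-learn's cross_val_score helper instead of hand-written folds. This is only a sketch, not the book's own code: it assumes a recent scikit-learn (where the helper lives in sklearn.model_selection) and that features and labels hold the seeds data, and the exact number may differ slightly:

>>> from sklearn.model_selection import cross_val_score
>>> scores = cross_val_score(classifier, features, labels, cv=5)  # five-fold cross-validation
>>> print('Mean accuracy: {:.1%}'.format(scores.mean()))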
Binary and multiclass classification

The first classifier we used, the threshold classifier, was a simple binary classifier. Its result is either one class or the other, as a point is either above the threshold value or it is not. The second classifier we used, the nearest neighbor classifier, was a natural multiclass classifier: its output can be one of several classes.

It is often simpler to define a binary method than one that works on multiclass problems. However, we can reduce any multiclass problem to a series of binary decisions. This is what we did earlier in the Iris dataset, in a haphazard way: we observed that it was easy to separate one of the initial classes and focused on the other two, reducing the problem to two binary decisions:

1. Is it an Iris Setosa (yes or no)?
2. If not, check whether it is an Iris Virginica (yes or no).

Of course, we want to leave this sort of reasoning to the computer. As usual, there are several solutions to this multiclass reduction.

The simplest is to use a series of one versus the rest classifiers. For each possible label ℓ, we build a classifier of the type "is this ℓ or something else?" When applying the rule, exactly one of the classifiers will say yes and we will have our solution. Unfortunately, this does not always happen, so we have to decide how to deal with either multiple positive answers or no positive answers.

Alternatively, we can build a classification tree.
Split the possible labels into two, and build a classifier that asks, "Should this example go in the left or the right bin?" We can perform this splitting recursively until we obtain a single label. The preceding diagram depicts the tree of reasoning for the Iris dataset. Each diamond is a single binary classifier. It is easy to imagine that we could make this tree larger and encompass more decisions. This means that any classifier that can be used for binary classification can also be adapted to handle any number of classes in a simple way.

There are many other possible ways of turning a binary method into a multiclass one. There is no single method that is clearly better in all cases. The scikit-learn module implements several of these methods in the sklearn.multiclass submodule.
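As a quick illustration (a sketch, not the book's code), the one versus the rest scheme can be applied by wrapping any estimator, such as the normalized nearest neighbor pipeline from before, in OneVsRestClassifier; features and labels are again assumed to hold the dataset:

>>> from sklearn.multiclass import OneVsRestClassifier
>>> ovr = OneVsRestClassifier(classifier)   # builds one "this label or not?" classifier per class
>>> ovr.fit(features, labels)
>>> print(ovr.predict(features[:3]))        # predictions for the first three examples

For a method that is already multiclass, such as nearest neighbors, the wrapper is redundant; it is useful when the underlying method is intrinsically binary.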
Some classifiers are binary systems, while many real-life problems are naturally multiclass. Several simple protocols reduce a multiclass problem to a series of binary decisions and allow us to apply the binary models to our multiclass problem. This means methods that are apparently only for binary data can be applied to multiclass data with little extra effort.

Summary

Classification means generalizing from examples to build a model (that is, a rule that can automatically be applied to new, unclassified objects). It is one of the fundamental tools in machine learning and we will see many more examples of this in the forthcoming chapters.

In a sense, this was a very theoretical chapter, as we introduced generic concepts with simple examples. We went over a few operations with the Iris dataset.
This is a small dataset. However, it has the advantage that we were able to plot it out and see what we were doing in detail. This is something that will be lost when we move on to problems with many dimensions and many thousands of examples. The intuitions we gained here will all still be valid.

You also learned that the training error is a misleading, over-optimistic estimate of how well the model does. We must, instead, evaluate it on testing data that has not been used for training. In order to not waste too many examples in testing, a cross-validation schedule can get us the best of both worlds (at the cost of more computation).

We also had a look at the problem of feature engineering.
Features are not predefined for you, but choosing and designing features is an integral part of designing a machine learning pipeline. In fact, it is often the area where you can get the most improvements in accuracy, as better data beats fancier methods. The chapters on text-based classification, music genre recognition, and computer vision will provide examples for these specific settings.

The next chapter looks at how to proceed when your data does not have predefined classes for classification.

Clustering – Finding Related Posts

In the previous chapter, you learned how to find the classes or categories of individual datapoints. With a handful of training data items that were paired with their respective classes, you learned a model, which we can now use to classify future data items.
We called this supervised learning because the learning was guided by a teacher; in our case, the teacher had the form of correct classifications.

Let's now imagine that we do not possess those labels by which we can learn the classification model. This could be, for example, because they were too expensive to collect. Just imagine the cost if the only way to obtain millions of labels would be to ask humans to classify them manually. What could we do in that case? Well, of course, we would not be able to learn a classification model. Still, we could find some pattern within the data itself.