For example, in spam filtering, it may be worse to delete a good e-mail than to erroneously let a bad e-mail through. In that case, we may want to choose a model that is conservative in throwing out e-mails rather than the one that just makes the fewest mistakes overall. We can discuss these issues in terms of gain (which we want to maximize) or loss (which we want to minimize). They are equivalent, but sometimes one is more convenient than the other.

We can play around with these three aspects of classifiers and get different systems. A simple threshold is one of the simplest models available in machine learning libraries and only works well when the problem is very simple, such as with the Iris dataset.
In the next section, we will tackle a more difficult classification task that requires a more complex structure.

In our case, we optimized the threshold to minimize the number of errors. Alternatively, we might have different loss functions. It might be that one type of error is much costlier than the other.
In a medical setting, false negatives and false positives are not equivalent. A false negative (when the result of a test comes back negative, but that is false) might lead to the patient not receiving treatment for a serious disease. A false positive (when the test comes back positive even though the patient does not actually have that disease) might lead to additional tests to confirm the diagnosis or to unnecessary treatment (which can still have costs, including side effects from the treatment, but is often less serious than missing a diagnosis).
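One way to make this trade-off concrete is to choose the threshold that minimizes a weighted cost rather than the raw number of mistakes. The following sketch illustrates the idea; the array names (values, is_positive) and the cost weights are assumptions made up for this example, not code from this book:

import numpy as np

def best_threshold(values, is_positive, cost_fn=10.0, cost_fp=1.0):
    # Try every observed value as a candidate threshold and keep the one
    # with the lowest weighted cost of false negatives and false positives.
    best_t, best_cost = None, float('inf')
    for t in values:
        predicted_positive = values > t
        fn = np.sum(~predicted_positive & is_positive)  # missed positives
        fp = np.sum(predicted_positive & ~is_positive)  # false alarms
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

With cost_fn much larger than cost_fp, the selected threshold will err on the side of flagging more cases as positive.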
Therefore, depending on the exact setting, different trade-offs can make sense. At one extreme, if the disease is fatal and the treatment is cheap with very few negative side effects, then you want to minimize false negatives as much as you can.

What the gain/cost function should be is always dependent on the exact problem you are working on. When we present a general-purpose algorithm, we often focus on minimizing the number of mistakes, achieving the highest accuracy. However, if some mistakes are costlier than others, it might be better to accept a lower overall accuracy to minimize the overall costs.

A more complex dataset and a more complex classifier

We will now look at a slightly more complex dataset. This will motivate the introduction of a new classification algorithm and a few other ideas.

Learning about the Seeds dataset

We now look at another agricultural dataset, which is still small, but already too large to plot exhaustively on a page as we did with Iris.
This dataset consists of measurements of wheat seeds. There are seven features, which are as follows:

• area A
• perimeter P
• compactness C = 4πA/P²
• length of kernel
• width of kernel
• asymmetry coefficient
• length of kernel groove

There are three classes, corresponding to three wheat varieties: Canadian, Kama, and Rosa. As earlier, the goal is to be able to classify the species based on these morphological measurements. Unlike the Iris dataset, which was collected in the 1930s, this is a very recent dataset and its features were automatically computed from digital images.

This is how image pattern recognition can be implemented: you can take images, in digital form, compute a few relevant features from them, and use a generic classification system.
In Chapter 10, Computer Vision, we will work through the computer vision side of this problem and compute features in images. For the moment, we will work with the features that are given to us.

UCI Machine Learning Dataset Repository

The University of California at Irvine (UCI) maintains an online repository of machine learning datasets (at the time of writing, they list 233 datasets). Both the Iris and the Seeds datasets used in this chapter were taken from there. The repository is available online at http://archive.ics.uci.edu/ml/.

Features and feature engineering

One interesting aspect of these features is that the compactness feature is not actually a new measurement, but a function of the previous two features, area and perimeter. It is often very useful to derive new combined features.
Trying to create new features is generally called feature engineering. It is sometimes seen as less glamorous than algorithms, but it often matters more for performance (a simple algorithm on well-chosen features will perform better than a fancy algorithm on not-so-good features).

In this case, the original researchers computed the compactness, which is a typical feature for shapes. It is also sometimes called roundness. This feature will have the same value for two kernels, one of which is twice as big as the other one, but with the same shape. However, it will have different values for kernels that are very round (when the feature is close to one) when compared to kernels that are elongated (when the feature is closer to zero).

The goals of a good feature are to simultaneously vary with what matters (the desired output) and be invariant with what does not.
For example, compactness does not vary with size, but varies with the shape. In practice, it might be hard to achieve both objectives perfectly, but we want to approximate this ideal.

You will need to use background knowledge to design good features. Fortunately, for many problem domains, there is already a vast literature of possible features and feature types that you can build upon. For images, all of the previously mentioned features are typical and computer vision libraries will compute them for you. In text-based problems too, there are standard solutions that you can mix and match (we will also see this in the next chapter).
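Returning to the compactness feature defined earlier, here is a short sketch of how such a derived feature could be computed from the raw measurements; the array names area and perimeter are assumptions used only for illustration:

import numpy as np

# Derived feature: compactness C = 4*pi*A / P**2 (also called roundness).
# `area` and `perimeter` are assumed to be NumPy arrays of measurements.
def compactness(area, perimeter):
    return 4 * np.pi * area / perimeter ** 2

# A perfectly circular kernel gives a value close to 1;
# elongated kernels give values closer to 0.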
When possible, you should use your knowledge of the problem to design a specific feature or to select which ones from the literature are more applicable to the data at hand. Even before you have data, you must decide which data is worthwhile to collect. Then, you hand all your features to the machine to evaluate and compute the best classifier.

A natural question is whether we can select good features automatically. This problem is known as feature selection. There are many methods that have been proposed for this problem, but in practice very simple ideas work best. For the small problems we are currently exploring, it does not make sense to use feature selection, but if you had thousands of features, then throwing out most of them might make the rest of the process much faster.

Nearest neighbor classification

For use with this dataset, we will introduce a new classifier: the nearest neighbor classifier.
The nearest neighbor classifier is very simple. When classifying a new element, it looks at the training data for the object that is closest to it, its nearest neighbor. Then, it returns its label as the answer. Notice that this model performs perfectly on its training data! For each point, its closest neighbor is itself, and so its label matches perfectly (unless two examples with different labels have exactly the same feature values, which would indicate that the features you are using are not very descriptive). Therefore, it is essential to test the classification using a cross-validation protocol.

The nearest neighbor method can be generalized to look not at a single neighbor, but at multiple ones, and take a vote amongst the neighbors.
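To make the idea concrete, here is a minimal sketch of a k-nearest neighbor prediction in plain NumPy. It is only an illustration under assumed array names (train_features, train_labels), not the implementation we will use below:

import numpy as np
from collections import Counter

def knn_predict(train_features, train_labels, new_example, k=1):
    # Euclidean distance from the new example to every training example
    distances = np.sqrt(((train_features - new_example) ** 2).sum(axis=1))
    # Indices of the k closest training examples
    nearest = np.argsort(distances)[:k]
    # Majority vote amongst the labels of those neighbors
    votes = Counter(train_labels[nearest])
    return votes.most_common(1)[0][0]

With k=1, this reduces to returning the label of the single closest training example.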
Taking a vote amongst several neighbors makes the method more robust to outliers or mislabeled data.

Classifying with scikit-learn

We have been using handwritten classification code, but Python is a very appropriate language for machine learning because of its excellent libraries. In particular, scikit-learn has become the standard library for many machine learning tasks, including classification.
We are going to use its implementation of nearest neighbor classification in this section.

The scikit-learn classification API is organized around classifier objects. These objects have the following two essential methods:

• fit(features, labels): This is the learning step and fits the parameters of the model
• predict(features): This method can only be called after fit and returns a prediction for one or more inputs

Here is how we could use its implementation of k-nearest neighbors for our data.
We start by importing the KNeighborsClassifier object from the sklearn.neighbors submodule:

>>> from sklearn.neighbors import KNeighborsClassifier

The scikit-learn module is imported as sklearn (sometimes you will also find that scikit-learn is referred to using this short name instead of the full name). All of the sklearn functionality is in submodules, such as sklearn.neighbors.

We can now instantiate a classifier object. In the constructor, we specify the number of neighbors to consider, as follows:

>>> classifier = KNeighborsClassifier(n_neighbors=1)

If we do not specify the number of neighbors, it defaults to 5, which is often a good choice for classification.

We will want to use cross-validation (of course) to look at our data.
The scikit-learn module also makes this easy:

>>> from sklearn.cross_validation import KFold
>>> kf = KFold(len(features), n_folds=5, shuffle=True)
>>> # `means` will be a list of mean accuracies (one entry per fold)
>>> means = []
>>> for training, testing in kf:
...     # We fit a model for this fold, then apply it to the
...     # testing data with `predict`:
...     classifier.fit(features[training], labels[training])
...     prediction = classifier.predict(features[testing])
...
...     # np.mean on an array of booleans returns fraction
...     # of correct decisions for this fold:
...     curmean = np.mean(prediction == labels[testing])
...     means.append(curmean)
>>> print("Mean accuracy: {:.1%}".format(np.mean(means)))
Mean accuracy: 90.5%

Using five folds for cross-validation, for this dataset, with this algorithm, we obtain 90.5 percent accuracy.
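In more recent scikit-learn releases, the cross_validation module was replaced by model_selection. A roughly equivalent sketch using that newer API (with the same features and labels arrays assumed to be loaded) would be:

from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Same classifier and five shuffled folds as above; the exact accuracy
# will vary slightly with the random shuffle.
classifier = KNeighborsClassifier(n_neighbors=1)
kf = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(classifier, features, labels, cv=kf)
print("Mean accuracy: {:.1%}".format(scores.mean()))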
As we discussed in the earlier section, the cross-validation accuracy is lower than the training accuracy, but this is a more credible estimate of the performance of the model.

Looking at the decision boundaries

We will now examine the decision boundary. In order to plot these on paper, we will simplify and look at only two dimensions.
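One common way to visualize such a boundary is to evaluate the classifier on a dense grid over two chosen features. The sketch below is an illustration only, not the plotting code used in the book; it assumes features and labels are loaded and arbitrarily uses the first two columns:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Map the labels to integers so they can be used for coloring.
label_names = sorted(set(labels))
y = np.array([label_names.index(lab) for lab in labels])

# Re-fit the classifier using only the two selected features.
classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(features[:, :2], y)

# Predict over a grid covering the range of the two features and
# draw the predicted class as filled regions, with the data on top.
x0, x1 = features[:, 0], features[:, 1]
xx, yy = np.meshgrid(np.linspace(x0.min(), x0.max(), 200),
                     np.linspace(x1.min(), x1.max(), 200))
zz = classifier.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(x0, x1, c=y)
plt.show()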