Building Machine Learning Systems with Python (779436), page 41
So, we still have to specify where to make the cut: what part of the features are we willing to take, and what part do we want to drop?

Coming back to scikit-learn, we find various excellent wrapper classes in the sklearn.feature_selection package. A real workhorse in this field is RFE, which stands for recursive feature elimination. It takes an estimator and the desired number of features to keep as parameters and then trains the estimator with various feature sets until it has found a subset of features that is small enough.
The RFE instance itself pretends to be an estimator, thereby, indeed, wrapping the provided estimator.

In the following example, we create an artificial classification problem of 100 samples using datasets' convenient make_classification() function. It lets us specify the creation of 10 features, out of which only three are really valuable to solve the classification problem:

>>> from sklearn.feature_selection import RFE
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=100, n_features=10,
...                            n_informative=3, random_state=0)
>>> clf = LogisticRegression()
>>> clf.fit(X, y)
>>> selector = RFE(clf, n_features_to_select=3)
>>> selector = selector.fit(X, y)
>>> print(selector.support_)
[False  True False  True False False False False  True False]
>>> print(selector.ranking_)
[4 1 3 1 8 5 7 6 1 2]

The problem in real-world scenarios is, of course, how can we know the right value for n_features_to_select? Truth is, we can't.
However, most of the time we can use a sample of the data and play with it using different settings to quickly get a feeling for the right ballpark.

Chapter 11

The good thing is that we don't have to be that exact when using wrappers. Let's try different values for n_features_to_select to see how support_ and ranking_ change:

n_features_to_select  support_                                                       ranking_
1                     [False False False  True False False False False False False]  [ 6  3  5  1 10  7  9  8  2  4]
2                     [False False False  True False False False False  True False]  [5 2 4 1 9 6 8 7 1 3]
3                     [False  True False  True False False False False  True False]  [4 1 3 1 8 5 7 6 1 2]
4                     [False  True False  True False False False False  True  True]  [3 1 2 1 7 4 6 5 1 1]
5                     [False  True  True  True False False False False  True  True]  [2 1 1 1 6 3 5 4 1 1]
6                     [ True  True  True  True False False False False  True  True]  [1 1 1 1 5 2 4 3 1 1]
7                     [ True  True  True  True False  True False False  True  True]  [1 1 1 1 4 1 3 2 1 1]
8                     [ True  True  True  True False  True False  True  True  True]  [1 1 1 1 3 1 2 1 1 1]
9                     [ True  True  True  True False  True  True  True  True  True]  [1 1 1 1 2 1 1 1 1 1]
10                    [ True  True  True  True  True  True  True  True  True  True]  [1 1 1 1 1 1 1 1 1 1]

We see that the result is very stable.
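A table like the one above can be regenerated with a short loop. The following is only a sketch: the results dictionary and the loop variable n are names introduced here for illustration, and the exact support_ and ranking_ values may differ from the printed table depending on the scikit-learn version in use.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Same artificial problem as in the text: 10 features, 3 of them informative.
X, y = make_classification(n_samples=100, n_features=10,
                           n_informative=3, random_state=0)

# results is a hypothetical helper dict: n -> (support_, ranking_)
results = {}
for n in range(1, 11):
    # RFE fits its own clone of the estimator, so no prior fit is needed.
    selector = RFE(LogisticRegression(), n_features_to_select=n)
    selector.fit(X, y)
    results[n] = (selector.support_, selector.ranking_)
    print(n, selector.support_, selector.ranking_)
```

Each support_ mask marks exactly n features as selected, and the selected features always have rank 1 in ranking_.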
Features that have been used when requesting smaller feature sets keep on getting selected when letting more features in. In the end, we rely on our train/test set splitting to warn us when we go the wrong way.

Other feature selection methods

There are several other feature selection methods that you will discover while reading through machine learning literature. Some don't even look like feature selection methods because they are embedded into the learning process (not to be confused with the aforementioned wrappers). Decision trees, for instance, have a feature selection mechanism implanted deep in their core. Other learning methods employ some kind of regularization that punishes model complexity, thus driving the learning process towards well-performing models that are still "simple".
They do this by decreasing the importance of the less impactful features to zero and then dropping them (L1-regularization).

So watch out! Often, the power of machine learning methods has to be attributed to their implanted feature selection method to a great degree.

Dimensionality Reduction

Feature extraction

At some point, after we have removed redundant features and dropped irrelevant ones, we often still find that we have too many features.
No matter what learning method we use, they all perform badly, and given the huge feature space, we understand that they actually cannot do better. We realize that we have to cut into living flesh; we have to get rid of features that all common sense tells us are valuable. Another situation in which we need to reduce the dimensions, and in which feature selection does not help much, is when we want to visualize data. Then, we need to have at most three dimensions at the end to provide any meaningful graphs.

Enter feature extraction methods.
They restructure the feature space to make it more accessible to the model or simply cut down the dimensions to two or three so that we can show dependencies visually.

Again, we can distinguish feature extraction methods as being linear or non-linear. Also, as in the Selecting features section, we will present one method for each type (principal component analysis as a linear method and a non-linear version of multidimensional scaling). Although they are widely known and used, they are only representatives of many more interesting and powerful feature extraction methods.

About principal component analysis

Principal component analysis (PCA) is often the first thing to try out if you want to cut down the number of features and do not know what feature extraction method to use.
PCA is limited as it's a linear method, but chances are that it already goes far enough for your model to learn well enough. Add to this the strong mathematical properties it offers, the speed at which it finds the transformed feature space, and its ability to later transform between original and transformed features, and we can almost guarantee that it will also become one of your frequently used machine learning tools.

Summarizing it, given the original feature space, PCA finds a linear projection of it into a lower-dimensional space that has the following properties:

• The conserved variance is maximized.
• The final reconstruction error (when trying to go back from transformed features to original ones) is minimized.

As PCA simply transforms the input data, it can be applied both to classification and regression problems.
In this section, we will use a classification task to discuss the method.

Sketching PCA

PCA involves a lot of linear algebra, which we do not want to go into. Nevertheless, the basic algorithm can be easily described as follows:

1. Center the data by subtracting the mean from it.
2. Calculate the covariance matrix.
3. Calculate the eigenvectors of the covariance matrix.

If we start with N features, then the algorithm will return a transformed feature space again with N dimensions (we gained nothing so far). The nice thing about this algorithm, however, is that the eigenvalues indicate how much of the variance is described by the corresponding eigenvector.

Let's assume we start with N = 1000 features and we know that our model does not work well with more than 20 features. Then, we simply pick the 20 eigenvectors with the highest eigenvalues.

Applying PCA

Let's consider the following artificial dataset, which is visualized in the left plot of the following diagram:

>>> import numpy as np
>>> x1 = np.arange(0, 10, .2)
>>> x2 = x1 + np.random.normal(loc=0, scale=1, size=len(x1))
>>> X = np.c_[(x1, x2)]
>>> good = (x1 > 5) | (x2 > 5)  # some arbitrary classes
>>> bad = ~good  # to make the example look good

Scikit-learn provides the PCA class in its decomposition package.
In this example, we can clearly see that one dimension should be enough to describe the data. We can specify this using the n_components parameter:

>>> from sklearn import linear_model, decomposition, datasets
>>> pca = decomposition.PCA(n_components=1)

Here, too, we can use the fit() and transform() methods of pca (or its fit_transform() combination) to analyze the data and project it into the transformed feature space:

>>> Xtrans = pca.fit_transform(X)

As we have specified, Xtrans contains only one dimension.
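To connect this projection with the three steps from the Sketching PCA section, the same result can be reproduced by hand with NumPy. This is a minimal sketch rather than scikit-learn's actual implementation: the fixed random seed and the names X_centered, cov, eigvals, eigvecs, manual, and manual_ratio are assumptions introduced here, and the sign of a component is arbitrary, so the two projections are compared by absolute value.

```python
import numpy as np
from sklearn import decomposition

# Same kind of artificial dataset as in the text; the seed is an
# assumption added only to make the sketch reproducible.
rng = np.random.RandomState(0)
x1 = np.arange(0, 10, .2)
x2 = x1 + rng.normal(loc=0, scale=1, size=len(x1))
X = np.c_[x1, x2]

# 1. Center the data by subtracting the per-feature mean.
X_centered = X - X.mean(axis=0)

# 2. Calculate the covariance matrix (features are in columns).
cov = np.cov(X_centered, rowvar=False)

# 3. Calculate eigenvalues and eigenvectors. eigh() is meant for
#    symmetric matrices and returns eigenvalues in ascending order,
#    so we reorder them from largest to smallest.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the eigenvector with the highest eigenvalue and project onto it.
n_components = 1
manual = X_centered @ eigvecs[:, :n_components]
manual_ratio = eigvals[:n_components] / eigvals.sum()

# Compare with scikit-learn's PCA on the same data.
pca = decomposition.PCA(n_components=n_components)
Xtrans = pca.fit_transform(X)
print(np.allclose(np.abs(manual), np.abs(Xtrans)))
print(np.allclose(manual_ratio, pca.explained_variance_ratio_))
```

The share of each eigenvalue in the sum of all eigenvalues is the fraction of variance that the corresponding direction retains, which is what explained_variance_ratio_ reports.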
You can see the result in the right plot of the preceding diagram. The outcome is even linearly separable in this case. We would not even need a complex classifier to distinguish between both classes.

To get an understanding of the reconstruction error, we can have a look at the variance of the data that we have retained in the transformation:

>>> print(pca.explained_variance_ratio_)
[ 0.96393127]

This means that after going from two dimensions to one, we are still left with 96 percent of the variance.

Of course, it's not always this simple.
Oftentimes, we don't know upfront what number of dimensions is advisable. In that case, we leave the n_components parameter unspecified when initializing PCA to let it calculate the full transformation. After fitting the data, explained_variance_ratio_ contains an array of ratios in decreasing order: the first value is the ratio of the basis vector describing the direction of the highest variance, the second value is the ratio of the direction of the second highest variance, and so on. After plotting this array, we quickly get a feel for how many components we would need: the number of components immediately before the chart has its elbow is often a good guess.

A plot displaying the explained variance over the number of components is called a scree plot. A nice example of combining a scree plot with a grid search to find the best setting for the classification problem can be found at http://scikit-learn.sourceforge.net/stable/auto_examples/plot_digits_pipe.html.

Limitations of PCA and how LDA can help

Being a linear method, PCA has, of course, its limitations when we are faced with data that has non-linear relationships.