In case of redundant features, it keeps only one per redundant feature group. Irrelevant features will simply be removed. In general, the filter works as depicted in the following workflow:

[Figure: filter workflow — all features (x1, x2, ..., xN) → select features that are not redundant → some features (x2, x7, ..., xM) → select features that are not irrelevant → resulting features (x2, x10, x14)]

Correlation

Using correlation, we can easily see linear relationships between pairs of features. In the following graphs, we can see different degrees of correlation, together with a potential linear dependency plotted as a red dashed line (a fitted 1-dimensional polynomial).
The correlation coefficient $Cor(X_1, X_2)$ at the top of the individual graphs is calculated using the common Pearson correlation coefficient (Pearson r value) by means of the pearsonr() function of scipy.stats.

Given two equal-sized data series, it returns a tuple of the correlation coefficient value and the p-value. The p-value describes how likely it is that a data series like this has been generated by an uncorrelated system. In other words, the higher the p-value, the less we should trust the correlation coefficient:

>>> from scipy.stats import pearsonr
>>> pearsonr([1,2,3], [1,2,3.1])
(0.99962228516121843, 0.017498096813278487)
>>> pearsonr([1,2,3], [1,20,6])
(0.25383654128340477, 0.83661493668227405)

In the first case, we have a clear indication that both series are correlated.
In the second case, we still have a clearly non-zero r value. However, the p-value of 0.84 tells us that the correlation coefficient is not significant and we should not pay too close attention to it. Have a look at the following graphs:

In the first three cases, which have high correlation coefficients, we would probably want to throw out either X1 or X2 because they seem to convey similar, if not the same, information. In the last case, however, we should keep both features. In our application, this decision would, of course, be driven by the p-value.
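Putting this reasoning into code, a correlation-based filter over a feature matrix could look like the following sketch. This is our own illustration, not the book's code; the function name correlation_filter and both thresholds are arbitrary choices:

import numpy as np
from scipy.stats import pearsonr

def correlation_filter(X, r_threshold=0.9, p_threshold=0.05):
    # Return the column indices to keep, dropping one feature of every
    # highly correlated pair -- but only if the p-value suggests that
    # the coefficient can be trusted.
    n_features = X.shape[1]
    to_drop = set()
    for i in range(n_features):
        for j in range(i + 1, n_features):
            r, p = pearsonr(X[:, i], X[:, j])
            if p < p_threshold and abs(r) > r_threshold:
                to_drop.add(j)          # keep feature i, drop its near-copy j
    return [i for i in range(n_features) if i not in to_drop]

# Toy data: the third column is almost a copy of the first one
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
X[:, 2] = X[:, 0] + 0.01 * rng.rand(100)
print(correlation_filter(X))            # [0, 1] -- the redundant column 2 is dropped

Note that a feature is only dropped when the p-value indicates that the correlation coefficient is trustworthy, which mirrors the reasoning above.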
Although this worked nicely in the preceding example, reality is seldom nice to us. One big disadvantage of correlation-based feature selection is that it only detects linear relationships (relationships that can be modelled by a straight line). If we use correlation on non-linear data, we see the problem. In the following example, we have a quadratic relationship:

Although the human eye immediately sees the relationship between X1 and X2 in all but the bottom-right graph, the correlation coefficient does not.
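We can verify this numerically with a tiny sketch (synthetic data of our own making, not taken from the book):

import numpy as np
from scipy.stats import pearsonr

x = np.linspace(-1, 1, 100)
y = x ** 2                           # a perfect quadratic relationship

print(pearsonr(x, y))                # r is practically zero, p-value close to 1

# A simple transformation recovers a linear relationship:
print(pearsonr(x ** 2, y))           # r = 1.0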
It's obvious that correlation is useful for detecting linear relationships, but fails for everything else. Sometimes, it already helps to apply simple transformations to get a linear relationship. For instance, in the preceding plot, we would have got a high correlation coefficient if we had drawn X2 over X1 squared. Normal data, however, does not often offer this opportunity.

Luckily, for non-linear relationships, mutual information comes to the rescue.

Mutual information

When looking at feature selection, we should not focus on the type of relationship as we did in the previous section (linear relationships).
Instead, we should think in terms of how much information one feature provides (given that we already have another).

To understand this, let's pretend that we want to use the features house_size, number_of_levels, and avg_rent_price to train a classifier that outputs whether a house has an elevator or not. In this example, we intuitively see that, knowing house_size, we don't need to know number_of_levels anymore, as it somehow contains redundant information. With avg_rent_price, it's different, because we cannot infer the average price of rental space simply from the size of the house or the number of levels it has.
Thus, it would be wise to keep only one of house_size and number_of_levels in addition to the average price of rental space.

Mutual information formalizes the aforementioned reasoning by calculating how much information two features have in common. However, unlike correlation, it does not rely on a sequence of data, but on the distribution. To understand how it works, we have to dive a bit into information entropy.

Let's assume we have a fair coin. Before we flip it, we will have maximum uncertainty as to whether it will show heads or tails, as both have an equal probability of 50 percent. This uncertainty can be measured by means of Claude Shannon's information entropy:

$$H(X) = -\sum_{i=1}^{n} p(X_i) \log_2 p(X_i)$$

In our fair coin case, we have two outcomes: let $x_0$ be the case of heads and $x_1$ the case of tails, with $p(x_0) = p(x_1) = 0.5$. Thus, this works out to:

$$H(X) = -p(x_0) \log_2 p(x_0) - p(x_1) \log_2 p(x_1) = -0.5 \cdot \log_2(0.5) - 0.5 \cdot \log_2(0.5) = 1.0$$

For convenience, we can also use scipy.stats.entropy([0.5, 0.5], base=2).
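The following little sketch (ours, for illustration) shows the manual calculation and the SciPy shortcut side by side:

import numpy as np
from scipy.stats import entropy

def shannon_entropy(probs):
    # H(X) = -sum_i p(x_i) * log2 p(x_i)
    probs = np.asarray(probs)
    return -np.sum(probs * np.log2(probs))

print(shannon_entropy([0.5, 0.5]))     # 1.0 -- the fair coin
print(entropy([0.5, 0.5], base=2))     # 1.0 -- the same result via SciPy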
We set the base parameter to 2 to get the same result as earlier. Otherwise, the function will use the natural logarithm via np.log(). In general, the base does not matter (as long as you use it consistently).

Now, imagine we knew upfront that the coin is actually not that fair, with heads having a chance of 60 percent of showing up after flipping:

$$H(X) = -0.6 \cdot \log_2(0.6) - 0.4 \cdot \log_2(0.4) = 0.97$$

We see that this situation is less uncertain. The uncertainty will decrease the farther away we get from 0.5, reaching the extreme value of 0 for either 0 percent or 100 percent of heads showing up, as we can see in the following graph:

We will now modify the entropy H(X) by applying it to two features instead of one, in such a way that it measures how much uncertainty is removed from X when we learn about Y.
Then, we can capture how one feature reduces the uncertainty of another. For example, without having any further information about the weather, we are totally uncertain whether it's raining outside or not. If we now learn that the grass outside is wet, the uncertainty is reduced (although we would still have to check whether the sprinkler had been turned on).

More formally, mutual information is defined as:

$$I(X;Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} P(X_i, Y_j) \log_2 \frac{P(X_i, Y_j)}{P(X_i)\, P(Y_j)}$$

This looks a bit intimidating, but is really nothing more than sums and products. For instance, the calculation of P() is done by binning the feature values and then calculating the fraction of values in each bin.
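A minimal sketch of this binning approach might look as follows (the function name mutual_information and the use of np.histogram2d are our own choices, not the book's code):

import numpy as np

def mutual_information(x, y, bins=10):
    # Estimate I(X;Y) in bits by binning both features.
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()               # joint distribution P(X, Y)
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal P(X)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal P(Y)
    nonzero = p_xy > 0                         # skip empty bins to avoid log(0)
    return np.sum(p_xy[nonzero] *
                  np.log2(p_xy[nonzero] / (p_x * p_y)[nonzero]))

rng = np.random.RandomState(0)
x = rng.rand(1000)
print(mutual_information(x, x))                # high: X fully determines itself
print(mutual_information(x, rng.rand(1000)))   # close to 0 for independent data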
In the following plots, we have set the number of bins to ten.

In order to restrict mutual information to the interval [0,1], we have to divide it by the sum of the two features' individual entropies, which gives us the normalized mutual information:

$$NI(X;Y) = \frac{I(X;Y)}{H(X) + H(Y)}$$

The nice thing about mutual information is that, unlike correlation, it does not look only at linear relationships, as we can see in the following graphs:

As we can see, mutual information is able to indicate the strength of a linear relationship.
The following diagram shows that it also works for squared relationships:

So, what we would have to do is calculate the normalized mutual information for all feature pairs. For every pair with too high a value (we would have to determine what that means), we would then drop one of the two features. In the case of regression, we could also drop features that have too low mutual information with the desired result value. This might work for a not-too-big set of features. At some point, however, this procedure can become really expensive, because the amount of calculation grows quadratically (we are computing the mutual information between feature pairs).
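As a concrete illustration of such a pairwise filter, consider the following sketch. It is our own code, not the book's; the threshold of 0.4 is an arbitrary placeholder for the "too high" value mentioned above:

import numpy as np

def normalized_mutual_information(x, y, bins=10):
    # NI(X;Y) = I(X;Y) / (H(X) + H(Y)), estimated from binned data.
    p_xy, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = p_xy / p_xy.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    mi = np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x * p_y)[nz]))
    h_x = -np.sum(p_x[p_x > 0] * np.log2(p_x[p_x > 0]))
    h_y = -np.sum(p_y[p_y > 0] * np.log2(p_y[p_y > 0]))
    return mi / (h_x + h_y)

def nmi_filter(X, threshold=0.4, bins=10):
    # Drop one feature of every pair whose normalized mutual information is too high.
    n_features = X.shape[1]
    to_drop = set()
    for i in range(n_features):
        for j in range(i + 1, n_features):     # quadratic number of pairs
            if i in to_drop or j in to_drop:
                continue
            if normalized_mutual_information(X[:, i], X[:, j], bins) > threshold:
                to_drop.add(j)                 # i and j carry largely the same information
    return [i for i in range(n_features) if i not in to_drop]

rng = np.random.RandomState(0)
X = rng.rand(500, 3)
X[:, 2] = X[:, 0] + 0.001 * rng.rand(500)      # column 2 is nearly a copy of column 0
print(nmi_filter(X))                           # expected: [0, 1] -- the near-copy is dropped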
Another huge disadvantage of filters is that they drop features that are not useful in isolation. More often than not, there are a handful of features that seem to be totally independent of the target variable, yet when combined together they rock. To keep these, we need wrappers.

Asking the model about the features using wrappers

While filters can help tremendously in getting rid of useless features, they can only go so far.
After all the filtering, there might still be some features that are independent of each other and show some degree of dependence with the result variable, but that are totally useless from the model's point of view. Just think of the following data, which describes the XOR function. Individually, neither A nor B would show any signs of dependence on Y, whereas together they clearly do:

A   B   Y
0   0   0
0   1   1
1   0   1
1   1   0

So, why not ask the model itself to give its vote on the individual features? This is what wrappers do, as we can see in the following process chart:

[Figure: wrapper workflow — current features (initialized with all features x1, x2, ..., xN) → train a model with y and check the importance of the individual features → if the feature set is still too big, drop the unimportant features and repeat; otherwise, return the resulting features (x2, x10, x14)]

Here, we pushed the calculation of feature importance into the model training process. Unfortunately (but understandably), feature importance is not determined as a binary value, but as a ranking value.
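To see this in action, here is a small toy sketch (our own, not the book's code) in which two features are useless in isolation but get ranked as important as soon as we ask a model. A RandomForestClassifier is just one convenient choice of model that exposes such a ranking:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Two binary features whose XOR determines the target, plus one pure noise feature
A = rng.randint(0, 2, 1000)
B = rng.randint(0, 2, 1000)
X = np.c_[A, B, rng.rand(1000)]
y = A ^ B

# Individually, neither A nor B shows any (linear) dependence on y ...
print(np.corrcoef(A, y)[0, 1], np.corrcoef(B, y)[0, 1])    # both close to 0

# ... but a trained model ranks both far above the noise column
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.feature_importances_)    # A and B typically score much higher than the noise

The importances come back as floating point scores that sum to one, that is, as a ranking rather than a keep/drop decision, which is exactly the situation described above: we still have to decide where to draw the line.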