This has some computational advantages since the resulting parameter optimization problem is smooth, unlike the discrete split point search in the tree-based approach. The soft splits might also help in prediction accuracy and provide a useful alternative description of the data.

There are other differences between HMEs and the CART implementation of trees. In an HME, a linear (or logistic regression) model is fit in each terminal node, instead of a constant as in CART. The splits can be multiway, not just binary, and the splits are probabilistic functions of a linear combination of inputs, rather than a single input as in the standard use of CART.
However, the relative merits of these choices are not clear, and most were discussed at the end of Section 9.2.

A simple two-level HME model is shown in Figure 9.13. It can be thought of as a tree with soft splits at each non-terminal node. However, the inventors of this methodology use a different terminology. The terminal nodes are called experts, and the non-terminal nodes are called gating networks. The idea is that each expert provides an opinion (prediction) about the response, and these are combined together by the gating networks. As we will see, the model is formally a mixture model, and the two-level model in the figure can be extended to multiple levels, hence the name hierarchical mixtures of experts.
FIGURE 9.13. A two-level hierarchical mixture of experts (HME) model. (The figure shows a top gating network with outputs g_1, g_2, second-level gating networks with outputs g_{ℓ|j}, and expert networks Pr(y|x, θ_jℓ) at the terminal nodes.)

Consider the regression or classification problem, as described earlier in the chapter.
The data is (x_i, y_i), i = 1, 2, ..., N, with y_i either a continuous or binary-valued response, and x_i a vector-valued input. For ease of notation we assume that the first element of x_i is one, to account for intercepts.

Here is how an HME is defined. The top gating network has the output

    g_j(x, \gamma_j) = \frac{e^{\gamma_j^T x}}{\sum_{k=1}^{K} e^{\gamma_k^T x}}, \qquad j = 1, 2, \ldots, K,        (9.25)

where each γ_j is a vector of unknown parameters. This represents a soft K-way split (K = 2 in Figure 9.13). Each g_j(x, γ_j) is the probability of assigning an observation with feature vector x to the jth branch. Notice that with K = 2 groups, if we take the coefficient of one of the elements of x to be +∞, then we get a logistic curve with infinite slope.
In this case, the gating probabilities are either 0 or 1, corresponding to a hard split on that input.

At the second level, the gating networks have a similar form:

    g_{\ell|j}(x, \gamma_{j\ell}) = \frac{e^{\gamma_{j\ell}^T x}}{\sum_{k=1}^{K} e^{\gamma_{jk}^T x}}, \qquad \ell = 1, 2, \ldots, K.        (9.26)

This is the probability of assignment to the ℓth branch, given assignment to the jth branch at the level above.

At each expert (terminal node), we have a model for the response variable of the form

    Y \sim \Pr(y|x, \theta_{j\ell}).        (9.27)

This differs according to the problem.

Regression: The Gaussian linear regression model is used, with θ_jℓ = (β_jℓ, σ²_jℓ):

    Y = \beta_{j\ell}^T x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma_{j\ell}^2).        (9.28)

Classification: The linear logistic regression model is used:

    \Pr(Y = 1|x, \theta_{j\ell}) = \frac{1}{1 + e^{-\theta_{j\ell}^T x}}.        (9.29)

Denoting the collection of all parameters by Ψ = {γ_j, γ_jℓ, θ_jℓ}, the total probability that Y = y is

    \Pr(y|x, \Psi) = \sum_{j=1}^{K} g_j(x, \gamma_j) \sum_{\ell=1}^{K} g_{\ell|j}(x, \gamma_{j\ell}) \Pr(y|x, \theta_{j\ell}).        (9.30)

This is a mixture model, with the mixture probabilities determined by the gating network models.
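To make the formulas concrete, here is a minimal numerical sketch (in Python, with hypothetical, randomly chosen parameter values) of the gating probabilities (9.25)-(9.26) and the total mixture probability (9.30) for a two-level regression HME with Gaussian experts (9.28). It is an illustration of the equations, not a reference implementation.

    import numpy as np

    def softmax(scores):
        # Numerically stable softmax: exponentiate and normalize (eqs. 9.25, 9.26).
        e = np.exp(scores - scores.max())
        return e / e.sum()

    def hme_prob(y, x, gammas, gammas2, betas, sigmas):
        """Mixture probability Pr(y | x, Psi) of eq. (9.30) for a two-level
        regression HME with Gaussian experts (eq. 9.28).

        gammas  : (K, p)    top-level gating parameters gamma_j
        gammas2 : (K, K, p) second-level gating parameters gamma_jl
        betas   : (K, K, p) expert regression coefficients beta_jl
        sigmas  : (K, K)    expert standard deviations sigma_jl
        """
        K = gammas.shape[0]
        g_top = softmax(gammas @ x)                    # g_j(x, gamma_j), eq. (9.25)
        total = 0.0
        for j in range(K):
            g_sub = softmax(gammas2[j] @ x)            # g_{l|j}(x, gamma_jl), eq. (9.26)
            for l in range(K):
                mean = betas[j, l] @ x                 # expert mean beta_jl^T x
                dens = np.exp(-0.5 * ((y - mean) / sigmas[j, l]) ** 2) \
                       / (np.sqrt(2 * np.pi) * sigmas[j, l])   # Gaussian density (9.28)
                total += g_top[j] * g_sub[l] * dens    # one term of eq. (9.30)
        return total

    # Hypothetical example: K = 2 branches, p = 3 inputs (first entry is 1 for
    # the intercept); the parameter values are random placeholders.
    rng = np.random.default_rng(0)
    x = np.array([1.0, 0.5, -1.2])
    p_y = hme_prob(0.3, x,
                   gammas=rng.normal(size=(2, 3)),
                   gammas2=rng.normal(size=(2, 2, 3)),
                   betas=rng.normal(size=(2, 2, 3)),
                   sigmas=np.ones((2, 2)))

The softmax form guarantees that each set of gating probabilities is positive and sums to one, which is what makes the splits "soft."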
To estimate the parameters, we maximize the log-likelihood of the data, Σ_i log Pr(y_i|x_i, Ψ), over the parameters in Ψ. The most convenient method for doing this is the EM algorithm, which we describe for mixtures in Section 8.5. We define latent variables Δ_j, all of which are zero except for a single one. We interpret these as the branching decisions made by the top-level gating network.
Similarly we define latent variables Δ_ℓ|j to describe the gating decisions at the second level.

In the E-step, the EM algorithm computes the expectations of the Δ_j and Δ_ℓ|j given the current values of the parameters. These expectations are then used as observation weights in the M-step of the procedure, to estimate the parameters in the expert networks. The parameters in the internal nodes are estimated by a version of multiple logistic regression. The expectations of the Δ_j or Δ_ℓ|j are probability profiles, and these are used as the response vectors for these logistic regressions.
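Continuing the hypothetical sketch above (same softmax helper and parameterization), the E-step expectations for a single observation can be computed as below; each entry r[j, ℓ] is the posterior probability that the observation was routed to expert (j, ℓ), and these quantities supply the weights for the M-step fits.

    def responsibilities(y, x, gammas, gammas2, betas, sigmas):
        """E-step for one observation: E[Delta_j * Delta_{l|j} | y, x, Psi].
        Each entry is proportional to g_j * g_{l|j} * Pr(y | x, theta_jl),
        normalized over all (j, l) pairs (Bayes' rule)."""
        K = gammas.shape[0]
        r = np.zeros((K, K))
        g_top = softmax(gammas @ x)
        for j in range(K):
            g_sub = softmax(gammas2[j] @ x)
            for l in range(K):
                mean = betas[j, l] @ x
                dens = np.exp(-0.5 * ((y - mean) / sigmas[j, l]) ** 2) \
                       / (np.sqrt(2 * np.pi) * sigmas[j, l])
                r[j, l] = g_top[j] * g_sub[l] * dens
        return r / r.sum()

In this sketch, summing r over ℓ gives the expectation of Δ_j (the probability profile used for the top-level logistic regression), and renormalizing r within branch j gives the expectation of Δ_ℓ|j for the second-level gating fits.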
The hierarchical mixtures of experts approach is a promising competitor to CART trees. By using soft splits rather than hard decision rules it can capture situations where the transition from low to high response is gradual. The log-likelihood is a smooth function of the unknown weights and hence is amenable to numerical optimization. The model is similar to CART with linear combination splits, but the latter is more difficult to optimize. On the other hand, to our knowledge there are no methods for finding a good tree topology for the HME model, as there are in CART.
Typically one uses a fixed tree of some depth, possibly the output of the CART procedure. The emphasis in the research on HMEs has been on prediction rather than interpretation of the final model. A close cousin of the HME is the latent class model (Lin et al., 2000), which typically has only one layer; here the nodes or latent classes are interpreted as groups of subjects that show similar response behavior.

9.6 Missing Data

It is quite common to have observations with missing values for one or more input features.
The usual approach is to impute (fill-in) the missing values in some way.

However, the first issue in dealing with the problem is determining whether the missing data mechanism has distorted the observed data. Roughly speaking, data are missing at random if the mechanism resulting in its omission is independent of its (unobserved) value. A more precise definition is given in Little and Rubin (2002). Suppose y is the response vector and X is the N × p matrix of inputs (some of which are missing). Denote by X_obs the observed entries in X and let Z = (y, X), Z_obs = (y, X_obs).
Finally, if R is an indicator matrix with ijth entry 1 if x_ij is missing and zero otherwise, then the data is said to be missing at random (MAR) if the distribution of R depends on the data Z only through Z_obs:

    \Pr(R|Z, \theta) = \Pr(R|Z_{\rm obs}, \theta).        (9.31)

Here θ are any parameters in the distribution of R. Data are said to be missing completely at random (MCAR) if the distribution of R doesn't depend on the observed or missing data:

    \Pr(R|Z, \theta) = \Pr(R|\theta).        (9.32)

MCAR is a stronger assumption than MAR: most imputation methods rely on MCAR for their validity.

For example, if a patient's measurement was not taken because the doctor felt he was too sick, that observation would not be MAR or MCAR.
In this case the missing data mechanism causes our observed training data to give a distorted picture of the true population, and data imputation is dangerous in this instance. Often the determination of whether features are MCAR must be made from information about the data collection process. For categorical features, one way to diagnose this problem is to code "missing" as an additional class. Then we fit our model to the training data and see if class "missing" is predictive of the response.
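As a small illustration of this diagnostic (on entirely invented data), the simulation below makes a categorical feature go missing depending on its own unobserved value, codes "missing" as an extra class, and compares mean responses across classes; a markedly different mean in the "missing" class signals that the data are not MCAR.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5000

    # A three-level categorical feature and a response that depends on it.
    x = rng.integers(0, 3, size=n)
    y = 0.5 * x + rng.normal(size=n)

    # Informative missingness: level 2 is dropped far more often than the others,
    # i.e., omission depends on the unobserved value itself (not MAR, not MCAR).
    missing = rng.random(n) < np.where(x == 2, 0.6, 0.05)

    # Code "missing" as an additional class (label 3) and compare mean responses.
    x_coded = np.where(missing, 3, x)
    for level in range(4):
        print(level, round(y[x_coded == level].mean(), 2))
    # The "missing" class has a mean close to that of level 2 and well above the
    # overall mean: evidence that the missingness mechanism is informative.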
Assuming the features are missing completely at random, there are a number of ways of proceeding:

1. Discard observations with any missing values.

2. Rely on the learning algorithm to deal with missing values in its training phase.

3. Impute all missing values before training.

Approach (1) can be used if the relative amount of missing data is small, but otherwise should be avoided. Regarding (2), CART is one learning algorithm that deals effectively with missing values, through surrogate splits (Section 9.2.4). MARS and PRIM use similar approaches. In generalized additive modeling, all observations missing for a given input feature are omitted when the partial residuals are smoothed against that feature in the backfitting algorithm, and their fitted values are set to zero.
Since the fitted curves have mean zero (when the model includes an intercept), this amounts to assigning the average fitted value to the missing observations.

For most learning methods, the imputation approach (3) is necessary. The simplest tactic is to impute the missing value with the mean or median of the nonmissing values for that feature. (Note that the above procedure for generalized additive models is analogous to this.)

If the features have at least some moderate degree of dependence, one can do better by estimating a predictive model for each feature given the other features and then imputing each missing value by its prediction from the model.
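Here is a minimal sketch of both tactics, assuming numpy arrays with NaN marking missing entries; the model-based imputer uses scikit-learn's DecisionTreeRegressor as one possible flexible "engine" (the helper names are ours, not a standard API).

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def mean_impute(X):
        """Simplest tactic: replace each NaN by the mean of the nonmissing
        values in its column."""
        X = X.copy()
        col_means = np.nanmean(X, axis=0)
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = col_means[cols]
        return X

    def tree_impute(X, j, max_depth=4):
        """Model-based tactic: predict the missing entries of column j from the
        other columns with a regression tree (one possible flexible engine)."""
        X = X.copy()
        miss = np.isnan(X[:, j])
        # Crudely fill the predictors themselves so the tree can be fit.
        others = np.delete(mean_impute(X), j, axis=1)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(others[~miss], X[~miss, j])
        X[miss, j] = tree.predict(others[miss])
        return X

In practice such an imputer would be cycled over all incomplete columns.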
In choosing the learning method for imputation of the features, one must remember that this choice is distinct from the method used for predicting y from X. Thus a flexible, adaptive method will often be preferred, even for the eventual purpose of carrying out a linear regression of y on X. In addition, if there are many missing feature values in the training set, the learning method must itself be able to deal with missing feature values. CART therefore is an ideal choice for this imputation "engine."

After imputation, missing values are typically treated as if they were actually observed.
This ignores the uncertainty due to the imputation, which will itself introduce additional uncertainty into estimates and predictions from the response model. One can measure this additional uncertainty by doing multiple imputations and hence creating many different training sets. The predictive model for y can be fit to each training set, and the variation across training sets can be assessed. If CART was used for the imputation engine, the multiple imputations could be done by sampling from the values in the corresponding terminal nodes.
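Continuing the previous sketch (same imports and mean_impute helper), one hypothetical way to draw such multiple imputations from a tree-based engine is to sample, for each missing entry, from the nonmissing values that fall in the same terminal node.

    def multiple_imputations(X, j, n_imputations=5, max_depth=4, rng=None):
        """Draw several completed copies of X, imputing column j by sampling
        from the nonmissing values in the same terminal node of a fitted tree."""
        if rng is None:
            rng = np.random.default_rng()
        miss = np.isnan(X[:, j])
        others = np.delete(mean_impute(X), j, axis=1)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(others[~miss], X[~miss, j])
        leaves = tree.apply(others)              # terminal node index of every row
        observed = X[~miss, j]
        completed = []
        for _ in range(n_imputations):
            Xc = X.copy()
            for i in np.where(miss)[0]:
                pool = observed[leaves[~miss] == leaves[i]]
                Xc[i, j] = rng.choice(pool)      # sample within the terminal node
            completed.append(Xc)
        return completed                         # fit the model for y to each copy

The spread of the response-model fits across the returned copies then gives a rough measure of the extra uncertainty due to imputation.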
9.7 Computational Considerations

With N observations and p predictors, additive model fitting requires some number mp of applications of a one-dimensional smoother or regression method. The required number of cycles m of the backfitting algorithm is usually less than 20 and often less than 10, and depends on the amount of correlation in the inputs. With cubic smoothing splines, for example, N log N operations are needed for an initial sort and N operations for the spline fit.
Hence the total operations for an additive model fit is pN log N + mpN.

Trees require pN log N operations for an initial sort for each predictor, and typically another pN log N operations for the split computations. If the splits occurred near the edges of the predictor ranges, this number could increase to N²p.

MARS requires Nm² + pmN operations to add a basis function to a model with m terms already present, from a pool of p predictors. Hence to build an M-term model requires NM³ + pM²N computations, which can be quite prohibitive if M is a reasonable fraction of N.

Each of the components of an HME is typically inexpensive to fit at each M-step: Np² for the regressions, and Np²K² for a K-class logistic regression.
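As a rough, order-of-magnitude comparison only (the sizes below are arbitrary), these operation counts can be plugged in numerically:

    import numpy as np

    N, p, m, M, K = 10_000, 20, 10, 50, 2   # illustrative sizes only

    costs = {
        "additive model": p * N * np.log(N) + m * p * N,   # pN log N + mpN
        "tree":           2 * p * N * np.log(N),           # sort plus split search
        "MARS":           N * M**3 + p * M**2 * N,         # building an M-term model
        "HME M-step":     N * p**2 + N * p**2 * K**2,      # regressions + gating fits
    }
    for name, ops in costs.items():
        print(f"{name:>15}: ~{ops:.3g} operations")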