The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Contents (continued)

15.2 Definition of Random Forests
15.3 Details of Random Forests
     15.3.1 Out of Bag Samples
     15.3.2 Variable Importance
     15.3.3 Proximity Plots
     15.3.4 Random Forests and Overfitting
15.4 Analysis of Random Forests
     15.4.1 Variance and the De-Correlation Effect
     15.4.2 Bias
     15.4.3 Adaptive Nearest Neighbors
Bibliographic Notes
Exercises

16 Ensemble Learning
16.1 Introduction
16.2 Boosting and Regularization Paths
     16.2.1 Penalized Regression
     16.2.2 The “Bet on Sparsity” Principle
     16.2.3 Regularization Paths, Over-fitting and Margins
16.3 Learning Ensembles
     16.3.1 Learning a Good Ensemble
     16.3.2 Rule Ensembles
Bibliographic Notes
Exercises

17 Undirected Graphical Models
17.1 Introduction
17.2 Markov Graphs and Their Properties
17.3 Undirected Graphical Models for Continuous Variables
     17.3.1 Estimation of the Parameters when the Graph Structure is Known
     17.3.2 Estimation of the Graph Structure
17.4 Undirected Graphical Models for Discrete Variables
     17.4.1 Estimation of the Parameters when the Graph Structure is Known
     17.4.2 Hidden Nodes
     17.4.3 Estimation of the Graph Structure
     17.4.4 Restricted Boltzmann Machines
Exercises

18 High-Dimensional Problems: p ≫ N
18.1 When p is Much Bigger than N
18.2 Diagonal Linear Discriminant Analysis and Nearest Shrunken Centroids
18.3 Linear Classifiers with Quadratic Regularization
     18.3.1 Regularized Discriminant Analysis
     18.3.2 Logistic Regression with Quadratic Regularization
     18.3.3 The Support Vector Classifier
     18.3.4 Feature Selection
     18.3.5 Computational Shortcuts When p ≫ N
18.4 Linear Classifiers with L1 Regularization
     18.4.1 Application of Lasso to Protein Mass Spectroscopy
     18.4.2 The Fused Lasso for Functional Data
18.5 Classification When Features are Unavailable
     18.5.1 Example: String Kernels and Protein Classification
     18.5.2 Classification and Other Models Using Inner-Product Kernels and Pairwise Distances
     18.5.3 Example: Abstracts Classification
18.6 High-Dimensional Regression: Supervised Principal Components
     18.6.1 Connection to Latent-Variable Modeling
     18.6.2 Relationship with Partial Least Squares
     18.6.3 Pre-Conditioning for Feature Selection
18.7 Feature Assessment and the Multiple-Testing Problem
     18.7.1 The False Discovery Rate
     18.7.2 Asymmetric Cutpoints and the SAM Procedure
     18.7.3 A Bayesian Interpretation of the FDR
18.8 Bibliographic Notes
Exercises

References
Author Index
Index


1 Introduction

Statistical learning plays a key role in many areas of science, finance and industry.
Here are some examples of learning problems:

• Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet and clinical measurements for that patient.

• Predict the price of a stock in 6 months from now, on the basis of company performance measures and economic data.

• Identify the numbers in a handwritten ZIP code, from a digitized image.

• Estimate the amount of glucose in the blood of a diabetic person, from the infrared absorption spectrum of that person’s blood.

• Identify the risk factors for prostate cancer, based on clinical and demographic variables.

The science of learning plays a key role in the fields of statistics, data mining and artificial intelligence, intersecting with areas of engineering and other disciplines.

This book is about learning from data.
In a typical scenario, we have an outcome measurement, usually quantitative (such as a stock price) or categorical (such as heart attack/no heart attack), that we wish to predict based on a set of features (such as diet and clinical measurements). We have a training set of data, in which we observe the outcome and feature measurements for a set of objects (such as people). Using this data we build a prediction model, or learner, which will enable us to predict the outcome for new unseen objects.

TABLE 1.1. Average percentage of words or characters in an email message equal to the indicated word or character. We have chosen the words and characters showing the largest difference between spam and email.

          george   you  your    hp  free   hpl     !   our    re   edu  remove
  spam      0.00  2.26  1.38  0.02  0.52  0.01  0.51  0.51  0.13  0.01    0.28
  email     1.27  1.27  0.44  0.90  0.07  0.43  0.11  0.18  0.42  0.29    0.01
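To make the caption’s selection criterion concrete, here is a minimal sketch (not from the book): it transcribes the Table 1.1 numbers and ranks the words and characters by the gap between their average percentages in spam and in email.

    # Minimal sketch: rank the Table 1.1 features by the spam-vs-email gap
    # in average percentage. The numbers below are copied from Table 1.1.
    table_1_1 = {
        #          (spam, email)
        "george": (0.00, 1.27),
        "you":    (2.26, 1.27),
        "your":   (1.38, 0.44),
        "hp":     (0.02, 0.90),
        "free":   (0.52, 0.07),
        "hpl":    (0.01, 0.43),
        "!":      (0.51, 0.11),
        "our":    (0.51, 0.18),
        "re":     (0.13, 0.42),
        "edu":    (0.01, 0.29),
        "remove": (0.28, 0.01),
    }

    # Sort by the absolute spam-email gap; large gaps suggest useful features.
    ranked = sorted(table_1_1.items(),
                    key=lambda kv: abs(kv[1][0] - kv[1][1]),
                    reverse=True)
    for word, (spam_pct, email_pct) in ranked:
        gap = abs(spam_pct - email_pct)
        print(f"{word:>7s}  spam={spam_pct:4.2f}  email={email_pct:4.2f}  gap={gap:4.2f}")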
A good learner is one that accurately predicts such an outcome.

The examples above describe what is called the supervised learning problem. It is called “supervised” because of the presence of the outcome variable to guide the learning process. In the unsupervised learning problem, we observe only the features and have no measurements of the outcome. Our task is rather to describe how the data are organized or clustered. We devote most of this book to supervised learning; the unsupervised problem is less developed in the literature, and is the focus of Chapter 14.

Here are some examples of real learning problems that are discussed in this book.

Example 1: Email Spam

The data for this example consists of information from 4601 email messages, in a study to try to predict whether the email was junk email, or “spam.” The objective was to design an automatic spam detector that could filter out spam before clogging the users’ mailboxes.
For all 4601 email messages, the true outcome (email type) email or spam is available, along with the relative frequencies of 57 of the most commonly occurring words and punctuation marks in the email message. This is a supervised learning problem, with the outcome the class variable email/spam. It is also called a classification problem.

Table 1.1 lists the words and characters showing the largest average difference between spam and email.

Our learning method has to decide which features to use and how: for example, we might use a rule such as

    if (%george < 0.6) & (%you > 1.5) then spam else email.

Another form of a rule might be:

    if (0.2 · %you − 0.3 · %george) > 0 then spam else email.
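Both rule forms translate directly into code. The following is a minimal sketch, not the book’s spam filter: the word_percentages helper and the sample message are illustrative assumptions, while the thresholds and weights are the ones quoted in the text; the real data set stores such relative frequencies for 57 words and punctuation marks per message.

    # Minimal sketch of the two example rules quoted in the text.
    def word_percentages(message, words):
        """Percentage of tokens in `message` equal to each word in `words`."""
        tokens = message.lower().split()
        n = max(len(tokens), 1)
        return {w: 100.0 * tokens.count(w) / n for w in words}

    def rule_threshold(pct):
        # if (%george < 0.6) & (%you > 1.5) then spam else email
        return "spam" if pct["george"] < 0.6 and pct["you"] > 1.5 else "email"

    def rule_linear(pct):
        # if (0.2 * %you - 0.3 * %george) > 0 then spam else email
        return "spam" if 0.2 * pct["you"] - 0.3 * pct["george"] > 0 else "email"

    # Illustrative message (assumed, not from the data set).
    message = ("you can claim your free prize now just reply "
               "to remove yourself later you winner")
    pct = word_percentages(message, ["george", "you"])
    print(pct)                  # {'george': 0.0, 'you': 13.33...}
    print(rule_threshold(pct))  # spam
    print(rule_linear(pct))     # spam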
[Figure: scatterplot matrix of the prostate cancer variables lpsa, lcavol, lweight, age, and lbph; only the panel labels and axis ticks survived the text extraction.]