The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Again one can enlarge the space using basis transformations, but this can lead to artificial separation through over-fitting. In Chapter 12 we discuss a more attractive alternative known as the support vector machine, which allows for overlap, but minimizes a measure of the extent of this overlap.

Bibliographic Notes

Good general texts on classification include Duda et al. (2000), Hand (1981), McLachlan (1992) and Ripley (1996). Mardia et al. (1979) have a concise discussion of linear discriminant analysis. Michie et al. (1994) compare a large number of popular classifiers on benchmark datasets. Linear separating hyperplanes are discussed in Vapnik (1996). Our account of the perceptron learning algorithm follows Ripley (1996).

Exercises

Ex. 4.1 Show how to solve the generalized eigenvalue problem $\max_a a^T B a$ subject to $a^T W a = 1$ by transforming to a standard eigenvalue problem.
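For concreteness, here is a minimal numerical sketch of that reduction, assuming $B$ is symmetric and $W$ is symmetric positive definite; the function name and the Cholesky route are illustrative choices, not the only way to do it.

```python
import numpy as np

def generalized_max(B, W):
    """Maximize a^T B a subject to a^T W a = 1 (sketch).

    With a Cholesky factor W = L L^T and the substitution b = L^T a, the problem
    becomes maximizing b^T (L^{-1} B L^{-T}) b subject to ||b|| = 1, a standard
    symmetric eigenproblem whose solution is the top eigenvector.
    """
    L = np.linalg.cholesky(W)                           # W = L L^T, L lower triangular
    M = np.linalg.solve(L, np.linalg.solve(L, B).T).T   # M = L^{-1} B L^{-T} (symmetric)
    vals, vecs = np.linalg.eigh(M)                      # eigenvalues in ascending order
    b = vecs[:, -1]                                     # eigenvector of the largest eigenvalue
    a = np.linalg.solve(L.T, b)                         # map back: a = L^{-T} b
    return vals[-1], a

# scipy.linalg.eigh(B, W) solves the generalized problem directly and should
# return the same largest eigenvalue and eigenvector (up to sign).
```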
Ex. 4.2 Suppose we have features $x \in \mathbb{R}^p$, a two-class response, with class sizes $N_1, N_2$, and the target coded as $-N/N_1$, $N/N_2$.

(a) Show that the LDA rule classifies to class 2 if
\[
x^T \hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1) > \tfrac{1}{2}(\hat\mu_2 + \hat\mu_1)^T \hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1) - \log(N_2/N_1),
\]
and class 1 otherwise.

(b) Consider minimization of the least squares criterion
\[
\sum_{i=1}^{N} (y_i - \beta_0 - x_i^T \beta)^2. \tag{4.55}
\]
Show that the solution $\hat\beta$ satisfies
\[
\bigl[(N-2)\hat\Sigma + N\hat\Sigma_B\bigr]\beta = N(\hat\mu_2 - \hat\mu_1) \tag{4.56}
\]
(after simplification), where $\hat\Sigma_B = \frac{N_1 N_2}{N^2}(\hat\mu_2 - \hat\mu_1)(\hat\mu_2 - \hat\mu_1)^T$.

(c) Hence show that $\hat\Sigma_B \beta$ is in the direction $(\hat\mu_2 - \hat\mu_1)$ and thus
\[
\hat\beta \propto \hat\Sigma^{-1}(\hat\mu_2 - \hat\mu_1). \tag{4.57}
\]
Therefore the least-squares regression coefficient is identical to the LDA coefficient, up to a scalar multiple.

(d) Show that this result holds for any (distinct) coding of the two classes.

(e) Find the solution $\hat\beta_0$ (up to the same scalar multiple as in (c)), and hence the predicted value $\hat f(x) = \hat\beta_0 + x^T \hat\beta$. Consider the following rule: classify to class 2 if $\hat f(x) > 0$ and class 1 otherwise. Show this is not the same as the LDA rule unless the classes have equal numbers of observations.

(Fisher, 1936; Ripley, 1996)
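The proportionality in (c) is easy to check numerically; here is a small simulation sketch in which the data-generating choices are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes with unequal sizes (all numbers are illustrative).
N1, N2, p = 40, 60, 3
N = N1 + N2
X = np.vstack([rng.normal(0.0, 1.0, size=(N1, p)),
               rng.normal(1.0, 1.0, size=(N2, p))])
y = np.concatenate([np.full(N1, -N / N1), np.full(N2, N / N2)])   # the -N/N1, N/N2 coding

# Least-squares fit of y on (1, x); drop the intercept to get beta-hat.
beta = np.linalg.lstsq(np.hstack([np.ones((N, 1)), X]), y, rcond=None)[0][1:]

# LDA direction: pooled within-class covariance (divisor N - 2) applied to the mean difference.
mu1, mu2 = X[:N1].mean(axis=0), X[N1:].mean(axis=0)
Sigma = ((X[:N1] - mu1).T @ (X[:N1] - mu1) +
         (X[N1:] - mu2).T @ (X[N1:] - mu2)) / (N - 2)
lda_dir = np.linalg.solve(Sigma, mu2 - mu1)

print(beta / lda_dir)   # the ratio is the same (up to rounding) in every coordinate
```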
Ex. 4.3 Suppose we transform the original predictors $X$ to $\hat Y$ via linear regression. In detail, let $\hat Y = X(X^T X)^{-1} X^T Y = X \hat B$, where $Y$ is the indicator response matrix. Similarly for any input $x \in \mathbb{R}^p$, we get a transformed vector $\hat y = \hat B^T x \in \mathbb{R}^K$. Show that LDA using $\hat Y$ is identical to LDA in the original space.

Ex. 4.4 Consider the multilogit model with $K$ classes (4.17). Let $\beta$ be the $(p+1)(K-1)$-vector consisting of all the coefficients. Define a suitably enlarged version of the input vector $x$ to accommodate this vectorized coefficient matrix. Derive the Newton-Raphson algorithm for maximizing the multinomial log-likelihood, and describe how you would implement this algorithm.
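One possible implementation is sketched below, with class $K$ as the reference and the $K-1$ coefficient vectors stacked row-wise; the fixed iteration count and the small ridge term added before solving are pragmatic assumptions rather than part of the exercise.

```python
import numpy as np

def multinomial_newton(X, g, K, n_iter=20, ridge=1e-8):
    """Newton-Raphson for the K-class multilogit model (sketch), class K-1 as reference.

    X : (N, p) inputs; g : (N,) integer labels in {0, ..., K-1}.
    Returns Theta, shape (K-1, p+1): one row of coefficients (intercept first)
    per non-reference class, i.e. the vectorized coefficient vector of Ex. 4.4.
    """
    N, p = X.shape
    Xs = np.hstack([np.ones((N, 1)), X])            # augmented inputs (1, x)
    Y = np.eye(K)[g][:, :K - 1]                     # indicator responses for classes 0..K-2
    Theta = np.zeros((K - 1, p + 1))
    for _ in range(n_iter):
        E = np.exp(Xs @ Theta.T)                    # (N, K-1) exponentiated linear predictors
        P = E / (1.0 + E.sum(axis=1, keepdims=True))            # fitted probabilities p_ik
        score = ((Y - P).T @ Xs).ravel()                        # gradient of the log-likelihood
        # Negative Hessian: block (k, l) is sum_i p_ik (delta_kl - p_il) x_i x_i^T.
        H = np.zeros(((K - 1) * (p + 1), (K - 1) * (p + 1)))
        for k in range(K - 1):
            for l in range(K - 1):
                w = P[:, k] * ((k == l) - P[:, l])
                H[k * (p + 1):(k + 1) * (p + 1),
                  l * (p + 1):(l + 1) * (p + 1)] = Xs.T @ (w[:, None] * Xs)
        step = np.linalg.solve(H + ridge * np.eye(H.shape[0]), score)
        Theta += step.reshape(K - 1, p + 1)         # beta_new = beta_old + H^{-1} score
    return Theta
```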
Ex. 4.5 Consider a two-class logistic regression problem with $x \in \mathbb{R}$. Characterize the maximum-likelihood estimates of the slope and intercept parameter if the sample $x_i$ for the two classes are separated by a point $x_0 \in \mathbb{R}$. Generalize this result to (a) $x \in \mathbb{R}^p$ (see Figure 4.16), and (b) more than two classes.

Ex. 4.6 Suppose we have $N$ points $x_i$ in $\mathbb{R}^p$ in general position, with class labels $y_i \in \{-1, 1\}$. Prove that the perceptron learning algorithm converges to a separating hyperplane in a finite number of steps:

(a) Denote a hyperplane by $f(x) = \beta_1^T x + \beta_0 = 0$, or in more compact notation $\beta^T x^* = 0$, where $x^* = (x, 1)$ and $\beta = (\beta_1, \beta_0)$. Let $z_i = x_i^* / \|x_i^*\|$. Show that separability implies the existence of a $\beta_{\mathrm{sep}}$ such that $y_i \beta_{\mathrm{sep}}^T z_i \ge 1$ for all $i$.

(b) Given a current $\beta_{\mathrm{old}}$, the perceptron algorithm identifies a point $z_i$ that is misclassified, and produces the update $\beta_{\mathrm{new}} \leftarrow \beta_{\mathrm{old}} + y_i z_i$. Show that $\|\beta_{\mathrm{new}} - \beta_{\mathrm{sep}}\|^2 \le \|\beta_{\mathrm{old}} - \beta_{\mathrm{sep}}\|^2 - 1$, and hence that the algorithm converges to a separating hyperplane in no more than $\|\beta_{\mathrm{start}} - \beta_{\mathrm{sep}}\|^2$ steps (Ripley, 1996).
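The updates in (b) are simple to write down in code; the sketch below uses the same $z_i = x_i^*/\|x_i^*\|$ normalization, and the cap on the number of updates and the choice of always correcting the first misclassified point are illustrative.

```python
import numpy as np

def perceptron(X, y, max_steps=10_000):
    """Perceptron learning as set up in Ex. 4.6 (sketch).

    X : (N, p) points, y : (N,) labels in {-1, +1}.
    """
    Z = np.hstack([X, np.ones((len(X), 1))])            # x*_i = (x_i, 1)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)    # z_i = x*_i / ||x*_i||
    beta = np.zeros(Z.shape[1])                         # beta = (beta_1, beta_0)
    for step in range(max_steps):
        miss = np.where(y * (Z @ beta) <= 0)[0]         # misclassified points (or on the boundary)
        if miss.size == 0:
            return beta, step                           # a separating hyperplane has been found
        i = miss[0]
        beta = beta + y[i] * Z[i]                       # the update beta_new <- beta_old + y_i z_i
    return beta, max_steps                              # the data may not be separable
```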
Ex. 4.7 Consider the criterion
\[
D^*(\beta, \beta_0) = -\sum_{i=1}^{N} y_i (x_i^T \beta + \beta_0), \tag{4.58}
\]
a generalization of (4.41) where we sum over all the observations. Consider minimizing $D^*$ subject to $\|\beta\| = 1$. Describe this criterion in words. Does it solve the optimal separating hyperplane problem?
Ex. 4.8 Consider the multivariate Gaussian model $X|G = k \sim N(\mu_k, \Sigma)$, with the additional restriction that $\mathrm{rank}\{\mu_k\}_1^K = L < \max(K-1, p)$. Derive the constrained MLEs for the $\mu_k$ and $\Sigma$. Show that the Bayes classification rule is equivalent to classifying in the reduced subspace computed by LDA (Hastie and Tibshirani, 1996b).

Ex. 4.9 Write a computer program to perform a quadratic discriminant analysis by fitting a separate Gaussian model per class. Try it out on the vowel data, and compute the misclassification error for the test data. The data can be found in the book website www-stat.stanford.edu/ElemStatLearn.
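A minimal sketch of the kind of program Ex. 4.9 asks for is given below; the class name, the plug-in covariance estimates, and the commented-out usage (which assumes the vowel training and test arrays have already been loaded) are all illustrative.

```python
import numpy as np

class QDA:
    """Quadratic discriminant analysis via one Gaussian per class (sketch for Ex. 4.9)."""

    def fit(self, X, g):
        self.classes_ = np.unique(g)
        self.params_ = []
        for k in self.classes_:
            Xk = X[g == k]
            mu = Xk.mean(axis=0)
            Sigma = np.cov(Xk, rowvar=False)                 # per-class covariance estimate
            logdet = np.linalg.slogdet(Sigma)[1]
            self.params_.append((mu, np.linalg.inv(Sigma), logdet, np.log(len(Xk) / len(X))))
        return self

    def predict(self, X):
        # Discriminant: -1/2 log|Sigma_k| - 1/2 (x - mu_k)^T Sigma_k^{-1} (x - mu_k) + log(pi_k).
        scores = []
        for mu, Sinv, logdet, logprior in self.params_:
            d = X - mu
            scores.append(-0.5 * logdet - 0.5 * np.einsum('ij,jk,ik->i', d, Sinv, d) + logprior)
        return self.classes_[np.argmax(np.vstack(scores), axis=0)]

# Hypothetical use on the vowel data, once X_train, g_train, X_test, g_test are loaded:
# test_error = np.mean(QDA().fit(X_train, g_train).predict(X_test) != g_test)
```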
5 Basis Expansions and Regularization

5.1 Introduction

We have already made use of models linear in the input features, both for regression and classification. Linear regression, linear discriminant analysis, logistic regression and separating hyperplanes all rely on a linear model. It is extremely unlikely that the true function $f(X)$ is actually linear in $X$.
In regression problems, $f(X) = \mathrm{E}(Y|X)$ will typically be nonlinear and nonadditive in $X$, and representing $f(X)$ by a linear model is usually a convenient, and sometimes a necessary, approximation. Convenient because a linear model is easy to interpret, and is the first-order Taylor approximation to $f(X)$. Sometimes necessary, because with $N$ small and/or $p$ large, a linear model might be all we are able to fit to the data without overfitting.
Likewise in classification, a linear, Bayes-optimal decision boundary implies that some monotone transformation of $\Pr(Y = 1|X)$ is linear in $X$. This is inevitably an approximation.

In this chapter and the next we discuss popular methods for moving beyond linearity. The core idea in this chapter is to augment/replace the vector of inputs $X$ with additional variables, which are transformations of $X$, and then use linear models in this new space of derived input features.
Denote by $h_m(X) : \mathbb{R}^p \mapsto \mathbb{R}$ the $m$th transformation of $X$, $m = 1, \ldots, M$. We then model
\[
f(X) = \sum_{m=1}^{M} \beta_m h_m(X), \tag{5.1}
\]
a linear basis expansion in $X$. The beauty of this approach is that once the basis functions $h_m$ have been determined, the models are linear in these new variables, and the fitting proceeds as before.

Some simple and widely used examples of the $h_m$ are the following:

• $h_m(X) = X_m$, $m = 1, \ldots, p$ recovers the original linear model.

• $h_m(X) = X_j^2$ or $h_m(X) = X_j X_k$ allows us to augment the inputs with polynomial terms to achieve higher-order Taylor expansions. Note, however, that the number of variables grows exponentially in the degree of the polynomial. A full quadratic model in $p$ variables requires $O(p^2)$ square and cross-product terms, or more generally $O(p^d)$ for a degree-$d$ polynomial.

• $h_m(X) = \log(X_j)$, $\sqrt{X_j}, \ldots$ permits other nonlinear transformations of single inputs. More generally one can use similar functions involving several inputs, such as $h_m(X) = \|X\|$.
• $h_m(X) = I(L_m \le X_k < U_m)$, an indicator for a region of $X_k$. Breaking the range of $X_k$ up into $M_k$ such nonoverlapping regions results in a model with a piecewise constant contribution for $X_k$.
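To make these examples concrete, here is a sketch that builds a few such basis functions by hand for a hypothetical input matrix with at least three columns; the particular transformations, regions and column choices are arbitrary illustrations.

```python
import numpy as np

def basis_expand(X):
    """Hand-built examples of basis functions h_m(X) (illustrative choices only)."""
    X = np.asarray(X, dtype=float)        # assumes at least 3 columns, with X[:, 1] > 0
    cols = [np.ones(len(X))]                                   # intercept
    cols += [X[:, j] for j in range(X.shape[1])]               # h_m(X) = X_m: the linear model
    cols += [X[:, 0] ** 2, X[:, 0] * X[:, 1]]                  # squares and cross-products
    cols += [np.log(X[:, 1]), np.sqrt(X[:, 1])]                # nonlinear transforms of one input
    cols += [np.linalg.norm(X, axis=1)]                        # a function of several inputs, ||X||
    cols += [((lo <= X[:, 2]) & (X[:, 2] < hi)).astype(float)  # indicators I(L_m <= X_k < U_m)
             for lo, hi in [(0.0, 1.0), (1.0, 2.0)]]
    return np.column_stack(cols)

# Fitting (5.1) is then ordinary least squares on the derived inputs, e.g.
# beta_hat = np.linalg.lstsq(basis_expand(X), y, rcond=None)[0]
```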
Sometimes the problem at hand will call for particular basis functions $h_m$, such as logarithms or power functions. More often, however, we use the basis expansions as a device to achieve more flexible representations for $f(X)$. Polynomials are an example of the latter, although they are limited by their global nature: tweaking the coefficients to achieve a functional form in one region can cause the function to flap about madly in remote regions. In this chapter we consider more useful families of piecewise-polynomials and splines that allow for local polynomial representations. We also discuss the wavelet bases, especially useful for modeling signals and images.

These methods produce a dictionary $\mathcal{D}$ consisting of typically a very large number $|\mathcal{D}|$ of basis functions, far more than we can afford to fit to our data. Along with the dictionary we require a method for controlling the complexity of our model, using basis functions from the dictionary. There are three common approaches:

• Restriction methods, where we decide beforehand to limit the class of functions. Additivity is an example (see the sketch following this list), where we assume that our model has the form
\[
f(X) = \sum_{j=1}^{p} f_j(X_j) = \sum_{j=1}^{p} \sum_{m=1}^{M_j} \beta_{jm} h_{jm}(X_j). \tag{5.2}
\]
The size of the model is limited by the number of basis functions $M_j$ used for each component function $f_j$.
• Selection methods, which adaptively scan the dictionary and include only those basis functions $h_m$ that contribute significantly to the fit of the model. Here the variable selection techniques discussed in Chapter 3 are useful. The stagewise greedy approaches such as CART, MARS and boosting fall into this category as well.

• Regularization methods where we use the entire dictionary but restrict the coefficients. Ridge regression is a simple example of a regularization approach, while the lasso is both a regularization and selection method. Here we discuss these and more sophisticated methods for regularization.
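As an illustration of the restriction-by-additivity idea in (5.2), the sketch below gives each coordinate its own small polynomial basis and fits the whole additive model as one linear model; the choice of a polynomial basis and its degree are assumptions made purely for the example.

```python
import numpy as np

def additive_basis(X, degree=3):
    """Additive restriction as in (5.2): h_{jm}(X_j) = X_j^m, m = 1, ..., degree (sketch)."""
    X = np.asarray(X, dtype=float)
    blocks = [np.ones((len(X), 1))]                    # common intercept
    for j in range(X.shape[1]):                        # one block of basis functions per X_j
        blocks.append(np.column_stack([X[:, j] ** m for m in range(1, degree + 1)]))
    return np.hstack(blocks)

# The additive model is then fit as a single linear model in the stacked basis:
# beta_hat = np.linalg.lstsq(additive_basis(X), y, rcond=None)[0]
```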
5.2 Piecewise Polynomials and Splines

We assume until Section 5.7 that $X$ is one-dimensional. A piecewise polynomial function $f(X)$ is obtained by dividing the domain of $X$ into contiguous intervals, and representing $f$ by a separate polynomial in each interval. Figure 5.1 shows two simple piecewise polynomials. The first is piecewise constant, with three basis functions:
\[
h_1(X) = I(X < \xi_1), \quad h_2(X) = I(\xi_1 \le X < \xi_2), \quad h_3(X) = I(\xi_2 \le X).
\]
Since these are positive over disjoint regions, the least squares estimate of the model $f(X) = \sum_{m=1}^{3} \beta_m h_m(X)$ amounts to $\hat\beta_m = \bar Y_m$, the mean of $Y$ in the $m$th region.

The top right panel shows a piecewise linear fit. Three additional basis functions are needed: $h_{m+3} = h_m(X)X$, $m = 1, \ldots, 3$.
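A quick numerical sketch of the claim that least squares on these three indicators reproduces the region means; the knots, sample size and data-generating function are arbitrary, and the last line adds the extra columns used by the piecewise linear fit.

```python
import numpy as np

rng = np.random.default_rng(1)

xi1, xi2 = 0.33, 0.66                                   # illustrative knots
X = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * X) + rng.normal(scale=0.3, size=200)

H = np.column_stack([X < xi1,
                     (xi1 <= X) & (X < xi2),
                     xi2 <= X]).astype(float)           # h_1, h_2, h_3
beta_hat = np.linalg.lstsq(H, y, rcond=None)[0]

region_means = [y[X < xi1].mean(), y[(xi1 <= X) & (X < xi2)].mean(), y[xi2 <= X].mean()]
print(np.allclose(beta_hat, region_means))              # True: beta_m-hat equals Y-bar_m

H_linear = np.column_stack([H, H * X[:, None]])         # adds h_{m+3}(X) = h_m(X) * X
```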