Hebb (1949) heavily influenced the development of learning algorithms. The resurgence of neural networks in the mid 1980s was due to Werbos (1974), Parker (1985) and Rumelhart et al. (1986), who proposed the back-propagation algorithm. Today there are many books written on the topic, for a broad range of audiences. For readers of this book, Hertz et al. (1991), Bishop (1995) and Ripley (1996) may be the most informative. Bayesian learning for neural networks is described in Neal (1996). The ZIP code example was taken from Le Cun (1989); see also Le Cun et al.
(1990) and Le Cun et al. (1998). We do not discuss theoretical topics such as the approximation properties of neural networks, for example the work of Barron (1993), Girosi et al. (1995) and Jones (1992). Some of these results are summarized by Ripley (1996).

Exercises

Ex. 11.1 Establish the exact correspondence between the projection pursuit regression model (11.1) and the neural network (11.5).
In particular, show that the single-layer regression network is equivalent to a PPR model with $g_m(\omega_m^T x) = \beta_m \sigma(\alpha_{0m} + s_m(\omega_m^T x))$, where $\omega_m$ is the $m$th unit vector. Establish a similar equivalence for a classification network.

Ex. 11.2 Consider a neural network for a quantitative outcome as in (11.5), using squared-error loss and identity output function $g_k(t) = t$. Suppose that the weights $\alpha_m$ from the input to hidden layer are nearly zero.
Show that the resulting model is nearly linear in the inputs.

Ex. 11.3 Derive the forward and backward propagation equations for the cross-entropy loss function.

Ex. 11.4 Consider a neural network for a K-class outcome that uses cross-entropy loss. If the network has no hidden layer, show that the model is equivalent to the multinomial logistic model described in Chapter 4.

Ex. 11.5

(a) Write a program to fit a single hidden layer neural network (ten hidden units) via back-propagation and weight decay.

(b) Apply it to 100 observations from the model
$$Y = \sigma(a_1^T X) + (a_2^T X)^2 + 0.30 \cdot Z,$$
where $\sigma$ is the sigmoid function, $Z$ is standard normal, $X^T = (X_1, X_2)$, each $X_j$ being independent standard normal, and $a_1 = (3, 3)$, $a_2 = (3, -3)$. Generate a test sample of size 1000, and plot the training and test error curves as a function of the number of training epochs, for different values of the weight decay parameter.
Discuss the overfitting behavior in each case. (A sketch of one way to simulate these data and fit such a network is given after Ex. 11.6 below.)

(c) Vary the number of hidden units in the network, from 1 up to 10, and determine the minimum number needed to perform well for this task.

Ex. 11.6 Write a program to carry out projection pursuit regression, using cubic smoothing splines with fixed degrees of freedom. Fit it to the data from the previous exercise, for various values of the smoothing parameter and number of model terms. Find the minimum number of model terms necessary for the model to perform well and compare this to the number of hidden units from the previous exercise.
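As a starting point for Exercises 11.5 and 11.6, the following is a minimal sketch (not part of the text) of simulating the data in Ex. 11.5(b) and fitting a ten-hidden-unit network. It assumes NumPy and scikit-learn are available and uses MLPRegressor's alpha penalty in place of a hand-written weight-decay term; the exercise itself asks for a back-propagation implementation, and tracing error against training epochs would require replacing the single fit call with an explicit training loop (for example via partial_fit).

import numpy as np
from sklearn.neural_network import MLPRegressor

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def simulate(n, rng):
    # Model of Ex. 11.5(b): Y = sigma(a1'X) + (a2'X)^2 + 0.30 Z
    a1, a2 = np.array([3.0, 3.0]), np.array([3.0, -3.0])
    X = rng.standard_normal((n, 2))        # X1, X2 independent standard normal
    Z = rng.standard_normal(n)             # noise term
    return X, sigmoid(X @ a1) + (X @ a2) ** 2 + 0.30 * Z

rng = np.random.default_rng(0)
X_train, y_train = simulate(100, rng)      # training sample of size 100
X_test, y_test = simulate(1000, rng)       # test sample of size 1000

for alpha in [0.0, 0.001, 0.01, 0.1]:      # candidate weight-decay values
    net = MLPRegressor(hidden_layer_sizes=(10,), activation="logistic",
                       alpha=alpha, solver="sgd", learning_rate_init=0.01,
                       max_iter=2000, random_state=0)
    net.fit(X_train, y_train)
    train_mse = np.mean((net.predict(X_train) - y_train) ** 2)
    test_mse = np.mean((net.predict(X_test) - y_test) ** 2)
    print(f"weight decay {alpha}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")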
Ex. 11.7 Fit a neural network to the spam data of Section 9.1.2, and compare the results to those for the additive model given in that chapter. Compare both the classification performance and interpretability of the final model.

12 Support Vector Machines and Flexible Discriminants

12.1 Introduction

In this chapter we describe generalizations of linear decision boundaries for classification. Optimal separating hyperplanes are introduced in Chapter 4 for the case when two classes are linearly separable.
Here we cover extensions to the nonseparable case, where the classes overlap. These techniques are then generalized to what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space.
The second set of methods generalize Fisher's linear discriminant analysis (LDA). The generalizations include flexible discriminant analysis, which facilitates construction of nonlinear boundaries in a manner very similar to the support vector machines; penalized discriminant analysis, for problems such as signal and image classification where the large number of features are highly correlated; and mixture discriminant analysis, for irregularly shaped classes.

12.2 The Support Vector Classifier

In Chapter 4 we discussed a technique for constructing an optimal separating hyperplane between two perfectly separated classes. We review this and generalize to the nonseparable case, where the classes may not be separable by a linear boundary.
FIGURE 12.1. Support vector classifiers. The left panel shows the separable case. The decision boundary is the solid line, while broken lines bound the shaded maximal margin of width $2M = 2/\|\beta\|$. The right panel shows the nonseparable (overlap) case. The points labeled $\xi_j^*$ are on the wrong side of their margin by an amount $\xi_j^* = M\xi_j$; points on the correct side have $\xi_j^* = 0$. The margin is maximized subject to a total budget $\sum \xi_i \le$ constant. Hence $\sum \xi_j^*$ is the total distance of points on the wrong side of their margin.

Our training data consists of N pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, with $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$. Define a hyperplane by
$$\{x : f(x) = x^T\beta + \beta_0 = 0\}, \tag{12.1}$$
where $\beta$ is a unit vector: $\|\beta\| = 1$. A classification rule induced by $f(x)$ is
$$G(x) = \mathrm{sign}[x^T\beta + \beta_0]. \tag{12.2}$$
The geometry of hyperplanes is reviewed in Section 4.5, where we show that $f(x)$ in (12.1) gives the signed distance from a point $x$ to the hyperplane $f(x) = x^T\beta + \beta_0 = 0$. Since the classes are separable, we can find a function $f(x) = x^T\beta + \beta_0$ with $y_i f(x_i) > 0 \; \forall i$. Hence we are able to find the hyperplane that creates the biggest margin between the training points for class 1 and $-1$ (see Figure 12.1).
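As a small numerical illustration of (12.1) and (12.2) (not from the text; the values of $\beta$ and $\beta_0$ below are made up), the signed distance and the induced classification rule are simply:

import numpy as np

beta = np.array([3.0, 4.0]) / 5.0     # a unit vector, so ||beta|| = 1 as in (12.1)
beta_0 = -1.0

def f(x):
    # Signed distance from x to the hyperplane {x : x'beta + beta_0 = 0}
    return x @ beta + beta_0

def G(x):
    # Classification rule (12.2)
    return np.sign(f(x))

x = np.array([2.0, 1.0])
print(f(x), G(x))                     # 1.0 and 1.0: x lies one unit on the positive side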
The optimization problem
$$\max_{\beta, \beta_0, \|\beta\|=1} M \quad \text{subject to } y_i(x_i^T\beta + \beta_0) \ge M, \; i = 1, \ldots, N, \tag{12.3}$$
captures this concept. The band in the figure is $M$ units away from the hyperplane on either side, and hence $2M$ units wide. It is called the margin. We showed that this problem can be more conveniently rephrased as
$$\min_{\beta, \beta_0} \|\beta\| \quad \text{subject to } y_i(x_i^T\beta + \beta_0) \ge 1, \; i = 1, \ldots, N, \tag{12.4}$$
where we have dropped the norm constraint on $\beta$.
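The equivalence of (12.3) and (12.4) is established in Section 4.5.2; as a reminder, here is a brief sketch of that step (our paraphrase, not the book's wording). For general (not necessarily unit-norm) $\beta$, the margin conditions in (12.3) can be written in terms of signed distances as
$$\frac{1}{\|\beta\|}\, y_i(x_i^T\beta + \beta_0) \ge M, \qquad i = 1, \ldots, N.$$
These constraints are unaffected if $(\beta, \beta_0)$ is rescaled by any positive factor, so we may fix the scale by setting $\|\beta\| = 1/M$. The constraints then become $y_i(x_i^T\beta + \beta_0) \ge 1$, and maximizing $M = 1/\|\beta\|$ is the same as minimizing $\|\beta\|$, which gives (12.4).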
Note that $M = 1/\|\beta\|$. Expression (12.4) is the usual way of writing the support vector criterion for separated data. This is a convex optimization problem (quadratic criterion, linear inequality constraints), and the solution is characterized in Section 4.5.2.

Suppose now that the classes overlap in feature space. One way to deal with the overlap is to still maximize $M$, but allow for some points to be on the wrong side of the margin. Define the slack variables $\xi = (\xi_1, \xi_2, \ldots, \xi_N)$.
There are two natural ways to modify the constraint in (12.3):
$$y_i(x_i^T\beta + \beta_0) \ge M - \xi_i, \tag{12.5}$$
or
$$y_i(x_i^T\beta + \beta_0) \ge M(1 - \xi_i), \tag{12.6}$$
$\forall i$, with $\xi_i \ge 0$ and $\sum_{i=1}^N \xi_i \le$ constant. The two choices lead to different solutions. The first choice seems more natural, since it measures overlap in actual distance from the margin; the second choice measures the overlap in relative distance, which changes with the width of the margin $M$. However, the first choice results in a nonconvex optimization problem, while the second is convex; thus (12.6) leads to the “standard” support vector classifier, which we use from here on.

Here is the idea of the formulation.
The value $\xi_i$ in the constraint $y_i(x_i^T\beta + \beta_0) \ge M(1 - \xi_i)$ is the proportional amount by which the prediction $f(x_i) = x_i^T\beta + \beta_0$ is on the wrong side of its margin. Hence by bounding the sum $\sum \xi_i$, we bound the total proportional amount by which predictions fall on the wrong side of their margin. Misclassifications occur when $\xi_i > 1$, so bounding $\sum \xi_i$ at a value $K$ say, bounds the total number of training misclassifications at $K$.

As in (4.48) in Section 4.5.2, we can drop the norm constraint on $\beta$, define $M = 1/\|\beta\|$, and write (12.4) in the equivalent form
$$\min \|\beta\| \quad \text{subject to } \begin{cases} y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i \; \forall i, \\ \xi_i \ge 0, \; \sum \xi_i \le \text{constant}. \end{cases} \tag{12.7}$$
This is the usual way the support vector classifier is defined for the nonseparable case. However we find confusing the presence of the fixed scale “1” in the constraint $y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i$, and prefer to start with (12.6). The right panel of Figure 12.1 illustrates this overlapping case.

By the nature of the criterion (12.7), we see that points well inside their class boundary do not play a big role in shaping the boundary.
This seemslike an attractive property, and one that differentiates it from linear discriminant analysis (Section 4.3). In LDA, the decision boundary is determined by the covariance of the class distributions and the positions of theclass centroids. We will see in Section 12.3.3 that logistic regression is moresimilar to the support vector classifier in this regard.42012. Flexible Discriminants12.2.1 Computing the Support Vector ClassifierThe problem (12.7) is quadratic with linear inequality constraints, hence itis a convex optimization problem. We describe a quadratic programmingsolution using Lagrange multipliers.
Computationally it is convenient to re-express (12.7) in the equivalent form
$$\min_{\beta, \beta_0} \; \tfrac{1}{2}\|\beta\|^2 + C\sum_{i=1}^N \xi_i \quad \text{subject to } \xi_i \ge 0, \; y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i \; \forall i, \tag{12.8}$$
where the “cost” parameter $C$ replaces the constant in (12.7); the separable case corresponds to $C = \infty$.

The Lagrange (primal) function is
$$L_P = \tfrac{1}{2}\|\beta\|^2 + C\sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i[y_i(x_i^T\beta + \beta_0) - (1 - \xi_i)] - \sum_{i=1}^N \mu_i \xi_i, \tag{12.9}$$
which we minimize w.r.t. $\beta$, $\beta_0$ and $\xi_i$. Setting the respective derivatives to zero, we get
$$\beta = \sum_{i=1}^N \alpha_i y_i x_i, \tag{12.10}$$
$$0 = \sum_{i=1}^N \alpha_i y_i, \tag{12.11}$$
$$\alpha_i = C - \mu_i, \; \forall i, \tag{12.12}$$
as well as the positivity constraints $\alpha_i, \mu_i, \xi_i \ge 0 \; \forall i$.
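As a rough numerical companion to (12.8)–(12.12) (not part of the text), the sketch below assumes scikit-learn's SVC, whose cost parameter C plays the role of C in (12.8) when a linear kernel is used, and checks the stationarity condition (12.10): the fitted coefficient vector equals $\sum_i \alpha_i y_i x_i$, with $\alpha_i > 0$ only for the support vectors.

import numpy as np
from sklearn.svm import SVC

# Two overlapping classes in IR^2 (synthetic data, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(100) > 0, 1, -1)

svc = SVC(kernel="linear", C=1.0).fit(X, y)

beta = svc.coef_.ravel()                    # fitted beta
beta_0 = svc.intercept_[0]                  # fitted beta_0
M = 1.0 / np.linalg.norm(beta)              # margin half-width, cf. M = 1/||beta||

# dual_coef_ stores alpha_i * y_i for the support vectors, so (12.10) reads:
beta_from_dual = svc.dual_coef_.ravel() @ svc.support_vectors_
print(M, np.allclose(beta, beta_from_dual))  # margin, and True up to numerical tolerance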