The Elements of Statistical Learning: Data Mining, Inference, and Prediction (811377), page 85


In Section 4.2 we discuss other problems with linear activation functions, in particular potentially severe masking effects.

The units in the middle of the network, computing the derived features $Z_m$, are called hidden units because the values $Z_m$ are not directly observed. In general there can be more than one hidden layer, as illustrated in the example at the end of this chapter. We can think of the $Z_m$ as a basis expansion of the original inputs $X$; the neural network is then a standard linear model, or linear multilogit model, using these transformations as inputs. There is, however, an important enhancement over the basis expansion techniques discussed in Chapter 5; here the parameters of the basis functions are learned from the data.

FIGURE 11.3. Plot of the sigmoid function $\sigma(v) = 1/(1 + \exp(-v))$ (red curve), commonly used in the hidden layer of a neural network; the horizontal axis is $v$ and the vertical axis is $1/(1 + e^{-v})$. Included are $\sigma(sv)$ for $s = 1/2$ (blue curve) and $s = 10$ (purple curve). The scale parameter $s$ controls the activation rate, and we can see that large $s$ amounts to a hard activation at $v = 0$. Note that $\sigma(s(v - v_0))$ shifts the activation threshold from $0$ to $v_0$.

Notice that if $\sigma$ is the identity function, then the entire model collapses to a linear model in the inputs.
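The effect of the scale $s$ and the shift $v_0$ described in the caption of Figure 11.3 can be checked numerically. The following is a minimal sketch; the function names `sigmoid` and `scaled_sigmoid` are illustrative choices, not names from the text.

```python
import math

def sigmoid(v):
    """Logistic sigmoid 1 / (1 + exp(-v)), arranged to avoid overflow for large |v|."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def scaled_sigmoid(v, s=1.0, v0=0.0):
    """sigma(s * (v - v0)): s sets the activation rate, v0 the activation threshold."""
    return sigmoid(s * (v - v0))

# Tabulate the three curves of Figure 11.3 (s = 1/2, 1, 10) on a few points.
for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(v, [round(scaled_sigmoid(v, s), 4) for s in (0.5, 1.0, 10.0)])
```

With $s = 10$ the values jump from near 0 to near 1 within a short interval around $v = 0$, while with $s = 1/2$ the transition is gradual.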

Hence a neural network can be thought of as a nonlinear generalization of the linear model, both for regression and classification. By introducing the nonlinear transformation $\sigma$, it greatly enlarges the class of linear models. In Figure 11.3 we see that the rate of activation of the sigmoid depends on the norm of $\alpha_m$, and if $\|\alpha_m\|$ is very small, the unit will indeed be operating in the linear part of its activation function.

Notice also that the neural network model with one hidden layer has exactly the same form as the projection pursuit model described above. The difference is that the PPR model uses nonparametric functions $g_m(v)$, while the neural network uses a far simpler function based on $\sigma(v)$, with three free parameters in its argument. In detail, viewing the neural network model as a PPR model, we identify

$$
g_m(\omega_m^T X) = \beta_m \sigma(\alpha_{0m} + \alpha_m^T X) = \beta_m \sigma(\alpha_{0m} + \|\alpha_m\| (\omega_m^T X)), \tag{11.7}
$$

where $\omega_m = \alpha_m / \|\alpha_m\|$ is the $m$th unit-vector. Since $\sigma_{\beta,\alpha_0,s}(v) = \beta\sigma(\alpha_0 + sv)$ has lower complexity than a more general nonparametric $g(v)$, it is not surprising that a neural network might use 20 or 100 such functions, while the PPR model typically uses fewer terms ($M = 5$ or $10$, for example).

Finally, we note that the name “neural networks” derives from the fact that they were first developed as models for the human brain.
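The identification in (11.7) is an algebraic identity, and a small numerical check makes it concrete. This is only a sketch: the helper names `hidden_unit_term` and `ridge_form` and the example numbers are made up for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def hidden_unit_term(x, alpha0, alpha, beta):
    """beta * sigma(alpha0 + alpha^T x): the contribution of one hidden unit."""
    return beta * sigmoid(alpha0 + sum(a * xi for a, xi in zip(alpha, x)))

def ridge_form(x, alpha0, alpha, beta):
    """The same term written as a ridge function g(omega^T x), omega = alpha / ||alpha||."""
    norm = math.sqrt(sum(a * a for a in alpha))   # scale s = ||alpha||
    omega = [a / norm for a in alpha]             # unit direction vector
    v = sum(w * xi for w, xi in zip(omega, x))    # projection omega^T x
    return beta * sigmoid(alpha0 + norm * v)      # g(v) = beta * sigma(alpha0 + s * v)

x = [0.3, -1.2, 2.0]
print(hidden_unit_term(x, 0.1, [1.5, -0.5, 0.25], 2.0))
print(ridge_form(x, 0.1, [1.5, -0.5, 0.25], 2.0))
```

The two printed values agree up to floating-point rounding, which is the sense in which each hidden unit is a PPR ridge function with three free parameters ($\beta$, $\alpha_0$, $s$).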

Each unit represents a neuron, and the connections (links in Figure 11.2) represent synapses. In early models, the neurons fired when the total signal passed to that unit exceeded a certain threshold. In the model above, this corresponds to use of a step function for $\sigma(Z)$ and $g_m(T)$. Later the neural network was recognized as a useful tool for nonlinear statistical modeling, and for this purpose the step function is not smooth enough for optimization. Hence the step function was replaced by a smoother threshold function, the sigmoid in Figure 11.3.

11.4 Fitting Neural Networks

The neural network model has unknown parameters, often called weights, and we seek values for them that make the model fit the training data well. We denote the complete set of weights by $\theta$, which consists of

$$
\begin{aligned}
\{\alpha_{0m}, \alpha_m;\ m = 1, 2, \ldots, M\} &\quad M(p + 1) \text{ weights},\\
\{\beta_{0k}, \beta_k;\ k = 1, 2, \ldots, K\} &\quad K(M + 1) \text{ weights}.
\end{aligned} \tag{11.8}
$$

For regression, we use sum-of-squared errors as our measure of fit (error function)

$$
R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} (y_{ik} - f_k(x_i))^2. \tag{11.9}
$$

For classification we use either squared error or cross-entropy (deviance):

$$
R(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(x_i), \tag{11.10}
$$

and the corresponding classifier is $G(x) = \operatorname{argmax}_k f_k(x)$.
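The weight count in (11.8) and the two error functions (11.9) and (11.10) translate directly into code. The sketch below assumes the targets `Y` and fitted values `F` are given as lists of length-$K$ rows, one row per observation, and uses the usual convention that terms with $y_{ik} = 0$ contribute nothing to the cross-entropy; all function names are illustrative.

```python
import math

def n_weights(p, M, K):
    """Total number of weights in (11.8): M(p + 1) + K(M + 1)."""
    return M * (p + 1) + K * (M + 1)

def sum_of_squares(Y, F):
    """Sum-of-squared errors (11.9) over observations i and outputs k."""
    return sum((y - f) ** 2 for yi, fi in zip(Y, F) for y, f in zip(yi, fi))

def cross_entropy(Y, F):
    """Cross-entropy (11.10); terms with y_ik = 0 are skipped (0 * log f = 0)."""
    return -sum(y * math.log(f)
                for yi, fi in zip(Y, F) for y, f in zip(yi, fi) if y != 0)

def classify(f):
    """G(x) = argmax_k f_k(x), returned as a 0-based class index."""
    return max(range(len(f)), key=lambda k: f[k])

Y = [[1, 0], [0, 1]]            # one-hot targets for two observations
F = [[0.8, 0.2], [0.4, 0.6]]    # fitted class probabilities
print(n_weights(p=2, M=3, K=2), sum_of_squares(Y, F), cross_entropy(Y, F))
print([classify(f) for f in F])
```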

With the softmax activation function and the cross-entropy error function, the neural network model is exactly a linear logistic regression model in the hidden units, and all the parameters are estimated by maximum likelihood.

Typically we don’t want the global minimizer of $R(\theta)$, as this is likely to be an overfit solution. Instead some regularization is needed: this is achieved directly through a penalty term, or indirectly by early stopping. Details are given in the next section.

The generic approach to minimizing $R(\theta)$ is by gradient descent, called back-propagation in this setting.

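Because of the compositional form of the model, the gradient can be easily derived using the chain rule for differentiation. This can be computed by a forward and backward sweep over the network, keeping track only of quantities local to each unit.

Here is back-propagation in detail for squared error loss. Let $z_{mi} = \sigma(\alpha_{0m} + \alpha_m^T x_i)$, from (11.5), and let $z_i = (z_{1i}, z_{2i}, \ldots, z_{Mi})$. Then we have

$$
R(\theta) \equiv \sum_{i=1}^{N} R_i = \sum_{i=1}^{N} \sum_{k=1}^{K} (y_{ik} - f_k(x_i))^2, \tag{11.11}
$$

with derivatives

$$
\begin{aligned}
\frac{\partial R_i}{\partial \beta_{km}} &= -2(y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)\, z_{mi},\\
\frac{\partial R_i}{\partial \alpha_{m\ell}} &= -\sum_{k=1}^{K} 2(y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)\, \beta_{km}\, \sigma'(\alpha_m^T x_i)\, x_{i\ell}.
\end{aligned} \tag{11.12}
$$

Given these derivatives, a gradient descent update at the $(r + 1)$st iteration has the form

$$
\begin{aligned}
\beta_{km}^{(r+1)} &= \beta_{km}^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \beta_{km}^{(r)}},\\
\alpha_{m\ell}^{(r+1)} &= \alpha_{m\ell}^{(r)} - \gamma_r \sum_{i=1}^{N} \frac{\partial R_i}{\partial \alpha_{m\ell}^{(r)}},
\end{aligned} \tag{11.13}
$$

where $\gamma_r$ is the learning rate, discussed below.

Now write (11.12) as

$$
\frac{\partial R_i}{\partial \beta_{km}} = \delta_{ki} z_{mi}, \qquad
\frac{\partial R_i}{\partial \alpha_{m\ell}} = s_{mi} x_{i\ell}. \tag{11.14}
$$

The quantities $\delta_{ki}$ and $s_{mi}$ are “errors” from the current model at the output and hidden layer units, respectively.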

From their definitions, these errors satisfy

$$
s_{mi} = \sigma'(\alpha_m^T x_i) \sum_{k=1}^{K} \beta_{km} \delta_{ki}, \tag{11.15}
$$

known as the back-propagation equations. Using this, the updates in (11.13) can be implemented with a two-pass algorithm. In the forward pass, the current weights are fixed and the predicted values $\hat{f}_k(x_i)$ are computed from formula (11.5). In the backward pass, the errors $\delta_{ki}$ are computed, and then back-propagated via (11.15) to give the errors $s_{mi}$. Both sets of errors are then used to compute the gradients for the updates in (11.13), via (11.14).

This two-pass procedure is what is known as back-propagation.
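The two-pass procedure can be sketched for a single observation as follows. This is an illustration under stated assumptions rather than the book's implementation: the output function $g_k$ is taken to be the identity (so $g_k' = 1$, the regression case), the intercepts $\alpha_{0m}$ and $\beta_{0k}$ are handled explicitly as weights on a constant input of 1, and all function and variable names are made up here. Comparing (11.12) with (11.14) gives $\delta_{ki} = -2(y_{ik} - f_k(x_i))\,g_k'(\beta_k^T z_i)$, which is what the code uses.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, alpha0, alpha, beta0, beta):
    """Forward pass: hidden values z_m and outputs f_k (identity output function)."""
    z = [sigmoid(alpha0[m] + sum(a * xl for a, xl in zip(alpha[m], x)))
         for m in range(len(alpha))]
    f = [beta0[k] + sum(b * zm for b, zm in zip(beta[k], z))
         for k in range(len(beta))]
    return z, f

def loss(x, y, alpha0, alpha, beta0, beta):
    """R_i = sum_k (y_k - f_k(x))^2 for one observation."""
    _, f = forward(x, alpha0, alpha, beta0, beta)
    return sum((yk - fk) ** 2 for yk, fk in zip(y, f))

def backprop(x, y, alpha0, alpha, beta0, beta):
    """Backward pass: gradients of R_i with respect to all weights."""
    z, f = forward(x, alpha0, alpha, beta0, beta)
    # Output-layer errors delta_k = -2 (y_k - f_k) g_k'(.), with g_k' = 1 here.
    delta = [-2.0 * (yk - fk) for yk, fk in zip(y, f)]
    # Back-propagation equation (11.15); sigma'(v) = sigma(v) * (1 - sigma(v)).
    s = [z[m] * (1.0 - z[m]) * sum(beta[k][m] * delta[k] for k in range(len(beta)))
         for m in range(len(alpha))]
    # Gradients via (11.14); intercept gradients use a constant input of 1.
    g_beta = [[dk * zm for zm in z] for dk in delta]
    g_alpha = [[sm * xl for xl in x] for sm in s]
    return list(s), g_alpha, list(delta), g_beta
```

A finite-difference comparison against `loss` is a convenient way to confirm that the returned gradients match the derivatives in (11.12).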

It has also been called the delta rule (Widrow and Hoff, 1960). The computational components for cross-entropy have the same form as those for the sum of squares error function, and are derived in Exercise 11.3.

The advantage of back-propagation is its simple, local nature. In the back-propagation algorithm, each hidden unit passes and receives information only to and from units that share a connection. Hence it can be implemented efficiently on a parallel architecture computer.

The updates in (11.13) are a kind of batch learning, with the parameter updates being a sum over all of the training cases. Learning can also be carried out online, processing each observation one at a time, updating the gradient after each training case, and cycling through the training cases many times.

In this case, the sums in equations (11.13) are replaced by a single summand. A training epoch refers to one sweep through the entire training set. Online training allows the network to handle very large training sets, and also to update the weights as new observations come in.

The learning rate $\gamma_r$ for batch learning is usually taken to be a constant, and can also be optimized by a line search that minimizes the error function at each update.

With online learning $\gamma_r$ should decrease to zero as the iteration $r \to \infty$. This learning is a form of stochastic approximation (Robbins and Monro, 1951); results in this field ensure convergence if $\gamma_r \to 0$, $\sum_r \gamma_r = \infty$, and $\sum_r \gamma_r^2 < \infty$ (satisfied, for example, by $\gamma_r = 1/r$).

Back-propagation can be very slow, and for that reason is usually not the method of choice. Second-order techniques such as Newton’s method are not attractive here, because the second derivative matrix of $R$ (the Hessian) can be very large. Better approaches to fitting include conjugate gradients and variable metric methods.
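The difference between batch and online updates, and the role of the decreasing rate $\gamma_r = 1/r$, can be seen on a deliberately simple toy problem. The sketch below is not a neural network: it assumes a single parameter $\theta$ and per-case loss $R_i = \tfrac{1}{2}(y_i - \theta)^2$, so that $\partial R_i / \partial \theta = \theta - y_i$ and the minimizer is the sample mean. The function names and default values are illustrative.

```python
def batch_gd(ys, theta=0.0, gamma=0.1, n_iter=100):
    """Batch learning: each update sums the gradient over all training cases."""
    for _ in range(n_iter):
        grad = sum(theta - y for y in ys)
        theta -= gamma * grad          # constant learning rate gamma
    return theta

def online_gd(ys, theta=0.0, n_epochs=1):
    """Online learning: one case per update, with learning rate gamma_r = 1 / r."""
    r = 0
    for _ in range(n_epochs):          # one epoch = one sweep through the data
        for y in ys:
            r += 1
            theta -= (1.0 / r) * (theta - y)
    return theta

ys = [1.0, 2.0, 4.0, 5.0]
print(batch_gd(ys), online_gd(ys), online_gd(ys, n_epochs=3))
```

In this toy case the online iterate with $\gamma_r = 1/r$ equals the running mean of the observations seen so far, so it reaches the sample mean at the end of every epoch; the batch version with a constant rate converges geometrically provided the rate is small enough (here $\gamma < 2/N$).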
