Building machine learning systems with Python (779436), page 41

File No. 779436: Building machine learning systems with Python, page 41 (2017-12-26, StudIzba)

Text from the file (page 41)

So, we still have to specify where to make the cut: which part of the features are we willing to keep, and which part do we want to drop?

Coming back to scikit-learn, we find various excellent wrapper classes in the sklearn.feature_selection package. A real workhorse in this field is RFE, which stands for recursive feature elimination. It takes an estimator and the desired number of features to keep as parameters and then trains the estimator on successively smaller feature sets until it has found a subset of features that is small enough.

The RFE instance itself pretends to be an estimator, thereby wrapping the provided estimator.

In the following example, we create an artificial classification problem of 100 samples using the convenient make_classification() function from sklearn.datasets. It lets us specify the creation of 10 features, out of which only three are really valuable for solving the classification problem:

>>> from sklearn.feature_selection import RFE
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=100, n_features=10,
...                            n_informative=3, random_state=0)
>>> clf = LogisticRegression()
>>> clf.fit(X, y)
>>> selector = RFE(clf, n_features_to_select=3)
>>> selector = selector.fit(X, y)
>>> print(selector.support_)
[False  True False  True False False False False  True False]
>>> print(selector.ranking_)
[4 1 3 1 8 5 7 6 1 2]

The problem in real-world scenarios is, of course, how can we know the right value for n_features_to_select? The truth is, we can't.

However, most of the time we can use a sample of the data and play with it using different settings to quickly get a feeling for the right ballpark.

[Chapter 11]

The good thing is that we don't have to be that exact using wrappers. Let's try different values for n_features_to_select to see how support_ and ranking_ change:

n_features_to_select  support_                                                       ranking_
1                     [False False False  True False False False False False False]  [ 6  3  5  1 10  7  9  8  2  4]
2                     [False False False  True False False False False  True False]  [5 2 4 1 9 6 8 7 1 3]
3                     [False  True False  True False False False False  True False]  [4 1 3 1 8 5 7 6 1 2]
4                     [False  True False  True False False False False  True  True]  [3 1 2 1 7 4 6 5 1 1]
5                     [False  True  True  True False False False False  True  True]  [2 1 1 1 6 3 5 4 1 1]
6                     [ True  True  True  True False False False False  True  True]  [1 1 1 1 5 2 4 3 1 1]
7                     [ True  True  True  True False  True False False  True  True]  [1 1 1 1 4 1 3 2 1 1]
8                     [ True  True  True  True False  True False  True  True  True]  [1 1 1 1 3 1 2 1 1 1]
9                     [ True  True  True  True False  True  True  True  True  True]  [1 1 1 1 2 1 1 1 1 1]
10                    [ True  True  True  True  True  True  True  True  True  True]  [1 1 1 1 1 1 1 1 1 1]

We see that the result is very stable.
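A table like the one above can be produced with a short loop. The following is a minimal sketch, assuming a reasonably recent scikit-learn; the exact masks and rankings may differ from the printed table depending on the library version. It also checks the stability claim directly: with RFE's default of removing one feature per round and a deterministic estimator, every run is expected to follow the same elimination order, so each selection should contain the previous one.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Same artificial problem as in the text: 10 features, 3 of them informative.
X, y = make_classification(n_samples=100, n_features=10,
                           n_informative=3, random_state=0)

supports = []
for k in range(1, X.shape[1] + 1):
    # A fresh estimator per run; RFE fits its own clone of it anyway.
    selector = RFE(LogisticRegression(), n_features_to_select=k).fit(X, y)
    supports.append(selector.support_)
    print(k, selector.support_.astype(int), selector.ranking_)

# Stability check: every feature kept for k must also be kept for k + 1.
nested = all((small <= large).all()
             for small, large in zip(supports, supports[1:]))
print("selections are nested:", nested)
```

The boolean masks are printed as 0/1 only to keep each row short.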

Features that have been used when requesting smaller feature sets keep on getting selected when letting more features in. Finally, we rely on our train/test set splitting to warn us when we go the wrong way.

Other feature selection methods

There are several other feature selection methods that you will discover while reading through machine learning literature. Some don't even look like feature selection methods because they are embedded into the learning process (not to be confused with the aforementioned wrappers). Decision trees, for instance, have a feature selection mechanism implanted deep in their core. Other learning methods employ some kind of regularization that punishes model complexity, thus driving the learning process towards well-performing models that are still "simple".

They do this by decreasing the importance of the less impactful features to zero and then dropping them (L1-regularization).

[Dimensionality Reduction]

So watch out! Often, the power of machine learning methods has to be attributed to their implanted feature selection method to a great degree.

Feature extraction

At some point, after we have removed redundant features and dropped irrelevant ones, we often still find that we have too many features.

No matter what learning method we use, they all perform badly, and given the huge feature space, we understand that they actually cannot do better. We realize that we have to cut living flesh; we have to get rid of features for which all common sense tells us that they are valuable. Another situation in which we need to reduce the dimensions and feature selection does not help much is when we want to visualize data. Then, we need to have at most three dimensions at the end to provide any meaningful graphs.

Enter feature extraction methods.

They restructure the feature space to make it more accessible to the model or simply cut down the dimensions to two or three so that we can show dependencies visually.

Again, we can distinguish feature extraction methods as being linear or non-linear. Also, as in the Selecting features section, we will present one method for each type (principal component analysis as a linear method and multidimensional scaling as a non-linear one). Although they are widely known and used, they are only representatives of many more interesting and powerful feature extraction methods.

About principal component analysis

Principal component analysis (PCA) is often the first thing to try out if you want to cut down the number of features and do not know what feature extraction method to use.

PCA is limited as it's a linear method, but chances are that it already goes far enough for your model to learn well enough. Add to this the strong mathematical properties it offers, the speed at which it finds the transformed feature space, and its ability to later transform between original and transformed features, and we can almost guarantee that it will also become one of your frequently used machine learning tools.

In summary, given the original feature space, PCA finds a linear projection of it into a lower-dimensional space that has the following properties:

• The conserved variance is maximized.
• The final reconstruction error (when trying to go back from transformed features to original ones) is minimized.

As PCA simply transforms the input data, it can be applied both to classification and regression problems.
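The two properties listed above are two sides of the same coin, which a small experiment can make visible. The following is a minimal sketch on hypothetical data (three correlated features generated here, not taken from the text): it projects to k dimensions, maps back with inverse_transform(), and compares the variance that was kept with the error made on the way back.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 samples, 3 features, two of them strongly correlated.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([latent,
               2.0 * latent + 0.3 * rng.normal(size=(200, 1)),
               0.5 * rng.normal(size=(200, 1))])

total_variance = X.var(axis=0, ddof=1).sum()
errors = []
for k in (1, 2, 3):
    pca = PCA(n_components=k).fit(X)
    X_back = pca.inverse_transform(pca.transform(X))
    # Squared reconstruction error, scaled like a sample variance (n - 1).
    error = ((X - X_back) ** 2).sum() / (len(X) - 1)
    retained = pca.explained_variance_.sum()
    errors.append(error)
    print(f"k={k}: retained variance={retained:.3f}, "
          f"reconstruction error={error:.3f}")
    # Whatever variance is not retained is what is lost on the way back.
    assert np.isclose(retained + error, total_variance)
```

With all three components kept, nothing is discarded and the reconstruction error should vanish up to floating-point noise.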

In this section, we will use a classification task to discuss the method.

Sketching PCA

PCA involves a lot of linear algebra, which we do not want to go into. Nevertheless, the basic algorithm can be easily described as follows:

1. Center the data by subtracting the mean from it.
2. Calculate the covariance matrix.
3. Calculate the eigenvectors of the covariance matrix.

If we start with N features, then the algorithm will return a transformed feature space again with N dimensions (we gained nothing so far). The nice thing about this algorithm, however, is that the eigenvalues indicate how much of the variance is described by the corresponding eigenvector.

Let's assume we start with N = 1000 features and we know that our model does not work well with more than 20 features. Then, we simply pick the 20 eigenvectors with the highest eigenvalues.

Applying PCA

Let's consider the following artificial dataset, which is visualized in the left-hand plot of the accompanying figure:

>>> import numpy as np
>>> x1 = np.arange(0, 10, .2)
>>> x2 = x1 + np.random.normal(loc=0, scale=1, size=len(x1))
>>> X = np.c_[(x1, x2)]
>>> good = (x1 > 5) | (x2 > 5)  # some arbitrary classes
>>> bad = ~good                 # to make the example look good

Scikit-learn provides the PCA class in its decomposition package.

In this example, we can clearly see that one dimension should be enough to describe the data. We can specify this using the n_components parameter:

>>> from sklearn import linear_model, decomposition, datasets
>>> pca = decomposition.PCA(n_components=1)

Here, too, we can use the fit() and transform() methods of pca (or its fit_transform() combination) to analyze the data and project it into the transformed feature space:

>>> Xtrans = pca.fit_transform(X)

As we have specified, Xtrans contains only one dimension.
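As a cross-check, the three steps listed under Sketching PCA can be carried out by hand with NumPy and compared with what the PCA class returns. This is a minimal sketch, not scikit-learn's actual implementation (which works via a singular value decomposition); the fixed random seed is an assumption added here for reproducibility.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same construction as the dataset in the text, with a fixed seed
# (the seed is an assumption made here for reproducibility).
rng = np.random.default_rng(0)
x1 = np.arange(0, 10, .2)
x2 = x1 + rng.normal(loc=0, scale=1, size=len(x1))
X = np.c_[x1, x2]

# 1. Center the data by subtracting the mean.
X_centered = X - X.mean(axis=0)
# 2. Calculate the covariance matrix (features are columns).
cov = np.cov(X_centered, rowvar=False)
# 3. Calculate eigenvalues and eigenvectors. eigh suits symmetric matrices
#    and returns eigenvalues in ascending order, so reorder them.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Keep the eigenvector with the largest eigenvalue and project onto it.
X_manual = X_centered @ eigenvectors[:, :1]
variance_share = eigenvalues[0] / eigenvalues.sum()

X_sklearn = PCA(n_components=1).fit_transform(X)
# The sign of an eigenvector is arbitrary, so compare up to a sign flip.
same = np.allclose(X_manual, X_sklearn) or np.allclose(X_manual, -X_sklearn)
print("matches scikit-learn up to sign:", same)
print("share of variance on the first eigenvector:", variance_share)
```

The share printed last is the first eigenvalue divided by the sum of all eigenvalues, that is, the fraction of the total variance described by the kept direction.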

You can see the result in the right-hand plot of the preceding figure. The outcome is even linearly separable in this case. We would not even need a complex classifier to distinguish between both classes.

To get an understanding of the reconstruction error, we can have a look at the variance of the data that we have retained in the transformation:

>>> print(pca.explained_variance_ratio_)
[ 0.96393127]

This means that after going from two to one dimension, we are still left with 96 percent of the variance.

Of course, it's not always this simple.

Oftentimes, we don't know upfront what number of dimensions is advisable. In that case, we leave the n_components parameter unspecified when initializing PCA to let it calculate the full transformation. After fitting the data, explained_variance_ratio_ contains an array of ratios in decreasing order: the first value is the ratio of the basis vector describing the direction of the highest variance, the second value is the ratio of the direction of the second highest variance, and so on. After plotting this array, we quickly get a feel for how many components we would need: the number of components immediately before the chart has its elbow is often a good guess.

Plots displaying the explained variance over the number of components are called scree plots. A nice example of combining a scree plot with a grid search to find the best setting for the classification problem can be found at http://scikit-learn.sourceforge.net/stable/auto_examples/plot_digits_pipe.html.

Limitations of PCA and how LDA can help

Being a linear method, PCA has, of course, its limitations when we are faced with data that has non-linear relationships.
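One way to see this limitation is a dataset whose class structure lies in the distance from the origin rather than along any straight direction. The following is a minimal sketch on hypothetical data (two concentric circles from make_circles(), not an example from the text): it compares the value ranges of each class after a one-dimensional PCA projection with those of a hand-crafted non-linear feature, the radius.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA

# Hypothetical data (not from the text): a small circle inside a large one.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear view: project onto the first principal component.
pca = PCA(n_components=1)
X_1d = pca.fit_transform(X).ravel()
print("variance retained by one component:", pca.explained_variance_ratio_[0])

# Non-linear view: distance from the origin, a hand-crafted feature.
radius = np.hypot(X[:, 0], X[:, 1])

for name, feature in (("PCA projection", X_1d), ("radius", radius)):
    for label in np.unique(y):
        values = feature[y == label]
        print(f"{name}, class {label}: "
              f"min={values.min():.2f}, max={values.max():.2f}")
```

On this data, the projected ranges of the two classes should overlap, because the inner circle lands in the middle of the outer one, whereas the radius keeps them apart. No single direction carries most of the variance either, so the first component is not expected to retain much more than half of it.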

File details

File type: PDF
Size: 6.49 MB
Material: Building machine learning systems with Python
Material type: Book
Subject: Automatic control systems (САУ) (МТ-11)
Institution: Bauman Moscow State Technical University (МГТУ им. Н.Э.Баумана)

Files in the book archive

building-machine-learning-systems-with-python-1474685854-1514288745.rar
Building machine learning systems with Python.pdf
