Building machine learning systems with Python


Jug was also designed to work very well in batch computing environments, which use queuing systems such as PBS (Portable Batch System), LSF (Load Sharing Facility), or Grid Engine. This will be used in the second half of the chapter as we build online clusters and dispatch jobs to them.

An introduction to tasks in jug

Tasks are the basic building block of jug. A task is composed of a function and values for its arguments.

Consider this simple example:

    def double(x):
        return 2*x

In this chapter, the code examples will generally have to be typed in script files. Thus, they will not be shown with the >>> marker. Commands that should be typed at the shell will be indicated by preceding them with $.

A task could be "call double with argument 3". Another task would be "call double with argument 642.34". Using jug, we can build these tasks as follows:

    from jug import Task

    t1 = Task(double, 3)
    t2 = Task(double, 642.34)

Save this to a file called jugfile.py (which is just a regular Python file). Now, we can run jug execute to run the tasks.

This is something you type on the command line, not at the Python prompt, so we show it marked with a dollar sign ($):

    $ jug execute

You will also get some feedback on the tasks (jug will say that two tasks named double were run). Run jug execute again and it will tell you that it did nothing! It does not need to. In this case, we gained little, but if the tasks took a long time to compute, it would have been very useful.

You may notice that a new directory named jugfile.jugdata also appeared on your hard drive, with a few weirdly named files.

This is the memoization cache. If you remove it, jug execute will run all your tasks again.

Often, it's good to distinguish pure functions, which simply take their inputs and return a result, from more general functions that can perform actions (such as reading from files, writing to files, accessing global variables, modifying their arguments, or anything that the language allows).

Some programming languages, such as Haskell, even have syntactic ways to distinguish pure from impure functions.

With jug, your tasks do not need to be perfectly pure. It's even recommended that you use tasks to read in your data or write out your results. However, accessing and modifying global variables will not work well: the tasks may be run in any order on different processors.

The exception is global constants, but even these may confuse the memoization system (if the value is changed between runs). Similarly, you should not modify the input values. Jug has a debug mode (use jug execute --debug), which slows down your computation, but will give you useful error messages if you make this sort of mistake.

The preceding code works, but is a bit cumbersome. You are always repeating the Task(function, argument) construct.

Using a bit of Python magic, we can make the code even more natural as follows:

    from jug import TaskGenerator
    from time import sleep

    @TaskGenerator
    def double(x):
        sleep(4)
        return 2*x

    @TaskGenerator
    def add(a, b):
        return a + b

    @TaskGenerator
    def print_final_result(oname, value):
        with open(oname, 'w') as output:
            output.write('Final result: {}\n'.format(value))

    y = double(2)
    z = double(y)

    y2 = double(7)
    z2 = double(y2)
    print_final_result('output.txt', add(z, z2))

Except for the use of TaskGenerator, the preceding code could be a standard Python file! However, using TaskGenerator, it actually creates a series of tasks and it is now possible to run it in a way that takes advantage of multiple processors.
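To make the idea of "calling a function creates a task" concrete, here is a minimal sketch in plain Python. It is not jug's implementation: LazyTask and task_generator are hypothetical names invented for this illustration, and the sketch does no caching, locking, or parallel execution. It only shows how a decorator can defer a call and how passing one task to another records a dependency.

```python
class LazyTask:
    """A deferred call: a function plus its arguments (hypothetical, not jug's Task)."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def run(self):
        # Arguments that are tasks themselves are dependencies: resolve them first.
        values = [a.run() if isinstance(a, LazyTask) else a for a in self.args]
        return self.func(*values)


def task_generator(func):
    """Decorator: calling the decorated function builds a LazyTask instead of executing."""
    def build(*args):
        return LazyTask(func, *args)
    return build


@task_generator
def double(x):
    return 2 * x


@task_generator
def add(a, b):
    return a + b


z = double(double(2))
z2 = double(double(7))
total = add(z, z2)      # nothing has been computed yet; total is a LazyTask
result = total.run()    # walks the dependencies and computes the value
```

Here total only describes the computation; the arithmetic happens when run() is called. A real system such as jug adds result storage and coordination between processes on top of this basic shape.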

Behind the scenes, the decorator transforms your functions so that they do not actually execute when called, but create a Task object. We also take advantage of the fact that we can pass tasks to other tasks and this results in a dependency being generated.

You may have noticed that we added a sleep(4) call to double in the preceding code. This simulates running a long computation. Otherwise, this example is so fast that there is no point in using multiple processors.

We start by running jug status, which results in the output shown in the following screenshot:

[Screenshot: jug status output]

Now, we start two processes simultaneously (using the & operator to run them in the background):

    $ jug execute &
    $ jug execute &

Now, we run jug status again:

[Screenshot: jug status output]

We can see that the two initial double tasks are running at the same time. After about 8 seconds, the whole process will finish and the output.txt file will be written.

By the way, if your file was called anything other than jugfile.py, you would then have to specify it explicitly on the command line.

For example, if your file was called analysis.py, you would run the following command:

    $ jug execute analysis.py

This is the only disadvantage of not using the name jugfile.py. So, feel free to use more meaningful names.

Looking under the hood

How does jug work? At the basic level, it's very simple. A Task is a function plus its arguments. Its arguments may be either values or other tasks. If a task takes other tasks, there is a dependency between the two tasks (and the second one cannot be run until the results of the first task are available).

Based on this, jug recursively computes a hash for each task. This hash value encodes the whole computation to get the result.
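The recursive hashing idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not jug's actual hashing scheme: SketchTask and task_hash are hypothetical names, the function is identified only by its name, and plain values are hashed through their repr.

```python
import hashlib


class SketchTask:
    """A function plus arguments; arguments may be values or other SketchTasks."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args


def task_hash(task):
    """Hash the function name and, recursively, every argument."""
    h = hashlib.sha1()
    h.update(task.func.__name__.encode("utf-8"))
    for arg in task.args:
        if isinstance(arg, SketchTask):
            # A task argument contributes its own hash, so the result
            # encodes the whole chain of computations behind it.
            h.update(b"task:" + task_hash(arg).encode("ascii"))
        else:
            # Assumption for this sketch: repr() identifies a plain value.
            h.update(b"value:" + repr(arg).encode("utf-8"))
    return h.hexdigest()


def double(x):
    return 2 * x


a = SketchTask(double, 2)
b = SketchTask(double, a)            # depends on a

a_changed = SketchTask(double, 3)    # same function, different argument
b_changed = SketchTask(double, a_changed)
```

Two tasks built from the same function and arguments get the same hash, while changing the argument of a changes the hash of a and of every task that depends on it, such as b.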

When you run jug execute, for each task, there is a little loop that runs the logic depicted in the following flowchart:

[Flowchart: per-task logic of jug execute]

The default backend writes the file to disk (in this funny directory named jugfile.jugdata/). Another backend is available, which uses a Redis database. With proper locking, which jug takes care of, this also allows for many processors to execute tasks; each process will independently look at all the tasks and run the ones that have not run yet and then write them back to the shared backend.
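Since the flowchart is not reproduced here, the following sketch shows one plausible reading of the loop described in the text: skip tasks whose result is already stored, skip tasks whose dependencies are missing or that someone else has locked, and otherwise run the task and store its result. Everything in it is an assumption made for illustration: DictBackend and execute are hypothetical names, the store is an in-memory dictionary rather than a disk directory or Redis, and tasks are keyed by name instead of by hash.

```python
class DictBackend:
    """In-memory stand-in for a shared result store (hypothetical)."""

    def __init__(self):
        self.results = {}
        self.locks = set()

    def has(self, key):
        return key in self.results

    def try_lock(self, key):
        if key in self.locks:
            return False
        self.locks.add(key)
        return True

    def unlock(self, key):
        self.locks.discard(key)


def execute(tasks, backend):
    """Run every task that can run; tasks maps name -> (func, dependency names)."""
    ran = []
    progress = True
    while progress:
        progress = False
        for name, (func, deps) in tasks.items():
            if backend.has(name):
                continue    # result already stored: nothing to do
            if not all(backend.has(d) for d in deps):
                continue    # dependencies not available yet
            if not backend.try_lock(name):
                continue    # another worker is already running it
            try:
                args = [backend.results[d] for d in deps]
                backend.results[name] = func(*args)
                ran.append(name)
                progress = True
            finally:
                backend.unlock(name)
    return ran


# The same dependency structure as the double/add example, listed out of order.
tasks = {
    "total": (lambda z, z2: z + z2, ["z", "z2"]),
    "z": (lambda y: 2 * y, ["y"]),
    "y": (lambda: 2 * 2, []),
    "z2": (lambda y2: 2 * y2, ["y2"]),
    "y2": (lambda: 2 * 7, []),
}

backend = DictBackend()
first_run = execute(tasks, backend)     # computes all five tasks
second_run = execute(tasks, backend)    # finds every result already stored
```

The second call does no work because every result is already in the backend, which mirrors running jug execute twice in a row.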

This works on either the same machine (using multicore processors) or on multiple machines as long as they all have access to the same backend (for example, using a network disk or the Redis databases). In the second half of this chapter, we will discuss computer clusters, but for now let's focus on multiple cores.

You can also understand why it's able to memoize intermediate results. If the backend already has the result of a task, it's not run again. On the other hand, if you change the task, even in minute ways (by altering one of the parameters), its hash will change. Therefore, it will be rerun. Furthermore, all tasks that depend on it will also have their hashes changed and they will be rerun as well.

Using jug for data analysis

Jug is a generic framework, but it's ideally suited for medium-scale data analysis. As you develop your analysis pipeline, it's good to have intermediate results automatically saved.

If you have already computed the preprocessing step before and are only changing the features you compute, you do not want to recompute the preprocessing step. If you have already computed the features, but want to try combining a few new ones into the mix, you also do not want to recompute all your other features.

Jug is also specifically optimized to work with NumPy arrays. Whenever your tasks return or receive NumPy arrays, you are taking advantage of this optimization. Jug is another piece of this ecosystem where everything works together.

We will now look back at Chapter 10, Computer Vision.

In that chapter, we learned how to compute features on images. Remember that the basic pipeline consisted of the following steps:

• Loading image files
• Computing features
• Combining these features
• Normalizing the features
• Creating a classifier

We are going to redo this exercise, but this time with the use of jug. The advantage of this version is that it's now possible to add a new feature or classifier without having to recompute all of the pipeline.

We start with a few imports as follows:

    from jug import TaskGenerator
    import mahotas as mh
    from glob import glob

Now, we define the first task generators and feature computation functions:

    @TaskGenerator
    def compute_texture(im):
        from features import texture
        imc = mh.imread(im)
        return texture(mh.colors.rgb2gray(imc))

    @TaskGenerator
    def chist_file(fname):
        from features import chist
        im = mh.imread(fname)
        return chist(im)

The features module we import is the one from Chapter 10, Computer Vision.

We write functions that take the filename as input instead of the image array.

Using the full images would also work, of course, but this is a small optimization. A filename is a string, which is small if it gets written to the backend. It's also very fast to compute a hash if needed. It also ensures that the images are only loaded by the processes that need them.

We can use TaskGenerator on any function. This is true even for functions that we did not write, such as np.array or np.hstack:

    import numpy as np
    to_array = TaskGenerator(np.array)
    hstack = TaskGenerator(np.hstack)

    haralicks = []
    chists = []
    labels = []

    # Change this variable to point to
    # the location of the dataset on disk
    basedir = '../SimpleImageDataset/'
    # Use glob to get all the images
    images = glob('{}/*.jpg'.format(basedir))

    for fname in sorted(images):
        haralicks.append(compute_texture(fname))
        chists.append(chist_file(fname))
        # The class is encoded in the filename as xxxx00.jpg
        labels.append(fname[:-len('00.jpg')])

    haralicks = to_array(haralicks)
    chists = to_array(chists)
    labels = to_array(labels)

One small inconvenience of using jug is that we must always write functions to output the results to files, as shown in the preceding examples.

This is a small price to pay for the extra convenience of using jug.

    @TaskGenerator
    def accuracy(features, labels):
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import StandardScaler
        # model_selection replaces the cross_validation module,
        # which was removed in scikit-learn 0.20
        from sklearn import model_selection

        clf = Pipeline([('preproc', StandardScaler()),
                        ('classifier', LogisticRegression())])
        cv = model_selection.LeaveOneOut()
        scores = model_selection.cross_val_score(
            clf, features, labels, cv=cv)
        return scores.mean()

Note that we are only importing sklearn inside this function.
