Building Machine Learning Systems with Python
Fortunately, StarCluster has already done half the work. If this were a real project, we would set up a script to perform all the initialization for us; StarCluster can do this. As this is a tutorial, we just run the installation step again:

$ pip install jug mahotas scikit-learn

We can use the same jugfile system as before, except that now, instead of running it directly on the master, we schedule it on the cluster.

First, write a very simple wrapper script as follows:

#!/usr/bin/env bash
jug execute jugfile.py

Call it run-jugfile.sh and use chmod +x run-jugfile.sh to give it executable permission. Now, we can schedule sixteen jobs on the cluster by using the following command:

$ for c in $(seq 16); do
>    qsub -cwd run-jugfile.sh
> done

This will create 16 jobs, each of which will run the run-jugfile.sh script, which will simply call jug.
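The jugfile.py run by the wrapper is the one built earlier in the chapter. As a reminder of the general shape of a jugfile (a toy sketch with made-up task functions, not the chapter's feature-computation code), work is declared through jug's TaskGenerator decorator and executed by jug execute:

```python
# jugfile.py -- toy sketch of a jug script (the task functions are made up).
try:
    from jug import TaskGenerator
except ImportError:
    # Fallback so this sketch also runs where jug is not installed; jug's
    # real decorator turns each call into a deferred Task object instead.
    def TaskGenerator(f):
        return f

@TaskGenerator
def double(x):
    return 2 * x

@TaskGenerator
def add_all(values):
    return sum(values)

# Each call below is one task. Every `jug execute jugfile.py` process
# (for example, our 16 qsub jobs) picks up unfinished tasks and records
# results on the shared filesystem, so they cooperate automatically.
partials = [double(i) for i in range(10)]
total = add_all(partials)
```

With jug installed, running several `jug execute jugfile.py` processes at once is safe, and `jug status jugfile.py` shows progress at any time.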
You can still use the master as you wish. In particular, you can, at any moment, run jug status and see the status of the computation. In fact, jug was developed in exactly such an environment, so it works very well in it.

Eventually, the computation will finish. At this point, we first need to save our results; then, we can kill off all the nodes. We create a directory ~/results and copy our results there:

# mkdir ~/results
# cp results.image.txt ~/results

Now, log off the cluster back to our worker machine:

# exit

Now, we are back at our AWS machine (notice the $ sign in the next code examples). First, we copy the results back to this computer using the starcluster get command (which is the mirror image of the put command we used before):

$ starcluster get smallcluster results results

Finally, we should kill all the nodes to save money as follows:

$ starcluster stop smallcluster
$ starcluster terminate smallcluster

Note that terminating will really destroy the filesystem and all your results.
In our case, we have copied the final results to safety manually. Another possibility is to have the cluster write to a filesystem that is not allocated and destroyed by StarCluster, but is available to you on a regular instance; in fact, the flexibility of these tools is immense. However, these advanced manipulations could not all fit in this chapter.

StarCluster has excellent documentation online at http://star.mit.edu/cluster/, which you should read for more information about all the possibilities of this tool. We have seen only a small fraction of the functionality and used only the default settings here.

Summary

We saw how to use jug, a little Python framework to manage computations in a way that takes advantage of multiple cores or multiple machines.
Although this framework is generic, it was built specifically to address the data analysis needs of its author (who is also an author of this book). Therefore, it has several aspects that make it fit in with the rest of the Python machine learning environment.

You also learned about AWS, the Amazon cloud. Using cloud computing is often a more effective use of resources than building in-house computing capacity. This is particularly true if your needs are not constant and are changing.
StarCluster even allows for clusters that automatically grow as you launch more jobs and shrink as they terminate.

This is the end of the book. We have come a long way. You learned how to perform classification when data is labeled and clustering when it is not. You learned about dimensionality reduction and topic modeling to make sense of large datasets. Towards the end, we looked at some specific applications (such as music genre classification and computer vision). For the implementations, we relied on Python.
This language has an ever-expanding ecosystem of numeric computing packages built on top of NumPy. Whenever possible, we relied on scikit-learn, but used other packages when necessary. Because they all use the same basic data structure (the NumPy multidimensional array), it is possible to mix functionality from different packages seamlessly.
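As a small illustration of this point (plain NumPy only; the helper function here is hypothetical, standing in for any library routine written against the ndarray interface):

```python
import numpy as np

# The same ndarray moves between plain NumPy operations and any function
# written against the ndarray interface -- which is how scikit-learn,
# SciPy, and mahotas routines compose -- with no conversion step.
data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])

# A NumPy operation...
centered = data - data.mean(axis=0)

# ...feeding a (hypothetical) library-style helper:
def row_norms(X):
    """Row-wise Euclidean norms of a 2D array."""
    return np.sqrt((X ** 2).sum(axis=1))

norms = row_norms(centered)
print(norms)  # one norm per row of the centered data
```

Because every package reads and writes the same array type, a pipeline can hand data from one library to the next without copies or adapters.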
All of the packages used in this book are open source and available for use in any project.

Naturally, we did not cover every machine learning topic. In the Appendix, we provide pointers to a selection of other resources that will help interested readers learn more about machine learning.

Where to Learn More Machine Learning

We are at the end of our book and now take a moment to look at what else is out there that could be useful for our readers.

There are many wonderful resources out there to learn more about machine learning—way too many to cover them all here. The following list can therefore represent only a small, and very biased, sampling of the resources the authors think are best at the time of writing.

Online courses

Andrew Ng is a professor at Stanford who runs an online course in machine learning as a massive open online course at Coursera (http://www.coursera.org).
It is free of charge, but may represent a significant time investment.

Books

This book is focused on the practical side of machine learning. We did not present the thinking behind the algorithms or the theory that justifies them. If you are interested in that aspect of machine learning, then we recommend Pattern Recognition and Machine Learning by Christopher Bishop.
This is a classical introductory text in the field. It will teach you the nitty-gritty of most of the algorithms we used in this book.

If you want to move beyond the introduction and learn all the gory mathematical details, then Machine Learning: A Probabilistic Perspective by Kevin P. Murphy is an excellent option (www.cs.ubc.ca/~murphyk/MLbook). It is very recent (published in 2012) and contains the cutting edge of ML research. This 1100-page book can also serve as a reference, as very little of machine learning has been left out.

Question and answer sites

MetaOptimize (http://metaoptimize.com/qa) is a machine learning question and answer website where many very knowledgeable researchers and practitioners interact.

Cross Validated (http://stats.stackexchange.com) is a general statistics question and answer site, which often features machine learning questions as well.

As mentioned in the beginning of the book, if you have questions specific to particular parts of the book, feel free to ask them at TwoToReal (http://www.twotoreal.com). We try to be as quick as possible to jump in and help as best as we can.

Blogs

Here is an obviously non-exhaustive list of blogs that are interesting to someone working in machine learning:

• Machine Learning Theory: http://hunch.net
The average pace is approximately one post per month.
Posts are more theoretical. They offer additional value in brain teasers.

• Text & Data Mining by practical means: http://textanddatamining.blogspot.de
The average pace is one post per month, very practical, with always surprising approaches.

• Edwin Chen's Blog: http://blog.echen.me
The average pace is one post per month, providing more applied topics.

• Machined Learnings: http://www.machinedlearnings.com
The average pace is one post per month, providing more applied topics.

• FlowingData: http://flowingdata.com
The average pace is one post per day, with the posts revolving more around statistics.

• Simply Statistics: http://simplystatistics.org
Several posts per month, focusing on statistics and big data.

• Statistical Modeling, Causal Inference, and Social Science: http://andrewgelman.com
One post per day, with often funny reads when the author points out flaws in popular media, using statistics.

Data sources

If you want to play around with algorithms, you can obtain many datasets from the Machine Learning Repository at the University of California at Irvine (UCI).
You can find it at http://archive.ics.uci.edu/ml.

Getting competitive

An excellent way to learn more about machine learning is by trying out a competition! Kaggle (http://www.kaggle.com) is a marketplace for ML competitions and was already mentioned in the introduction. On the website, you will find several different competitions with different structures and often cash prizes.

The supervised learning competitions almost always follow the same format: you (and every other competitor) are given access to labeled training data and testing data (without labels). Your task is to submit predictions for the testing data.
When the competition closes, whoever has the best accuracy wins. The prizes range from glory to cash.

Of course, winning something is nice, but you can gain a lot of useful experience just by participating. So, stay tuned after the competition is over, as participants start sharing their approaches in the forum. Most of the time, winning is not about developing a new algorithm, but about cleverly preprocessing, normalizing, and combining existing methods.

All that was left out

We did not cover every machine learning package available for Python.
Given the limited space, we chose to focus on scikit-learn. However, there are other options, and we list a few of them here:

• MDP toolkit (http://mdp-toolkit.sourceforge.net): Modular toolkit for data processing
• PyBrain (http://pybrain.org): Python-based Reinforcement Learning, Artificial Intelligence, and Neural Network Library
• Machine Learning Toolkit (Milk) (http://luispedro.org/software/milk): This package was developed by one of the authors of this book and covers some algorithms and techniques that are not included in scikit-learn
• Pattern (http://www.clips.ua.ac.be/pattern): A package that combines web mining, natural language processing, and machine learning, having wrapper APIs for Google, Twitter, and Wikipedia

A more general resource is http://mloss.org, which is a repository of open source machine learning software.
As is usually the case with repositories such as this one, the quality varies between excellent, well-maintained software and one-off projects that were later abandoned. It may be worth checking out if your problem is very specific and none of the more general packages address it.

Summary

We are now truly at the end. We hope you enjoyed the book and feel well equipped to start your own machine learning adventure.

We also hope you learned the importance of carefully testing your methods.
In particular, we hope you learned the importance of using a correct cross-validation method and of not reporting results on the training data, which are an over-inflated estimate of how good your method really is.
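That warning about training-set scores can be made concrete with a short sketch (plain NumPy, hypothetical toy data): a 1-nearest-neighbor classifier memorizes its training set, so its accuracy measured on the training data is a perfect 1.0, while accuracy on held-out data gives the honest estimate that proper cross-validation would report.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical toy data: labels depend on the single feature plus
# substantial noise, so no classifier can honestly reach 100% accuracy.
X = rng.randn(200, 1)
y = (X[:, 0] + 0.8 * rng.randn(200) > 0).astype(int)

X_train, y_train = X[:100], y[:100]
X_test, y_test = X[100:], y[100:]

def one_nn_predict(X_tr, y_tr, X_new):
    """Predict each point with the label of its nearest training point."""
    dists = np.abs(X_new[:, None, 0] - X_tr[None, :, 0])
    return y_tr[dists.argmin(axis=1)]

train_acc = (one_nn_predict(X_train, y_train, X_train) == y_train).mean()
test_acc = (one_nn_predict(X_train, y_train, X_test) == y_test).mean()

print(train_acc)  # 1.0: every training point is its own nearest neighbor
print(test_acc)   # lower: the realistic estimate of generalization
```

The gap between the two numbers is exactly why held-out evaluation, not training-set performance, must be reported.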