Whatever numerically heavy algorithm you take from current books on numerical recipes, you will most likely find it supported in SciPy in one way or another. Whether it is matrix manipulation, linear algebra, optimization, clustering, spatial operations, or even fast Fourier transformation, the toolbox is readily filled. Therefore, it is a good habit to always inspect the scipy module before you start implementing a numerical algorithm.

For convenience, the complete namespace of NumPy is also accessible via SciPy.
So, from now on, we will use NumPy's machinery via the SciPy namespace. You can check this easily by comparing the function references of any base function, such as:

>>> import scipy, numpy
>>> scipy.version.full_version
'0.14.0'
>>> scipy.dot is numpy.dot
True

The diverse algorithms are grouped into the following toolboxes:

SciPy packages   Functionalities
cluster          Hierarchical clustering (cluster.hierarchy); vector quantization / k-means (cluster.vq)
constants        Physical and mathematical constants; conversion methods
fftpack          Discrete Fourier transform algorithms
integrate        Integration routines
interpolate      Interpolation (linear, cubic, and so on)
io               Data input and output
linalg           Linear algebra routines using the optimized BLAS and LAPACK libraries
ndimage          n-dimensional image package
odr              Orthogonal distance regression
optimize         Optimization (finding minima and roots)
signal           Signal processing
sparse           Sparse matrices
spatial          Spatial data structures and algorithms
special          Special mathematical functions such as Bessel or Jacobian
stats            Statistics toolkit

The toolboxes most interesting to our endeavor are scipy.stats, scipy.interpolate, scipy.cluster, and scipy.signal.
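As a quick taste of the stats package, which we will rely on most, here is a minimal sketch; the sample values are made up purely for illustration, while scoreatpercentile() and the norm distribution object are regular parts of scipy.stats:

>>> from scipy import stats
>>> sample = numpy.array([1.2, 2.3, 1.9, 3.1, 2.8])
>>> stats.scoreatpercentile(sample, 50)   # the median of our made-up sample
2.3
>>> stats.norm.cdf(0.0)                   # P(X <= 0) for a standard normal
0.5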
For the sake of brevity, we only dip into the stats package here and leave the other toolboxes to be explained when they show up in the individual chapters.

Our first (tiny) application of machine learning

Let's get our hands dirty and take a look at our hypothetical web start-up, MLaaS, which sells the service of providing machine learning algorithms via HTTP. With the increasing success of our company, the demand for better infrastructure grows so that we can serve all incoming web requests successfully. We don't want to allocate too many resources, as that would be too costly.
On the other hand, we will lose money if we have not reserved enough resources to serve all incoming requests. Now, the question is: when will we hit the limit of our current infrastructure, which we estimated to be at 100,000 requests per hour? We would like to know in advance when we have to request additional servers in the cloud to serve all the incoming requests successfully, without paying for unused ones.

Reading in the data

We have collected the web stats for the last month and aggregated them in ch01/data/web_traffic.tsv (.tsv because it contains tab-separated values). They are stored as the number of hits per hour.
Each line contains the hour (numbered consecutively) and the number of web hits in that hour. The first few lines look like the following (the original shows a screenshot of the raw file; the values here are reconstructed from the parsed data below, with the two columns separated by a tab):

1	2272
2	nan
3	1386
4	1365
5	1488

Using SciPy's genfromtxt(), we can easily read in the data using the following code:

>>> import scipy as sp
>>> data = sp.genfromtxt("web_traffic.tsv", delimiter="\t")

We have to specify tab as the delimiter so that the columns are correctly determined.

A quick check shows that we have correctly read in the data:

>>> print(data[:10])
[[  1.00000000e+00   2.27200000e+03]
 [  2.00000000e+00              nan]
 [  3.00000000e+00   1.38600000e+03]
 [  4.00000000e+00   1.36500000e+03]
 [  5.00000000e+00   1.48800000e+03]
 [  6.00000000e+00   1.33700000e+03]
 [  7.00000000e+00   1.88300000e+03]
 [  8.00000000e+00   2.28300000e+03]
 [  9.00000000e+00   1.33500000e+03]
 [  1.00000000e+01   1.02500000e+03]]
>>> print(data.shape)
(743, 2)

As you can see, we have 743 data points with two dimensions.
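Two details of genfromtxt() are worth a quick check here: it parses numeric columns as float64 and turns fields it cannot parse into nan, which is why the missing measurement above shows up as nan. It also accepts unpack=True to hand back the columns separately, an alternative to the index-based splitting we use in the next section (an illustrative session; x_alt and y_alt are our own names):

>>> print(data.dtype)
float64
>>> x_alt, y_alt = sp.genfromtxt("web_traffic.tsv", delimiter="\t", unpack=True)
>>> x_alt.shape
(743,)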
Preprocessing and cleaning the data

It is more convenient for SciPy to separate the dimensions into two vectors, each of size 743. The first vector, x, will contain the hours, and the other, y, will contain the web hits in that particular hour. This splitting is done using the special index notation of SciPy, by which we can choose the columns individually:

x = data[:,0]
y = data[:,1]

There are many more ways in which data can be selected from a SciPy array. Check out http://www.scipy.org/Tentative_NumPy_Tutorial for more details on indexing, slicing, and iterating.

One caveat remains: some values in y are invalid, namely nan. The question is what we can do with them. Let's check how many hours contain invalid data by running the following code:

>>> sp.sum(sp.isnan(y))
8

As you can see, we are missing only 8 out of 743 entries, so we can afford to remove them. Remember that we can index a SciPy array with another array. sp.isnan(y) returns an array of Booleans indicating which entries are not valid numbers. Using ~, we logically negate that array so that we choose only those elements from x and y where y contains valid numbers:

>>> x = x[~sp.isnan(y)]
>>> y = y[~sp.isnan(y)]
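To make this boolean-mask indexing concrete, here is a tiny standalone example with toy values, not the web stats (we use NumPy directly, which behaves the same as the sp aliases above, since SciPy re-exports these functions in the version used here):

>>> import numpy as np
>>> a = np.array([10., np.nan, 30., np.nan, 50.])
>>> mask = np.isnan(a)
>>> print(mask)
[False  True False  True False]
>>> print(a[~mask])
[ 10.  30.  50.]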
To get a first impression of our data, let's plot the data in a scatter plot using matplotlib. matplotlib contains the pyplot package, which tries to mimic MATLAB's interface, a very convenient and easy-to-use one, as you can see in the following code:

>>> import matplotlib.pyplot as plt
>>> # plot the (x,y) points with dots of size 10
>>> plt.scatter(x, y, s=10)
>>> plt.title("Web traffic over the last month")
>>> plt.xlabel("Time")
>>> plt.ylabel("Hits/hour")
>>> plt.xticks([w*7*24 for w in range(10)],
...     ['week %i' % w for w in range(10)])
>>> plt.autoscale(tight=True)
>>> # draw a slightly opaque grid
>>> plt.grid(True, linestyle='-', color='0.75')
>>> plt.show()

You can find more tutorials on plotting at http://matplotlib.org/users/pyplot_tutorial.html.

In the resulting chart, we can see that while in the first weeks the traffic stayed more or less the same, the last week shows a steep increase.
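Since we will want to redraw this chart as our analysis evolves, it can help to wrap the steps above in a small reusable function. This refactoring, including the name plot_web_traffic and its optional fname parameter, is our own sketch and not from the original code:

def plot_web_traffic(x, y, fname=None):
    # Reusable version of the plotting steps above
    plt.figure(figsize=(8, 6))
    plt.scatter(x, y, s=10)
    plt.title("Web traffic over the last month")
    plt.xlabel("Time")
    plt.ylabel("Hits/hour")
    plt.xticks([w * 7 * 24 for w in range(10)],
               ['week %i' % w for w in range(10)])
    plt.autoscale(tight=True)
    plt.grid(True, linestyle='-', color='0.75')
    if fname:
        plt.savefig(fname)  # write the chart to disk instead of showing it
    else:
        plt.show()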
Choosing the right model and learning algorithm

Now that we have a first impression of the data, we return to the initial question: how long will our server handle the incoming web traffic? To answer this, we have to do the following:

1. Find the real model behind the noisy data points.
2. Following this, use the model to extrapolate into the future to find the point in time where our infrastructure has to be extended.

Before building our first model…

When we talk about models, you can think of them as simplified theoretical approximations of complex reality. As such, there is always some inferiority involved, also called the approximation error. This error will guide us in choosing the right model among the myriad of choices we have.
And this error will be calculated as the squared distance of the model's prediction to the real data; for example, for a learned model function f, the error is calculated as follows:

def error(f, x, y):
    return sp.sum((f(x)-y)**2)

The vectors x and y contain the web stats data that we have extracted earlier. It is the beauty of SciPy's vectorized functions that we exploit here with f(x): the trained model is assumed to take a vector and return the results again as a vector of the same size, so that we can use it to calculate the difference to y.
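As a quick sanity check, error() can be exercised on made-up values; the toy vectors and the identity model below are ours, purely for illustration:

>>> x_toy = sp.array([1.0, 2.0, 3.0])
>>> y_toy = sp.array([1.25, 1.75, 3.5])
>>> f_identity = lambda v: v          # hypothetical model predicting y = x
>>> error(f_identity, x_toy, y_toy)   # 0.0625 + 0.0625 + 0.25
0.375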
Starting with a simple straight line

Let's assume for a second that the underlying model is a straight line. Then the challenge is how to best put that line into the chart so that it results in the smallest approximation error. SciPy's polyfit() function does exactly that. Given data x and y and the desired order of the polynomial (a straight line has order 1), it finds the model function that minimizes the error function defined earlier:

fp1, residuals, rank, sv, rcond = sp.polyfit(x, y, 1, full=True)

The polyfit() function returns the parameters of the fitted model function, fp1. And by setting full=True, we also get additional background information on the fitting process.
Of this, only residuals are of interest, which is exactly the error of the approximation:

>>> print("Model parameters: %s" % fp1)
Model parameters: [   2.59619213  989.02487106]
>>> print(residuals)
[  3.17389767e+08]

This means the best straight line fit is the following function:

f(x) = 2.59619213 * x + 989.02487106

We then use poly1d() to create a model function from the model parameters:

>>> f1 = sp.poly1d(fp1)
>>> print(error(f1, x, y))
317389767.34

We have used full=True to retrieve more details on the fitting process. Normally, we would not need it, in which case only the model parameters would be returned. We can now use f1() to plot our first trained model.
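Here is a minimal sketch of what that plot could look like; it reuses x, y, and f1 from above, while the 1,000-point grid and the styling are our own choices:

>>> fx = sp.linspace(0, x[-1], 1000)   # evenly spaced x-values for drawing the line
>>> plt.scatter(x, y, s=10)            # the raw data, as before
>>> plt.plot(fx, f1(fx), linewidth=4)  # the fitted straight line on top
>>> plt.legend(["d=%i" % f1.order], loc="upper left")
>>> plt.show()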