The code there is also wrapped into a function that can be applied to other datasets. The Apriori algorithm returns frequent itemsets, that is, baskets that are present above a certain threshold (given by the minsupport variable in the code).

Association rule mining

Frequent itemsets are not very useful by themselves. The next step is to build association rules. Because of this final goal, the whole field of basket analysis is sometimes called association rule mining. An association rule is a statement of the type "If X, then Y", for example, "if a customer bought War and Peace, then they will buy Anna Karenina".
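As a reminder of what that first step produces, here is a minimal, didactic sketch of Apriori-style frequent itemset mining. It is not the implementation from the earlier section; the names dataset (a list of sets of item IDs) and minsupport (the minimum number of transactions an itemset must appear in) are assumptions carried over from that code.

    from collections import defaultdict

    def apriori(dataset, minsupport):
        # Didactic sketch: count single items first, then repeatedly grow
        # frequent itemsets by one item, keeping only those that appear in
        # at least minsupport transactions.
        counts = defaultdict(int)
        for transaction in dataset:
            for item in transaction:
                counts[frozenset([item])] += 1
        frequent = {s for s, c in counts.items() if c >= minsupport}
        result = set(frequent)
        k = 1
        while frequent:
            k += 1
            # Candidate k-itemsets are unions of two frequent (k-1)-itemsets
            candidates = {a | b for a in frequent for b in frequent
                          if len(a | b) == k}
            counts = defaultdict(int)
            for transaction in dataset:
                for candidate in candidates:
                    if candidate.issubset(transaction):
                        counts[candidate] += 1
            frequent = {s for s, c in counts.items() if c >= minsupport}
            result |= frequent
        return result

Only itemsets with at least two items can be turned into rules; those would play the role of the freqsets variable that the rule-mining code further below iterates over.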
Note that the rule is not deterministic (not all customers who buy X will buy Y), but it is rather cumbersome to always spell it out: "if a customer bought X, they are more likely than baseline to buy Y"; thus, we say "if X, then Y", but we mean it in a probabilistic sense.

Interestingly, both the antecedent and the conclusion may contain multiple objects: customers who bought X, Y, and Z also bought A, B, and C. Multiple antecedents may allow you to make more specific predictions than are possible from a single item. You can get from a frequent set to a rule by just trying all the possible combinations of X implies Y.
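To make "all the possible combinations" concrete, the following hypothetical helper splits a single frequent itemset into every candidate antecedent/consequent pair with both sides non-empty; the chapter's own code below restricts itself to single-item consequents:

    from itertools import combinations

    def candidate_rules(itemset):
        # Yield every (antecedent, consequent) split of a frequent itemset,
        # with both sides non-empty.
        items = frozenset(itemset)
        for size in range(1, len(items)):
            for antecedent in combinations(items, size):
                antecedent = frozenset(antecedent)
                yield antecedent, items - antecedent

    # A three-item basket already yields six candidate rules:
    for antecedent, consequent in candidate_rules({'X', 'Y', 'Z'}):
        print(sorted(antecedent), '->', sorted(consequent))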
It is easy to generate many of these rules. However, you only want to have valuable rules. Therefore, we need to measure the value of a rule. A commonly used measure is called the lift. The lift is the ratio between the probability obtained by applying the rule and the baseline, as in the following formula:

    lift(X -> Y) = P(Y|X) / P(Y)

In the preceding formula, P(Y) is the fraction of all the transactions that include Y, while P(Y|X) is the fraction of transactions that include Y, given that they also include X. Using the lift helps avoid the problem of recommending bestsellers; for a bestseller, both P(Y) and P(Y|X) will be large. Therefore, the lift will be close to one and the rule will be deemed irrelevant. In practice, we wish to have values of lift of at least 10, perhaps even 100.

Refer to the following code:

>>> minlift = 5.0
>>> nr_transactions = float(len(dataset))
>>> for itemset in freqsets:
...     for item in itemset:
...         consequent = frozenset([item])
...         antecedent = itemset - consequent
...         base = 0.0
...         # acount: antecedent count
...         acount = 0.0
...         # ccount: consequent count
...         ccount = 0.0
...         for d in dataset:
...             if item in d: base += 1
...             if d.issuperset(itemset): ccount += 1
...             if d.issuperset(antecedent): acount += 1
...         base /= nr_transactions
...         p_y_given_x = ccount / acount
...         lift = p_y_given_x / base
...         if lift > minlift:
...             print('Rule {0} -> {1} has lift {2}'.format(
...                 antecedent, consequent, lift))

Some of the results are shown in the following table.
The counts are the number of transactions which include the consequent alone (that is, the base rate at which that product is bought), all the items in the antecedent, and all the items in the antecedent and the consequent:

Antecedent          | Consequent | Consequent count   | Antecedent count | Antecedent & consequent count | Lift
1,378, 1,379, 1,380 | 1,269      | 279 (0.3 percent)  | 80               | 57                            | 225
48, 41, 976         | 117        | 1026 (1.1 percent) | 122              | 51                            | 35
48, 41, 16,011      | 16,010     | 1316 (1.5 percent) | 165              | 159                           | 64

We can see, for example, that there were 80 transactions in which 1,378, 1,379, and 1,380 were bought together. Of these, 57 also included 1,269, so the estimated conditional probability is 57/80 ≈ 71 percent. Compared to the fact that only 0.3 percent of all transactions included 1,269, this gives us a lift of 225.

The need to have a decent number of transactions in these counts in order to be able to make relatively solid inferences is why we must first select frequent itemsets. If we were to generate rules from an infrequent itemset, the counts would be very small; due to this, the relative values would be meaningless (or subject to very large error bars).

Note that there are many more association rules discovered from this dataset: the algorithm discovers 1,030 rules (requiring support for the baskets of at least 80 and a minimum lift of 5).
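As a sanity check, the lift in the first row of the table can be recomputed directly from its counts. The total of 88,162 transactions is an assumption here, taken from the retail dataset used earlier in the chapter; substitute float(len(dataset)) if you are following along with a different file.

    # Recompute the lift for the first row of the table above.
    nr_transactions = 88162.0   # assumed dataset size; use float(len(dataset)) in practice
    consequent_count = 279      # transactions containing item 1,269
    antecedent_count = 80       # transactions containing 1,378, 1,379, and 1,380
    both_count = 57             # transactions containing all four items

    p_y_given_x = both_count / antecedent_count   # about 0.71
    base = consequent_count / nr_transactions     # about 0.003
    print('lift = {:.0f}'.format(p_y_given_x / base))   # roughly 225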
Our dataset is still small when compared to what is now possible with the web. With datasets containing millions of transactions, you can expect to generate many thousands of rules, even millions.

However, for each customer or each product, only a few rules will be relevant at any given time. So each customer only receives a small number of recommendations.

More advanced basket analysis

There are now other algorithms for basket analysis that run faster than Apriori.
The code we saw earlier was simple and it was good enough for us, as we only had circa 100 thousand transactions. If we had many millions, it might be worthwhile to use a faster algorithm. Note, though, that learning association rules can often be done offline, where efficiency is not as great a concern.

There are also methods to work with temporal information, leading to rules that take into account the order in which you have made your purchases. Consider, as an example, that someone buying supplies for a large party may come back for trash bags.
Thus, it may make sense to propose trash bags on the first visit. However, it would not make sense to propose party supplies to everyone who buys a trash bag.

Summary

In this chapter, we started by using regression for rating predictions. We saw a couple of different ways in which to do so, and then combined them all in a single prediction by learning a set of weights. This technique, ensemble learning (in particular, stacked learning), is a general one that can be used in many situations, not just for regression.
It allows you to combine different ideas even if their internal mechanics are completely different; you can combine their final outputs.

In the second half of the chapter, we switched gears and looked at another mode of producing recommendations: shopping basket analysis, or association rule mining. In this mode, we try to discover (probabilistic) association rules of the "customers who bought X are likely to be interested in Y" form. This takes advantage of the data that is generated from sales alone, without requiring users to numerically rate items. This is not available in scikit-learn at this moment, so we wrote our own code.

Association rule mining needs to be careful not to simply recommend bestsellers to every user (otherwise, what is the point of personalization?). In order to do this, we learned about measuring the value of rules in relation to the baseline, using a measure called the lift of a rule.

At this point in the book, we have seen the major modes of machine learning: classification and regression.
In the next two chapters, we will look at techniques used for two specific kinds of data: music and images. Our first goal will be to build a music genre classifier.

Classification – Music Genre Classification

So far, we have had the luxury that every training data instance could easily be described by a vector of feature values. In the Iris dataset, for example, the flowers are represented by vectors containing values for the length and width of certain aspects of a flower. In the text-based examples, we could transform the text into a bag of words representation and manually craft our own features that captured certain aspects of the texts.

It will be different in this chapter, when we try to classify songs by their genre.
How would we, for instance, represent a three-minute-long song? Should we take the individual bits of its MP3 representation? Probably not, since treating it like a text and creating something like a "bag of sound bites" would certainly be way too complex. Somehow, we will, nevertheless, have to convert a song into a series of values that describe it sufficiently.

Sketching our roadmap

This chapter will show how we can come up with a decent classifier in a domain that is outside our comfort zone.
For one, we will have to use sound-based features, which are much more complex than the text-based ones we have used before. And then we will learn how to deal with multiple classes, whereas we have only encountered binary classification problems up to now. In addition, we will get to know new ways of measuring classification performance.

Let us assume a scenario in which, for some reason, we find a bunch of randomly named MP3 files on our hard disk, which are assumed to contain music.