Building Machine Learning Systems with Python (Chapter 8: Recommendations)
It is quite typical that combining methods is a simple way to obtain a small performance boost, but that the results are not earth-shattering. By having a flexible way to combine multiple methods, we can simply try any idea we wish by adding it into the mix of learners and letting the system fold it into the prediction. We can, for example, replace the neighborhood criterion in the nearest neighbor code. However, we do have to be careful not to overfit our dataset.
In fact, if we randomly try too many things, some of them will work well on this dataset, but will not generalize. Even though we are splitting our data, we are not rigorously cross-validating our design decisions. In order to have a good estimate, and if data is plentiful, you should leave a portion of the data untouched until you have a final model that is about to go into production. Then, testing your model on this held-out data gives you an unbiased prediction of how well you should expect it to work in the real world.

Basket analysis

The methods we have discussed so far work well when you have numeric ratings of how much a user liked a product. This type of information is not always available, as it requires active behavior on the part of consumers.

Basket analysis is an alternative mode of learning recommendations.
In this mode, our data consists only of which items were bought together; it does not contain any information on whether individual items were enjoyed or not. Even if users sometimes buy items they regret, on average, knowing their purchases gives you enough information to build good recommendations. It is often easier to get this data than rating data, as many users will not provide ratings, while basket data is generated as a side effect of shopping. The following screenshot shows a snippet of Amazon.com's web page for Tolstoy's classic book War and Peace, which demonstrates a common way to use these results:

This mode of learning is not only applicable to actual shopping baskets, naturally. It is applicable in any setting where you have groups of objects together and need to recommend another. For example, recommending additional recipients to a user writing an e-mail is done by Gmail and could be implemented using similar techniques (we do not know what Gmail uses internally; perhaps they combine multiple techniques, as we did earlier).
Or, we could use these methods to develop an app to recommend web pages to visit based on your browsing history. Even if we are handling purchases, it may make sense to group all purchases by a customer into a single basket, independently of whether the items were bought together or in separate transactions. This depends on the business context, but keep in mind that the techniques are flexible and can be useful in many settings.

Beer and diapers. One of the stories that is often mentioned in the context of basket analysis is the diapers and beer story. It says that when supermarkets first started to look at their data, they found that diapers were often bought together with beer. Supposedly, it was the father who would go out to the supermarket to buy diapers and then would pick up some beer as well.
There has been much discussion of whether this is true or just an urban myth. In this case, it seems that it is true. In the early 1990s, Osco Drug did discover that in the early evening, beer and diapers were bought together, and it did surprise the managers, who had, until then, never considered these two products to be similar. What is not true is that this led the store to move the beer display closer to the diaper section. Also, we have no idea whether it was really fathers who were buying beer and diapers together more than mothers (or grandparents).

Obtaining useful predictions

It is not just "customers who bought X also bought Y", even though that is how many online retailers phrase it (see the Amazon.com screenshot given earlier); a real system cannot work like this.
Why not? Because such a system would get fooled by very frequently bought items and would simply recommend whatever is popular, without any personalization.

For example, at a supermarket, many customers buy bread every time they shop, or close to it (for the sake of argument, let us say that 50 percent of visits include bread). So, if you focus on any particular item, say dishwasher soap, and look at what is frequently bought with dishwasher soap, you might find that bread is frequently bought with soap.
In fact, just by random chance, 50 percent of the time someone buys dishwasher soap, they buy bread. However, bread is frequently bought with anything else, just because everybody buys bread very often. What we are really looking for is "customers who bought X are statistically more likely to buy Y than the average customer who has not bought X". If you buy dishwasher soap, you are likely to buy bread, but not more so than the baseline.
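This baseline comparison is just a conditional probability divided by a marginal probability, a ratio often called lift. As a minimal sketch (the baskets and item names below are made up purely for illustration):

```python
# Sketch of the baseline comparison: P(buys Y | bought X) versus P(buys Y).
# The baskets below are made up for illustration.
baskets = [
    {"bread", "soap"},
    {"bread", "milk"},
    {"bread"},
    {"soap", "sponge"},
    {"milk"},
    {"bread", "soap", "milk"},
]

def lift(baskets, x, y):
    """How much more likely y is, given that x was bought."""
    with_x = [b for b in baskets if x in b]
    p_y = sum(1 for b in baskets if y in b) / len(baskets)
    p_y_given_x = sum(1 for b in with_x if y in b) / len(with_x)
    return p_y_given_x / p_y

# Soap buyers buy bread exactly at the baseline rate here:
print(lift(baskets, "soap", "bread"))  # 1.0
```

A lift close to 1.0 means that buying X carries no information about Y; only values well above 1.0 make for interesting recommendations.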
Similarly, a bookstore that simply recommended bestsellers, no matter which books you had already bought, would not be doing a good job of personalizing recommendations.

Analyzing supermarket shopping baskets

As an example, we will look at a dataset consisting of anonymous transactions at a supermarket in Belgium. This dataset was made available by Tom Brijs at Hasselt University. Due to privacy concerns, the data has been anonymized, so we only have a number for each product, and a basket is a set of numbers. The data file is available from several online sources (including this book's companion website).

We begin by loading the dataset and looking at some statistics (this is always a good idea):

>>> from collections import defaultdict
>>> from itertools import chain
>>> # File is downloaded as a compressed file
>>> import gzip
>>> # file format is a line per transaction
>>> # of the form '12 34 342 5...'
>>> dataset = [[int(tok) for tok in line.strip().split()]
...            for line in gzip.open('retail.dat.gz', 'rt')]
>>> # It is more convenient to work with sets
>>> dataset = [set(d) for d in dataset]
>>> # count how often each product was purchased:
>>> counts = defaultdict(int)
>>> for elem in chain(*dataset):
...     counts[elem] += 1

(Note that we open the file in text mode, 'rt', so that each token is a string that int can parse.)

We can see the resulting counts summarized in the following table:

# of times bought    # of products
Just once            2,224
2 or 3               2,438
4 to 7               2,508
8 to 15              2,251
16 to 31             2,182
32 to 63             1,940
64 to 127            1,523
128 to 511           1,225
512 or more          179

There are many products that have only been bought a few times.
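A bucketed summary like the table above can be computed from the counts dictionary with a sketch along these lines (the bucket boundaries are taken from the table; the toy counts here stand in for the real data):

```python
from collections import Counter

# Toy purchase counts per product; with the real data, this would be
# the `counts` dictionary built from retail.dat.gz.
counts = {101: 1, 102: 3, 103: 6, 104: 12, 105: 700}

# Bucket boundaries matching the table
bins = [(1, 1), (2, 3), (4, 7), (8, 15), (16, 31),
        (32, 63), (64, 127), (128, 511), (512, float('inf'))]

summary = Counter()
for c in counts.values():
    for lo, hi in bins:
        if lo <= c <= hi:
            summary[(lo, hi)] += 1
            break

for (lo, hi), n in sorted(summary.items()):
    print(lo, hi, n)
```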
For example, 33 percent of products were bought four or fewer times. However, these represent only 1 percent of purchases. This phenomenon, that many products are only purchased a small number of times, is sometimes labeled the long tail, and it has only become more prominent as the Internet made it cheaper to stock and sell niche items. In order to be able to provide recommendations for these products, we would need a lot more data.

There are a few open source implementations of basket analysis algorithms out there, but none that are well integrated with scikit-learn or any of the other packages we have been using.
Therefore, we are going to implement one classic algorithm ourselves. This algorithm is called the Apriori algorithm, and it is a bit old (it was published in 1994 by Rakesh Agrawal and Ramakrishnan Srikant), but it still works (algorithms, of course, never stop working; they just get superseded by better ideas).

Formally, the Apriori algorithm takes a collection of sets (that is, your shopping baskets) and returns sets that are very frequent as subsets (that is, items that together are part of many shopping baskets).

The algorithm works using a bottom-up approach: starting with the smallest candidates (those composed of one single element), it builds up, adding one element at a time.
Formally, the algorithm takes a set of baskets and the minimum support that should be considered (a parameter we will call minsupport). The first step is to consider all one-element itemsets with minimal support. Then, these are combined in all possible ways to build up two-element itemsets. These are filtered in order to keep only those that have minimal support. Then, all possible three-element itemsets are considered and those with minimal support are kept, and so on. The trick of Apriori is that when building a larger itemset, it only needs to consider those that are built up of smaller frequent sets.

The following diagram presents a schematic view of the algorithm:

We shall now implement this algorithm in code.
We need to define the minimum support we are looking for:

>>> minsupport = 80

Support is the number of times that a set of products was purchased together. The goal of Apriori is to find itemsets with high support. Logically, any itemset with more than minimal support can only be composed of items which themselves have at least minimal support:

>>> valid = set(k for k, v in counts.items()
...             if (v >= minsupport))

Our initial itemsets are singletons (sets with a single element).
In particular, all singletons that have at least minimal support are frequent itemsets:

>>> itemsets = [frozenset([v]) for v in valid]

Now, our loop is given as follows:

>>> freqsets = []
>>> for i in range(16):
...     nextsets = []
...     tested = set()
...     for it in itemsets:
...         for v in valid:
...             if v not in it:
...                 # Create a new candidate set by adding v to it
...                 c = (it | frozenset([v]))
...                 # Check if we have tested it already
...                 if c in tested:
...                     continue
...                 tested.add(c)
...                 # Count support by looping over dataset
...                 # This step is slow.
...                 # Check `apriori.py` for a better implementation.
...                 support_c = sum(1 for d in dataset if d.issuperset(c))
...                 if support_c > minsupport:
...                     nextsets.append(c)
...     freqsets.extend(nextsets)
...     itemsets = nextsets
...     if not len(itemsets):
...         break
>>> print("Finished!")
Finished!

This works correctly, but it is slow.
A better implementation has more infrastructure to avoid having to loop over the whole dataset to get the counts (support_c). In particular, we keep track of which shopping baskets contain which frequent itemsets. This accelerates the loop, but makes the code harder to follow. Therefore, we will not show it here. As usual, you can find both implementations on this book's companion website.
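The idea behind that speed-up can be sketched as follows (this is not the companion apriori.py; it is a minimal illustration on a toy dataset, and it uses >= minsupport rather than > for simplicity): for each frequent itemset, remember the indices of the baskets that contain it. Extending an itemset by one item then only requires intersecting two index sets, instead of a pass over the whole dataset:

```python
from collections import defaultdict

# Toy dataset; with the real data this would be the list of basket sets.
dataset = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3, 4}, {1, 3}]
minsupport = 2

# For each single item, the indices of the baskets containing it
item_baskets = defaultdict(set)
for i, d in enumerate(dataset):
    for item in d:
        item_baskets[item].add(i)

# Frequent singletons, each paired with its basket-index set
frequent = {frozenset([it]): bs
            for it, bs in item_baskets.items() if len(bs) >= minsupport}

itemsets = dict(frequent)
while itemsets:
    nextsets = {}
    for it, baskets in itemsets.items():
        for v, vbaskets in item_baskets.items():
            if v not in it:
                c = it | frozenset([v])
                if c in nextsets:
                    continue
                # Set intersection replaces the slow dataset scan
                support = baskets & vbaskets
                if len(support) >= minsupport:
                    nextsets[c] = support
    frequent.update(nextsets)
    itemsets = nextsets

print(len(frequent))  # 7 frequent itemsets in this toy dataset
```

The trade-off is memory: we keep an index set per frequent itemset, but each candidate is now tested in time proportional to the size of two index sets rather than the whole dataset.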