The tricky part is the calculation of the likelihood P(F1, F2|C). It is the value describing how likely it is to see the feature values F1 and F2 if we know that the class of the data instance is C. To estimate this, we need to do some thinking.

Being naïve

From probability theory, we also know the following relationship:

    P(F1, F2|C) = P(F1|C) · P(F2|C, F1)

This alone, however, does not help much, since we treat one difficult problem (estimating P(F1, F2|C)) with another one (estimating P(F2|C, F1)).

However, if we naïvely assume that F1 and F2 are independent of each other, P(F2|C, F1) simplifies to P(F2|C) and we can write it as follows:

    P(F1, F2|C) = P(F1|C) · P(F2|C)

Putting everything together, we get the quite manageable formula:

    P(C|F1, F2) = P(C) · P(F1|C) · P(F2|C) / P(F1, F2)

The interesting thing is that although it is not theoretically correct to simply tweak our assumptions when we are in the mood to do so, in this case it proves to work astonishingly well in real-world applications.

Using Naïve Bayes to classify

Given a new tweet, the only part left is to simply calculate the probabilities:

    P(C="pos"|F1, F2) = P(C="pos") · P(F1|C="pos") · P(F2|C="pos") / P(F1, F2)
    P(C="neg"|F1, F2) = P(C="neg") · P(F1|C="neg") · P(F2|C="neg") / P(F1, F2)

Then we choose the class c_best having the higher probability. As the denominator, P(F1, F2), is the same for both classes, we can simply ignore it without changing the winner class.
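To make the decision rule concrete, here is a minimal sketch in plain Python (our own illustration, not code from the book); it simply compares the two unnormalized posteriors and ignores the shared denominator:

# Pick the class with the larger value of P(C) * P(F1|C) * P(F2|C);
# prior, p_f1, and p_f2 are dictionaries keyed by class, e.g. "pos" and "neg".
def choose_class(prior, p_f1, p_f2):
    scores = {c: prior[c] * p_f1[c] * p_f2[c] for c in prior}
    return max(scores, key=scores.get)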
Note, however, that we don't calculate any real probabilities anymore. Instead, we are estimating which class is more likely given the evidence. This is another reason why Naïve Bayes is so robust: it is not so much interested in the real probabilities, but only in the information of which class is more likely. In short, we can write:

    c_best = argmax_c P(C=c) · P(F1|C=c) · P(F2|C=c)

This is simply telling us that we calculate the part after argmax for all classes of C (pos and neg in our case) and return the class that results in the highest value.

But for the following example, let's stick to real probabilities and do some calculations to see how Naïve Bayes works. For the sake of simplicity, we will assume that Twitter allows only the two aforementioned words, "awesome" and "crazy", and that we had already manually classified a handful of tweets:

    Tweet            Class
    awesome          Positive tweet
    awesome          Positive tweet
    awesome crazy    Positive tweet
    crazy            Positive tweet
    crazy            Negative tweet
    crazy            Negative tweet

In this example, we have the tweet "crazy" both in a positive and a negative tweet to emulate some ambiguities you will often find in the real world (for example, "being soccer crazy" versus "a crazy idiot").

In this case, we have six total tweets, out of which four are positive and two negative, which results in the following priors:

    P(C="pos") = 4/6 ≈ 0.67
    P(C="neg") = 2/6 ≈ 0.33

This means that, without knowing anything about the tweet itself, it would be wise to assume the tweet to be positive.

A still missing piece is the calculation of P(F1|C) and P(F2|C), which are the probabilities for the two features F1 and F2 conditioned on class C. This is calculated as the number of tweets in which we have seen the concrete feature, divided by the number of tweets that have been labeled with the class C. Let's say we want to know the probability of seeing "awesome" in a tweet, knowing that its class is positive; we will have:

    P(F1=1|C="pos") = 3/4 = 0.75

because out of the four positive tweets, three contained the word "awesome". Obviously, the probability of not having "awesome" in a positive tweet is its inverse:

    P(F1=0|C="pos") = 1/4 = 0.25

Similarly, for the rest (omitting the case that a word does not occur in a tweet):

    P(F2=1|C="pos") = 2/4 = 0.5
    P(F1=1|C="neg") = 0/2 = 0
    P(F2=1|C="neg") = 2/2 = 1
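The counts above are easy to reproduce programmatically. Here is a rough sketch in plain Python (our own illustration; the names tweets, prior, and cond are not from the book) that recomputes the priors and conditional probabilities from the toy corpus:

# Recompute the priors and conditional probabilities from the six toy tweets.
tweets = [
    ("awesome", "pos"), ("awesome", "pos"), ("awesome crazy", "pos"),
    ("crazy", "pos"), ("crazy", "neg"), ("crazy", "neg"),
]
classes = ["pos", "neg"]
words = ["awesome", "crazy"]        # F1 tracks "awesome", F2 tracks "crazy"

# Priors: fraction of tweets belonging to each class
prior = {c: sum(1 for _, label in tweets if label == c) / len(tweets)
         for c in classes}

# Conditional probabilities: fraction of tweets of class c containing the word
cond = {}
for c in classes:
    class_tweets = [text for text, label in tweets if label == c]
    for w in words:
        cond[(w, c)] = sum(1 for t in class_tweets if w in t.split()) / len(class_tweets)

print(prior)   # {'pos': 0.666..., 'neg': 0.333...}
print(cond)    # for example, cond[('awesome', 'pos')] == 0.75 and cond[('awesome', 'neg')] == 0.0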
For the sake of completeness, we will also compute the evidence so that we can see real probabilities for the following example tweets. For two concrete values of F1 and F2, we can calculate the evidence as follows:

    P(F1, F2) = P(F1, F2|C="pos") · P(C="pos") + P(F1, F2|C="neg") · P(C="neg")

This leads to the following values:

    P(F1=1, F2=0) = 0.75 · 0.5 · 4/6 + 0 · 0 · 2/6 = 0.25
    P(F1=0, F2=1) = 0.25 · 0.5 · 4/6 + 1 · 1 · 2/6 ≈ 0.42
    P(F1=1, F2=1) = 0.75 · 0.5 · 4/6 + 0 · 1 · 2/6 = 0.25

Now we have all the data to classify new tweets. The only work left is to parse the tweet and featurize it:

    Tweet             F1  F2  Class probabilities                Classification
    "awesome"         1   0   P(C="pos"|F1=1, F2=0) = 1.0        Positive
                              P(C="neg"|F1=1, F2=0) = 0.0
    "crazy"           0   1   P(C="pos"|F1=0, F2=1) = 0.2        Negative
                              P(C="neg"|F1=0, F2=1) = 0.8
    "awesome crazy"   1   1   P(C="pos"|F1=1, F2=1) = 1.0        Positive
                              P(C="neg"|F1=1, F2=1) = 0.0
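These numbers can also be verified in code. The following sketch continues the previous one (reusing the prior, cond, and classes variables defined there) and evaluates the full Bayes formula for the three example tweets:

# Classify the three example tweets with the toy model estimated above.
def featurize(text):
    tokens = text.split()
    return int("awesome" in tokens), int("crazy" in tokens)    # (F1, F2)

def joint(f1, f2, c):
    # P(C=c) * P(F1=f1|C=c) * P(F2=f2|C=c) for binary features
    p_f1 = cond[("awesome", c)] if f1 else 1 - cond[("awesome", c)]
    p_f2 = cond[("crazy", c)] if f2 else 1 - cond[("crazy", c)]
    return prior[c] * p_f1 * p_f2

for text in ["awesome", "crazy", "awesome crazy"]:
    f1, f2 = featurize(text)
    scores = {c: joint(f1, f2, c) for c in classes}
    evidence = sum(scores.values())                   # P(F1, F2)
    posterior = {c: scores[c] / evidence for c in classes}
    print(text, posterior, max(posterior, key=posterior.get))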
So far, so good. The classification of these trivial tweets seems to assign correct labels to the tweets. The question remains, however, how we should treat words that did not occur in our training corpus. After all, with the preceding formula, new words will always be assigned a probability of zero.

Accounting for unseen words and other oddities

When we calculated the probabilities earlier, we actually cheated ourselves. We were not calculating the real probabilities, but only rough approximations by means of the fractions.
We assumed that the training corpus would tell us the whole truth about the real probabilities. It did not. A corpus of only six tweets obviously cannot give us all the information about every tweet that has ever been written. For example, there certainly are tweets containing the word "text" in them; it is only that we have never seen them. Apparently, our approximation is very rough and we should account for that. This is often done in practice with so-called add-one smoothing.

    Add-one smoothing is sometimes also referred to as additive smoothing or Laplace smoothing.
    Note that Laplace smoothing has nothing to do with Laplacian smoothing, which is related to the smoothing of polygon meshes. If we do not smooth by 1 but by an adjustable parameter alpha > 0, it is called Lidstone smoothing.

It is a very simple technique that adds one to all feature occurrences. It has the underlying assumption that even if we have not seen a given word in the whole corpus, there is still a chance that our sample of tweets just happened not to include that word. So, with add-one smoothing we pretend that we have seen every occurrence once more than we actually did. That means that instead of calculating P(F1=1|C="pos") = 3/4, we now calculate P(F1=1|C="pos") = (3+1)/(4+2).

Why do we add 2 in the denominator? Because we have two features: the occurrence of "awesome" and "crazy".
Since we add 1 for each feature, we have to make sure that the end result is again a probability. And indeed, we get 1 as the total probability:

    P(F1=1|C="pos") + P(F1=0|C="pos") = (3+1)/(4+2) + (1+1)/(4+2) = 1
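In code, add-one smoothing is just a matter of adjusting the counts. The following rough sketch (our own; the names counts and totals simply mirror the table of six tweets above) reproduces the smoothed estimates:

# Add-one (Laplace) smoothing for the toy counts.
counts = {"pos": {"awesome": 3, "crazy": 2},
          "neg": {"awesome": 0, "crazy": 2}}    # tweets of each class containing the word
totals = {"pos": 4, "neg": 2}                   # tweets per class
alpha = 1                                       # add-one; any alpha > 0 gives Lidstone smoothing
denominator_extra = 2                           # the 2 added to the denominator, as explained above

def smoothed(word, c):
    return (counts[c][word] + alpha) / (totals[c] + alpha * denominator_extra)

print(smoothed("awesome", "pos"))   # (3+1)/(4+2) = 0.666...
print(smoothed("awesome", "neg"))   # (0+1)/(2+2) = 0.25, no longer zero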
Accounting for arithmetic underflows

There is yet another roadblock. In reality, we work with probabilities much smaller than the ones we have dealt with in the toy example. Typically, we also have many more than only two features, which we multiply with each other. This will quickly lead to the point where the accuracy provided by NumPy does not suffice anymore:

>>> import numpy as np
>>> np.set_printoptions(precision=20)  # tell NumPy to print out more digits (the default is 8)
>>> np.array([2.48E-324])
array([ 4.94065645841246544177e-324])
>>> np.array([2.47E-324])
array([ 0.])

So, how probable is it that we will ever hit a number like 2.47E-324? To answer this, we just need to imagine a likelihood for the conditional probabilities of 0.00001, and then multiply 65 of them together (meaning that we have 65 low-probability feature values), and you've been hit by arithmetic underflow:

>>> x = 0.00001
>>> x**64   # still fine
1e-320
>>> x**65   # ouch
0.0

A float in Python is typically implemented using double in C.
To find out whether this is the case for your platform, you can check it as follows:

>>> import sys
>>> sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

To mitigate this, one could switch to math libraries such as mpmath (http://code.google.com/p/mpmath/) that allow for arbitrary precision. However, they are not fast enough to work as a NumPy replacement.

Fortunately, there is a better way to take care of this, and it has to do with a nice relationship that we might still remember from school:

    log(x · y) = log(x) + log(y)

If we apply it to our case, we get the following:

    log( P(C) · P(F1|C) · P(F2|C) ) = log P(C) + log P(F1|C) + log P(F2|C)

As the probabilities are in the interval between 0 and 1, the logs of the probabilities lie in the interval between -∞ and 0. Don't be bothered by that.
Higher numbers are still a stronger indicator for the correct class; it is only that they are negative now.

There is one caveat though: we actually don't have the log in the formula's numerator (the part above the fraction line). We only have the product of the probabilities. In our case, luckily, we are not interested in the actual value of the probabilities. We simply want to know which class has the highest posterior probability.
We are lucky because, if we find that P(C="pos"|F1, F2) > P(C="neg"|F1, F2), then we will always also have log P(C="pos"|F1, F2) > log P(C="neg"|F1, F2).

A quick look at the graph of the logarithm function shows that the curve is monotonically increasing, that is, it never goes down when we go from left to right.
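A two-line check (our own, not from the book) makes the same point: taking the log changes the values, but not which one is the largest.

import numpy as np

# The log keeps the ordering of probabilities, so argmax is unaffected.
p = np.array([0.8, 0.2])
print(p.argmax(), np.log(p).argmax())   # 0 0, same winner before and after the log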
So let's stick this into the aforementioned formula:

    c_best = argmax_c log( P(C=c) · P(F1|C=c) · P(F2|C=c) )

This will finally retrieve the formula for two features that will give us the best class, also for the real-world data that we will see in practice:

    c_best = argmax_c ( log P(C=c) + log P(F1|C=c) + log P(F2|C=c) )

Of course, we will not be very successful with only two features, so let's rewrite it to allow for an arbitrary number of features:

    c_best = argmax_c ( log P(C=c) + Σ_k log P(F_k|C=c) )
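To see this log-space rule in action, here is a rough NumPy sketch (our own; the conditional probabilities are the smoothed toy values for "awesome" and "crazy" plus two made-up rare features, so the exact numbers are only illustrative):

import numpy as np

log_prior = np.log(np.array([4/6, 2/6]))        # classes: [pos, neg]
# cond_prob[c, k] = P(F_k=1 | C=c): smoothed toy values for "awesome" and
# "crazy", plus two hypothetical rare features to make the products tiny.
cond_prob = np.array([[4/6, 3/6, 0.001, 0.001],
                      [1/4, 3/4, 0.001, 0.001]])

def predict(features):
    # log P(C=c) + sum_k log P(F_k|C=c); for absent features we use
    # log(1 - P(F_k=1|C=c)), as in the Bernoulli model.
    log_lik = np.where(features == 1,
                       np.log(cond_prob),
                       np.log(1 - cond_prob)).sum(axis=1)
    return np.argmax(log_prior + log_lik)

print(predict(np.array([1, 0, 0, 0])))          # 0, that is, "pos"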
There we are, ready to use our first classifier from the scikit-learn toolkit. As mentioned earlier, we just learned the Bernoulli model of Naïve Bayes. Instead of having Boolean features, we can also use the number of word occurrences, also known as the Multinomial model. As this provides more information, and often also results in better performance, we will use it for our real-world data. Note, however, that the underlying formulas change a bit. No worries, though: the general idea of how Naïve Bayes works is still the same.

Creating our first classifier and tuning it

The Naïve Bayes classifiers reside in the sklearn.naive_bayes package.
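As a quick preview, training such a classifier takes only a few lines. This is a minimal sketch under the assumption that scikit-learn is installed, with the toy tweets encoded as word counts:

from sklearn.naive_bayes import MultinomialNB

# Word-count features for the six toy tweets; columns are ["awesome", "crazy"].
X = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 1]]
y = ["pos", "pos", "pos", "pos", "neg", "neg"]

clf = MultinomialNB(alpha=1.0)      # alpha=1.0 corresponds to add-one smoothing
clf.fit(X, y)
print(clf.predict([[1, 1]]))        # ['pos'] for "awesome crazy"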