PosScore and NegScore together will help us to determine the neutrality of the word, which is 1 - PosScore - NegScore. SynsetTerms lists all words in the set that are synonyms. We can safely ignore the ID and Description columns for our purposes.

The synset terms have a number appended, because some occur multiple times in different synsets.
For example, "fantasize" conveys two quite different meanings,which also leads to different scores:POSIDPosScoreNegScoreSynsetTermsDescriptionv016368590.3750fantasize#2fantasise#2Portray in the mind; "heis fantasizing the idealwife"v0163736800.125fantasy#1fantasize#1fantasise#1Indulge in fantasies; "he isfantasizing when he sayshe plans to start his owncompany"To find out which of the synsets to take, we will need to really understand themeaning of the tweets, which is beyond the scope of this chapter. The field ofresearch that is focusing on this challenge is called word-sense-disambiguation.For our task, we take the easy route and simply average the scores over all thesynsets, in which a term is found. For "fantasize", PosScore will be 0.1875 andNegScore will be 0.0625.The following function, load_sent_word_net(), does all that for us and returnsa dictionary where the keys are strings of the form word type/word, for example,n/implant, and the values are the positive and negative scores:import csv, collectionsdef load_sent_word_net():# making our life easier by using a dictionary that# automatically creates an empty list whenever we access# a not yet existing keysent_scores = collections.defaultdict(list)with open(os.path.join(DATA_DIR, SentiWordNet_3.0.0_20130122.txt"),"r") as csvfile:reader = csv.reader(csvfile, delimiter='\t',quotechar='"')[ 151 ]Classification II – Sentiment Analysisfor line in reader:if line[0].startswith("#"):continueif len(line)==1:continuePOS, ID, PosScore, NegScore, SynsetTerms, Gloss = lineif len(POS)==0 or len(ID)==0:continuefor term in SynsetTerms.split(" "):# drop number at the end of every termterm = term.split("#")[0]term = term.replace("-", " ").replace("_", " ")key = "%s/%s"%(POS, term.split("#")[0])sent_scores[key].append((float(PosScore),float(NegScore)))for key, value in sent_scores.items():sent_scores[key] = np.mean(value, axis=0)return sent_scoresOur first estimatorNow, we have everything in place to create our own first vectorizer.
Our first estimator

Now, we have everything in place to create our own first vectorizer. The most convenient way to do it is to inherit from BaseEstimator. It requires us to implement the following three methods:

• get_feature_names(): This returns a list of strings of the features that we will return in transform().
• fit(documents, y=None): As we are not implementing a classifier, we can ignore this one and simply return self.
• transform(documents): This returns numpy.array(), containing an array of shape (len(documents), len(get_feature_names())). This means, for every document in documents, it has to return a value for every feature name in get_feature_names().

Here is the implementation:

import nltk
import numpy as np
from sklearn.base import BaseEstimator

sent_word_net = load_sent_word_net()

class LinguisticVectorizer(BaseEstimator):
    def get_feature_names(self):
        return np.array(['sent_neut', 'sent_pos', 'sent_neg',
                         'nouns', 'adjectives', 'verbs', 'adverbs',
                         'allcaps', 'exclamation', 'question', 'hashtag',
                         'mentioning'])

    # we don't fit here but need to return the reference
    # so that it can be used like fit(d).transform(d)
    def fit(self, documents, y=None):
        return self

    def _get_sentiments(self, d):
        sent = tuple(d.split())
        tagged = nltk.pos_tag(sent)

        pos_vals = []
        neg_vals = []

        nouns = 0.
        adjectives = 0.
        verbs = 0.
        adverbs = 0.

        for w, t in tagged:
            p, n = 0, 0
            sent_pos_type = None
            if t.startswith("NN"):
                sent_pos_type = "n"
                nouns += 1
            elif t.startswith("JJ"):
                sent_pos_type = "a"
                adjectives += 1
            elif t.startswith("VB"):
                sent_pos_type = "v"
                verbs += 1
            elif t.startswith("RB"):
                sent_pos_type = "r"
                adverbs += 1

            if sent_pos_type is not None:
                sent_word = "%s/%s" % (sent_pos_type, w)
                if sent_word in sent_word_net:
                    p, n = sent_word_net[sent_word]

            pos_vals.append(p)
            neg_vals.append(n)

        l = len(sent)
        avg_pos_val = np.mean(pos_vals)
        avg_neg_val = np.mean(neg_vals)

        return [1 - avg_pos_val - avg_neg_val, avg_pos_val, avg_neg_val,
                nouns / l, adjectives / l, verbs / l, adverbs / l]

    def transform(self, documents):
        obj_val, pos_val, neg_val, nouns, adjectives, \
            verbs, adverbs = np.array([self._get_sentiments(d)
                                       for d in documents]).T

        allcaps = []
        exclamation = []
        question = []
        hashtag = []
        mentioning = []

        for d in documents:
            allcaps.append(np.sum([t.isupper()
                                   for t in d.split() if len(t) > 2]))
            exclamation.append(d.count("!"))
            question.append(d.count("?"))
            hashtag.append(d.count("#"))
            mentioning.append(d.count("@"))

        result = np.array([obj_val, pos_val, neg_val, nouns, adjectives,
                           verbs, adverbs, allcaps, exclamation, question,
                           hashtag, mentioning]).T

        return result
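To get a feel for what the vectorizer produces, here is a minimal sketch (the two example tweets are made up for illustration, and nltk's POS tagger models need to be installed, for example via nltk.download()):

vect = LinguisticVectorizer()

# two invented example tweets
tweets = ["I LOVE my new phone !!!",
          "@carrier why does the battery die so fast ?"]

X = vect.fit(tweets).transform(tweets)

print(X.shape)   # (2, 12): one row per tweet, one column per feature
print(vect.get_feature_names())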
Putting everything together

Nevertheless, using these linguistic features in isolation without the words themselves will not take us very far. Therefore, we have to combine the TfidfVectorizer with the linguistic features. This can be done with scikit-learn's FeatureUnion class. It is initialized in the same manner as Pipeline; however, instead of evaluating the estimators in a sequence, each passing the output of the previous one to the next, FeatureUnion evaluates them in parallel and joins the output vectors afterwards.

import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, FeatureUnion

def create_union_model(params=None):
    def preprocessor(tweet):
        tweet = tweet.lower()

        # emo_repl_order, emo_repl, and re_repl are the emoticon and
        # abbreviation replacement tables defined earlier in this chapter
        for k in emo_repl_order:
            tweet = tweet.replace(k, emo_repl[k])
        for r, repl in re_repl.items():
            tweet = re.sub(r, repl, tweet)

        return tweet.replace("-", " ").replace("_", " ")

    tfidf_ngrams = TfidfVectorizer(preprocessor=preprocessor,
                                   analyzer="word")
    ling_stats = LinguisticVectorizer()
    all_features = FeatureUnion([('ling', ling_stats),
                                 ('tfidf', tfidf_ngrams)])
    clf = MultinomialNB()
    pipeline = Pipeline([('all', all_features), ('clf', clf)])

    if params:
        pipeline.set_params(**params)

    return pipeline
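Because the FeatureUnion is registered as the step 'all' inside the pipeline, parameters of the nested estimators can be addressed with scikit-learn's double-underscore naming convention. The parameter values below are purely illustrative, not the ones tuned in this chapter:

# default model
pipeline = create_union_model()

# overriding nested parameters: 'all' is the FeatureUnion step,
# 'tfidf' is the TfidfVectorizer registered inside it
pipeline = create_union_model({
    "all__tfidf__ngram_range": (1, 2),
    "all__tfidf__min_df": 1,
})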
Training and testing on the combined featurizers gives another 0.4 percent improvement on average P/R AUC for positive versus negative:

== Pos vs. neg ==
0.810   0.023   0.890   0.025
== Pos/neg vs. irrelevant/neutral ==
0.791   0.007   0.691   0.022
== Pos vs. rest ==
0.890   0.011   0.529   0.035
== Neg vs. rest ==
0.883   0.007   0.617   0.033
time spent: 214.12578797340393

With these results, we probably do not want to use the positive versus rest and negative versus rest classifiers. Instead, we would first use a classifier to determine whether the tweet contains sentiment at all (pos/neg versus irrelevant/neutral) and then, in case it does, use the positive versus negative classifier to determine the actual sentiment.
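Such a two-stage setup could look roughly like the following sketch. The function name is invented for illustration, and clf_sentiment and clf_polarity stand for the two trained pipelines just described, assuming both were trained with label 1 for their respective positive class:

def predict_tweet_sentiment(tweet, clf_sentiment, clf_polarity):
    # first stage: does the tweet carry any sentiment at all?
    # (pos/neg versus irrelevant/neutral)
    if clf_sentiment.predict([tweet])[0] != 1:
        return "irrelevant/neutral"

    # second stage: positive versus negative
    if clf_polarity.predict([tweet])[0] == 1:
        return "positive"
    return "negative"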
Summary

Congratulations for sticking with us until the end! Together, we have learned how Naïve Bayes works and why it is not that naïve at all. Especially for training sets where we don't have enough data to learn all the niches in the class probability space, Naïve Bayes does a great job of generalizing. We learned how to apply it to tweets and that cleaning the rough tweet texts helps a lot. Finally, we realized that a bit of "cheating" (only after we have done our fair share of work) is okay, especially when it gives another improvement of the classifier's performance, as we experienced with the use of SentiWordNet.

In the next chapter, we will look at regression.

Regression

You probably learned about regression in your high school mathematics class. The specific method you learned was probably what is called ordinary least squares (OLS) regression. This 200-year-old technique is computationally fast and can be used for many real-world problems. This chapter will start by reviewing it and showing you how it is available in scikit-learn.

For some problems, however, this method is insufficient.
This is particularly true when we have many features, and it completely fails when we have more features than datapoints. For those cases, we need more advanced methods. These methods are very modern, with major developments happening in the last decade. They go by names such as Lasso, Ridge, or ElasticNets. We will go into these in detail.
They are also available in scikit-learn.

Predicting house prices with regression

Let's start with a simple problem, predicting house prices in Boston; a problem for which we can use a publicly available dataset. We are given several demographic and geographical attributes, such as the crime rate or the pupil-teacher ratio in the neighborhood. The goal is to predict the median value of a house in a particular area. As usual, we have some training data, where the answer is known to us.

This is one of the built-in datasets that scikit-learn comes with, so it is very easy to load the data into memory:

>>> from sklearn.datasets import load_boston
>>> boston = load_boston()

The boston object contains several attributes; in particular, boston.data contains the input data and boston.target contains the price of houses.

We will start with a simple one-dimensional regression, trying to regress the price on a single attribute, the average number of rooms per dwelling in the neighborhood, which is stored at position 5 (you can consult boston.DESCR and boston.feature_names for detailed information on the data):

>>> from matplotlib import pyplot as plt
>>> plt.scatter(boston.data[:,5], boston.target, color='r')

The boston.target attribute contains the average house price (our target variable). We can use the standard least squares regression you probably first saw in high school. Our first attempt looks like this:

>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()

We import LinearRegression from the sklearn.linear_model module and construct a LinearRegression object.
This object will behave analogously to the classifier objects from scikit-learn that we used earlier.

>>> import numpy as np
>>> x = boston.data[:,5]
>>> y = boston.target
>>> x = np.transpose(np.atleast_2d(x))
>>> lr.fit(x, y)
>>> y_predicted = lr.predict(x)

The only nonobvious line in this code block is the call to np.atleast_2d, which converts x from a one-dimensional to a two-dimensional array. This conversion is necessary as the fit method expects a two-dimensional array as its first argument. Finally, for the dimensions to work out correctly, we need to transpose this array.

Note that we are calling methods named fit and predict on the LinearRegression object, just as we did with classifier objects, even though we are now performing regression.
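To make the reshaping in the snippet above concrete, we can inspect the intermediate shapes (assuming the standard Boston dataset with 506 samples):

>>> boston.data[:,5].shape
(506,)
>>> np.atleast_2d(boston.data[:,5]).shape
(1, 506)
>>> np.transpose(np.atleast_2d(boston.data[:,5])).shape
(506, 1)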