part of speech, gender and animacy for nouns), mutable morphological features (mutable_gr, e.g. case and number for nouns);
● mixed trigrams with templates:
○ (lemmas, immutable_gr, immutable_gr),
○ (immutable_gr, lemmas, immutable_gr),
○ (immutable_gr, immutable_gr, lemmas).
For each feature type its frequency list was built and only the top 200 most frequent features were used. This restriction serves two aims: to alleviate the curse of dimensionality and to reduce the amount of noisy features. The vector representation of features for clustering and analysis was produced using a bag-of-words model over the first three words concatenated with the average over the whole definition.
A classical n-gram is an n-tuple of sequential elements from a single list. Similarly, given n different lists, let us define a mixed n-gram as an n-tuple of sequential elements in which the i-th element is taken from the i-th list at the i-th sequential position. In the linguistic domain, let us call the ordered set of lists used the n-gram template.
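To make the mixed n-gram templates above concrete, here is a minimal sketch of their extraction (the function name mixed_ngrams and the toy data are illustrative, not the dissertation's actual code); each template index selects which list the corresponding tuple position is drawn from:

```python
from typing import List, Tuple

def mixed_ngrams(lists: List[List[str]], template: Tuple[int, ...]) -> List[Tuple[str, ...]]:
    """Extract mixed n-grams: the i-th tuple element is taken from
    lists[template[i]] at the i-th position of a sliding window."""
    n = len(template)
    length = min(len(lst) for lst in lists)
    return [tuple(lists[template[i]][start + i] for i in range(n))
            for start in range(length - n + 1)]

# Token-parallel lists for one definition: lemmas and immutable grammatical features.
lemmas       = ["устройство", "для", "измерение", "время"]
immutable_gr = ["NOUN,neut", "PREP", "NOUN,neut", "NOUN,neut"]

# Template (lemmas, immutable_gr, immutable_gr): index 0 -> lemmas, index 1 -> immutable_gr.
print(mixed_ngrams([lemmas, immutable_gr], template=(0, 1, 1)))
# [('устройство', 'PREP', 'NOUN,neut'), ('для', 'NOUN,neut', 'NOUN,neut')]
```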
To assess the quality of the rules obtained, we grouped the clusters by the relation that can be extracted from each cluster and counted the number of definitions in each group together with a combined estimate of precision. For each group of clusters, the overall number of definitions and the precision estimate are given in Table 2:

relation              Russian WordNet dictionary        Onto.PT
                      amount       precision            amount       precision
hypernym              53,246       85.54%               29,563       59.10%
synonym               10,044       75.69%               11,862       86.10%
junk                   7,175      100.00%
hypernym synonym       4,160       76.11%
hyponym                2,761       53.71%
part of                1,017      100.00%                1,287       52.60%
domain                   495       51.72%
instance of              253       61.26%                  253       61.26%
hypernym hypernym        125      100.00%
has part                 105       92.38%
total                 58,621       83.93%               37,898       76.64%

Table 2. Estimated number of extracted relations and extraction precision compared to Onto.PT.

The chapter contains the results of the experiment and their discussion. It is obvious that clustering helps improve the quality of relation extraction.

***

Chapter IV «Word sense disambiguation (WSD)» presents two experiments on automatic disambiguation of annotated words. It gives an overview of existing disambiguation algorithms, which can be divided into three classes:
1. algorithms based on simple heuristics;
2. algorithms based on machine learning;
3. algorithms using distributional semantics models.
Special attention was given to two groups of algorithms:
1. the Lesk algorithm and its modifications;
2. algorithms using the results obtained by the predictive vector models Word2Vec and AdaGram as features for machine learning.
The chapter analyses the possibility of applying the algorithms that we chose for thesaurus relation extraction from definition corpora, and their potential modifications that could improve WSD results.
Here we focus on hypernym and hyponym relations.
The first part focuses on the Lesk algorithm and its modifications. We set up an experiment which tests different approaches to feature extraction, weight modifications and options to improve the results with the help of Serelex, a word association database.
We have developed a pipeline for massively testing different disambiguation setups. The pipeline is preceded by obtaining common data: word lemmas, morphological information and word frequencies. For the pipeline we broke down the task of disambiguation into steps, and for each step we provide several alternative implementations.
These are (a schematic sketch of the full pipeline is given after this list):
● Represent a candidate hyponym-hypernym sense pair as the Cartesian product of the list of words in the hyponym sense and the list of words in the hypernym sense, with repeats retained.
● Calculate a numerical metric of word similarity. This is the point we strive to improve. As a baseline we used a random number, the inverse dictionary definition number, and the classic Lesk algorithm. We also introduce several new metrics, described below.
● Apply a compensation function for word frequency. We assume that the coincidence of frequent words in two definitions gives us much less information about their relatedness than the coincidence of infrequent words. We try the following compensation functions: no compensation, division by the logarithm of word frequency, and division by word frequency.
● Apply a non-parametric normalization function to the similarity measure. Some of the metrics produce values with very large variance, which leads to situations where one matching pair of words outweighs a lot of outright mismatching pairs. To mitigate this we apply one of the following variance-reducing functions: linear (no normalization), logarithm, Gaussian, or logistic curve.
● Apply an adjustment function to prioritize the first noun in each definition. While extracting candidate hypernyms, the algorithm retained up to three candidate nouns in each article. Our hypothesis is that the first one is most likely the hypernym. We apply a penalty to the metric depending on the position of the candidate hypernym within the hyponym definition. We tested the following penalties: no penalty, division by the word number, and division by the exponent of the word number.
● Aggregate the weights of the individual pairs of words. We test two aggregation functions: the average weight and the sum of the best N weights. In the latter case we repeat the sequence of weights if there are fewer than N pairs. We tested the following values of N: 2, 4, 8, 16, 32.
● The algorithm returns the candidate hypernym with the highest score.
● The data for these experiments comes from the BRED dictionary described in Chapter II.
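The following is a schematic sketch of the pipeline described above, assuming placeholder implementations for each step (names such as disambiguate, similarity and freq are illustrative, not taken from the actual code; the step numbers in the comments follow the order of the list):

```python
import math
from itertools import product

def disambiguate(hyponym_words, candidate_senses, similarity, freq, top_n=8):
    """Pick the candidate hypernym sense with the highest aggregated score.

    hyponym_words    -- lemmas of the hyponym's definition
    candidate_senses -- {sense_id: (lemmas of the candidate hypernym's definition,
                         position of the candidate noun in the hyponym definition)}
    similarity       -- word-pair similarity metric (step 2, the part we try to improve)
    freq             -- corpus frequency of each lemma
    """
    best_sense, best_score = None, float("-inf")
    for sense_id, (hypernym_words, noun_position) in candidate_senses.items():
        weights = []
        # Step 1: Cartesian product of the two word lists, repeats retained.
        for w1, w2 in product(hyponym_words, hypernym_words):
            score = similarity(w1, w2)                     # step 2: base metric
            score /= math.log(freq.get(w2, 1) + math.e)    # step 3: frequency compensation
            score = math.log1p(max(score, 0.0))            # step 4: variance normalization
            weights.append(score)
        # Step 5: penalize candidates that appear later in the hyponym definition.
        penalty = 1.0 / (noun_position + 1)
        # Step 6: aggregate -- here, sum of the best N weights, cycling the sorted
        # weights if there are fewer than N pairs (the average is the alternative).
        weights.sort(reverse=True)
        aggregated = sum((weights * top_n)[:top_n]) * penalty if weights else 0.0
        if aggregated > best_score:
            best_sense, best_score = sense_id, aggregated
    return best_sense

# Usage with the naive Lesk metric (exact lemma match) as the baseline similarity:
lesk = lambda a, b: 1.0 if a == b else 0.0
senses = {"sense_1": (["прибор", "для", "измерение"], 0),
          "sense_2": (["единица", "измерение", "время"], 2)}
print(disambiguate(["устройство", "для", "измерение", "время"], senses, lesk, freq={}))
```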
For testing the algorithms we selected words in several domains for manual markup. We defined a domain as a connected component in the graph of word senses and hypernyms produced by one of the algorithms. Each annotator was given the task of disambiguating every sense of every word in such a domain. Given a triplet, an annotator assigns either no hypernym or one hypernym; in exceptional cases assigning two hypernyms to a sense is allowed.
One domain with 175 senses defining 90 nouns and noun phrases was given to two annotators to estimate inter-annotator agreement. Both annotators assigned 145 hypernyms within the set. Of those, only 93 matched, resulting in 64% inter-annotator agreement. The 93 identically assigned hyponym-hypernym pairs were used as the core dataset for testing the results. An additional 300 word senses were marked up to verify the results on a larger dataset.
The algorithms described were tested on both datasets.
One known problem with the Lesk algorithm is that it uses only word co-occurrence when calculating the overlap rate (Basile et al., 2004) and does not extract information from synonyms or inflected words. In our tests it nevertheless worked surprisingly well on the dictionary corpus, finding twice as many correct hypernym senses as the random baseline. We strive to improve that result for dictionary definition texts.
The Russian language has rich word derivation through variation of word suffixes. The first obvious enhancement to the Lesk algorithm to account for this is to assign similarity scores to words based on the length of their common prefix. In the results we refer to this metric as advanced Lesk.
Another approach to enhancing the Lesk algorithm is to detect cases where two different words are semantically related. To this end we used Serelex, a database of word associations (Panchenko et al., 2013). It assigns a score on a scale from 0 to infinity to a pair of noun lemmas, roughly describing their semantic similarity. As a possible way to score words that are not nouns in Serelex, we truncate a few characters off the ends of both words and search for the best pair matching these prefixes in Serelex (see “prefix serelex” in the Table).
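As an illustration, here is a minimal sketch of the two word-level metrics just described, with the exact-match baseline for comparison (the minimum prefix length and the normalization are illustrative assumptions, not necessarily the choices made in the experiment):

```python
def naive_lesk_similarity(w1: str, w2: str) -> float:
    """Classic Lesk at the word level: exact lemma match or nothing."""
    return 1.0 if w1 == w2 else 0.0

def advanced_lesk_similarity(w1: str, w2: str, min_prefix: int = 3) -> float:
    """Score two words by the length of their common prefix, which captures shared
    stems under Russian suffixal derivation (e.g. "вод-а" / "вод-ный")."""
    common = 0
    for a, b in zip(w1, w2):
        if a != b:
            break
        common += 1
    if common < min_prefix:
        return 0.0
    # Normalize by the longer word so that identical words score 1.0.
    return common / max(len(w1), len(w2))

print(naive_lesk_similarity("вода", "водный"))     # 0.0 -- no exact match
print(advanced_lesk_similarity("вода", "водный"))  # 0.5 -- common prefix "вод"
print(advanced_lesk_similarity("вода", "воздух"))  # 0.0 -- prefix "во" is too short
```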
We tested several hypotheses on how these two metrics can be used to improve the resulting performance. The tested configurations were: to use only Lesk; to use only Serelex; to use Serelex where possible and fall back to advanced Lesk where no answer was available; and to sum the results of Serelex and Lesk. Since Serelex has a specific distribution of scores, we adjusted the advanced Lesk score to produce a similar distribution.
For each estimator we performed a full search through the available variations of steps 3-6 of the pipeline, selected the best configuration on the core set and evaluated it again on the larger dataset. Test results are given in the Table:

Algorithm                              Core set    Large set
random                                 30.8%       23.9%
first sense                            38.7%       37.7%
naive Lesk                             51.6%       41.3%
serelex                                49.5%       38.0%
advanced Lesk                          53.8%       33.3%
serelex with adjusted Lesk fallback    52.7%       36.3%
serelex + adjusted Lesk                52.7%       38.3%
prefix serelex                         53.8%       38.0%

Precision of different WSD algorithms.
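The combination strategies listed in the rows of the Table can be sketched as follows; serelex_score stands for some lookup into the Serelex database (its real interface is not shown here) and adjusted_lesk for the advanced Lesk score rescaled to a Serelex-like distribution, both assumptions of this sketch:

```python
def prefix_serelex(w1, w2, serelex_score, trim=2):
    """For non-noun word forms, truncate up to `trim` characters off the ends of
    both words and take the best Serelex score over the resulting prefix pairs."""
    best = 0.0
    for i in range(trim + 1):
        for j in range(trim + 1):
            best = max(best, serelex_score(w1[:len(w1) - i], w2[:len(w2) - j]))
    return best

def serelex_with_lesk_fallback(w1, w2, serelex_score, adjusted_lesk):
    """Use Serelex where it has an answer; otherwise fall back to adjusted Lesk."""
    score = serelex_score(w1, w2)
    return score if score > 0 else adjusted_lesk(w1, w2)

def serelex_plus_lesk(w1, w2, serelex_score, adjusted_lesk):
    """Sum the two signals (adjusted Lesk is already rescaled to match Serelex)."""
    return serelex_score(w1, w2) + adjusted_lesk(w1, w2)
```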
The second part explores different methods that use distributional semantic models. This section gives a brief review of the history of distributional semantics methods: at the moment of writing, these methods performed best in disambiguation tasks. The review grounds the choice of two distributional models and the methods of applying them to disambiguation tasks.
Further in the text, we give a thorough description of an experiment that compares different approaches to extracting WSD features from dictionary definitions and several machine learning algorithms. The main question of the experiment is whether using unmarked data alongside tagged data improves WSD results.
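For instance, one common way to turn a lemmatized definition into a fixed-size feature vector for a supervised classifier is to average the word vectors of its lemmas. The sketch below uses gensim's Word2Vec and scikit-learn's logistic regression purely as stand-ins; the models, features and data in the actual experiment may differ:

```python
import numpy as np
from gensim.models import Word2Vec               # assumed tooling for this sketch
from sklearn.linear_model import LogisticRegression

# Toy data: lemmatized definitions and manually assigned sense labels.
corpus = [["устройство", "для", "измерение", "время"],
          ["прибор", "для", "измерение", "температура"],
          ["крупный", "хищный", "зверь"],
          ["домашний", "животное", "семейство", "кошачий"]]
labelled = [(corpus[0], "device"), (corpus[1], "device"),
            (corpus[2], "animal"), (corpus[3], "animal")]

def definition_vector(lemmas, w2v):
    """Average the word vectors of a definition's lemmas (zeros if none are known)."""
    vectors = [w2v.wv[w] for w in lemmas if w in w2v.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

w2v = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, seed=1)
X = np.vstack([definition_vector(lemmas, w2v) for lemmas, _ in labelled])
y = [sense for _, sense in labelled]
classifier = LogisticRegression(max_iter=1000).fit(X, y)
# Predicted sense label for an unseen definition.
print(classifier.predict([definition_vector(["инструмент", "для", "измерение"], w2v)]))
```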