It is not how pymorphy2 works. To do the task efficiently, pymorphy2 exploits the DAFSA [5] dictionary structure: the result is built by traversing the word character graph and trying to follow "ё" transitions in addition to "е" transitions (for Russian) and "ґ" transitions in addition to "г" transitions (for Ukrainian).

4 Analysis of Out-of-Vocabulary Words

It is not practical to try to incorporate all words into a lexicon: there is a long tail of rarely used words, and new words keep appearing; there are morphological derivation and loanwords; and it is challenging to add all names, locations and special terms to the dictionary. Empirically, Zipf's Law seems to hold for natural languages [14]; one of the consequences is that even doubling the size of a lexicon could increase the coverage only slightly [6].

For languages without rich morphology it may be practical to assume that if a word is not in a dictionary then it can be of any class from the open word classes, and then to disambiguate the results at later processing stages, using e.g.
a contextual POS tagger or a syntactic parser. For Slavic languages doing this at later stages is challenging because of large tagsets: for example, OpenCorpora [3] words have more than 4500 different tags. A morphological analyzer addresses this by limiting the number of possible analyses based on word shape.

pymorphy2 uses a set of rules (analyzer units) to handle unknown words. Some of the rules are described in the literature [8,9,10,6,4]; the resulting combination is novel.
The order in which the rules are applied is language-specific.

4.1 Common Prefixes Removal

There is a set of immutable prefixes which can be attached to words of open classes (nouns, verbs, adjectives, adverbs, participles, gerunds) without affecting the grammatical properties of the word. Examples of such prefixes for Russian: "не", "псевдо", "авиа"; pymorphy2 provides language-specific lists of these prefixes.

When a word starts with one of these prefixes, pymorphy2 removes the prefix, parses the remainder and re-attaches the prefix. A similar rule is described in [8].
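
The prefix rule can be illustrated with a minimal sketch. It is not pymorphy2's implementation: the prefix tuple, the toy dictionary and the helper names `toy_parse` and `parse_with_prefix` are assumptions made for illustration, and a plain tuple stands in for the DAFSA-encoded prefix list.

```python
# Hypothetical data: a tiny prefix list and a toy dictionary (not pymorphy2's).
KNOWN_PREFIXES = ("псевдо", "авиа", "не")
TOY_DICTIONARY = {"кот": ["NOUN"], "красивый": ["ADJF"]}


def toy_parse(word):
    """Stand-in for the full analysis: returns (word form, tag) pairs."""
    return [(word, tag) for tag in TOY_DICTIONARY.get(word, [])]


def parse_with_prefix(word, parse=toy_parse, prefixes=KNOWN_PREFIXES):
    """Strip a known prefix, parse the remainder, re-attach the prefix."""
    results = []
    for prefix in prefixes:
        # Require a non-empty remainder after the prefix.
        if word.startswith(prefix) and len(word) > len(prefix):
            remainder = word[len(prefix):]
            for form, tag in parse(remainder):
                results.append((prefix + form, tag))
    return results
```

Because `parse` is passed in as a parameter, the remainder could itself be handled by any other rule, which mirrors the idea that full analysis is applied to it.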
Note that full analysis is performed on the remainder, so the remainder can be an out-of-vocabulary word itself. To speed up prefix matching, the built-in lists of prefixes are encoded as DAFSAs.

4.2 Words Ending with Other Dictionary Words

When all of the following apply, pymorphy2 assumes the whole word can be parsed the same way as the "suffix" word:

– the word being analyzed has another word from the dictionary as a suffix;
– the length of this "suffix" word is greater than 3;
– the length of the word without the "suffix" is no greater than 5;
– the "suffix" word is of an open class (noun, verb, adjective, participle, gerund).

To search for suffixes, pymorphy2 tries to consider the first letter as a prefix, then the first two letters as a prefix, etc., and looks up the remainder in the dictionary.

This rule is the same as described in [10]. A similar rule is described in [8], though its induction for concrete prefixes is different.

4.3 Endings Matching

In many languages, including Russian and Ukrainian, words with common endings often have the same grammatical form.

To exploit this, pymorphy2 first collects the information from the dictionary: for each word all endings of length 1 to 5 are extracted, and all possible analyses for these endings are stored.
Then this ending → {analyses} mapping is cleaned up:

– only the most frequent analyses for each POS tag are kept;
– analyses from non-productive paradigms (currently these are paradigms which produced fewer than 3 lexemes in a dictionary) are discarded;
– rare endings (currently the ones which occur once) are also discarded.

The resulting mapping is encoded as a DAFSA for fast lookups. The storage scheme is the following: <ending> SEP <analysisInfo>, where analysisInfo consists of three 2-byte numbers (frequency, paradigmId, formIndex): the analysis frequency (the number of times a word with this ending had this analysis), the ID of the analysis paradigm and the form index inside the paradigm.

At prediction time pymorphy2 checks word endings from length 5 to 1, stopping at the first ending with some analyses found.
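
The collection and prediction steps can be sketched as follows. This is a simplified illustration under stated assumptions: a plain dict replaces the DAFSA, analyses are opaque strings rather than (paradigmId, formIndex) pairs, only the "rare endings" cleanup step is implemented, and the toy words and the names `build_ending_map` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict

# Toy dictionary: word -> analysis label. All values are made up.
TOY_WORDS = {
    "синему": "ADJF,datv",
    "красному": "ADJF,datv",
    "большому": "ADJF,datv",
    "дому": "NOUN,datv",
}


def build_ending_map(words, max_len=5, min_count=2):
    """Count analyses for every ending of length 1..max_len."""
    counts = defaultdict(Counter)
    for word, analysis in words.items():
        for n in range(1, min(max_len, len(word)) + 1):
            counts[word[-n:]][analysis] += 1
    # Cleanup: drop endings that occur only once.
    return {e: c for e, c in counts.items() if sum(c.values()) >= min_count}


def predict(word, ending_map, max_len=5):
    """Try endings from longest to shortest; stop at the first hit."""
    for n in range(min(max_len, len(word)), 0, -1):
        analyses = ending_map.get(word[-n:])
        if analyses:
            # Most frequent analyses first.
            return [a for a, _ in analyses.most_common()]
    return []
```

With this toy data, `predict("зелёному", build_ending_map(TOY_WORDS))` finds nothing for the endings of length 5 and 4 and stops at "ому", where the adjective analysis is more frequent than the noun one.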
To get possible analyses for a given ending, pymorphy2 first follows all DAFSA transitions for the ending, then follows a separator, and then traverses the remaining subtree to get possible analysisInfo triples. The result is then sorted by analysis frequency.

Recall that a word and a (paradigmId, formIndex) pair are all that is needed to restore the lexeme and inflect the word. Lexemes are created on the fly, so it doesn't matter that the word is not from the vocabulary as long as we have a (paradigmId, formIndex) pair. It means morphological generation (lemmatization, inflection) works here.

Only analyses with open-class parts of speech (noun, verb, adjective, participle, gerund) are produced.
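
The <ending> SEP <analysisInfo> key layout and the frequency-sorted lookup can be illustrated with a short sketch. Several details are assumptions rather than facts from the text: the separator byte, big-endian byte order, UTF-8 encoding of the ending, and the use of a plain list of byte strings scanned by prefix instead of a real DAFSA. The function names are hypothetical.

```python
import struct

SEP = b"\x00"                 # assumed separator byte
INFO = struct.Struct(">HHH")  # three unsigned 2-byte numbers; byte order assumed


def encode_key(ending, frequency, paradigm_id, form_index):
    """Build one <ending> SEP <analysisInfo> key as bytes."""
    info = INFO.pack(frequency, paradigm_id, form_index)
    return ending.encode("utf-8") + SEP + info


def lookup(keys, ending):
    """Return (frequency, paradigmId, formIndex) triples for an ending,
    most frequent first. A prefix scan stands in for DAFSA traversal."""
    prefix = ending.encode("utf-8") + SEP
    triples = [INFO.unpack(k[len(prefix):]) for k in keys if k.startswith(prefix)]
    return sorted(triples, key=lambda t: t[0], reverse=True)
```

Storing the ending first means that all analyses for one ending share a common key prefix, which is what makes the "follow the ending, follow the separator, then enumerate the subtree" lookup possible.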
Special care is taken to handle the "ё" letter properly. Also, special care is required to handle paradigm prefixes properly: in fact, there are several ending → {analyses} DAFSAs built, one per paradigm prefix.

This rule is based on [10]; similar approaches are also used in [4] and [9]. [8] uses similar rules, but derives them differently.

4.4 Words with a Hyphen

Unlike some other morphological analyzers, pymorphy2 opts to handle words with a hyphen.

In [7] it is argued that in most cases the parts of compound words should be handled as separate words if they are joined using a hyphen.
A similar decision is made in the OpenCorpora tokenization module [2]; it considers words like "Жан-Поль" as three tokens which should be analyzed separately and joined back at later processing stages. In both cases the decisions are not motivated by linguistic considerations; it is technical difficulty that prevents analyzing and processing such words as single entities.

Currently pymorphy2 handles adverbs with a hyphen, particles separated by a hyphen and compound words with left and right parts separated by a hyphen.

Adverbs with a Hyphen. Russian words are parsed as adverbs if they

– start with a "по-" prefix;
– have total length greater than 5;
– can be parsed as a full singular adjective in the dative case when "по-" is removed.

Examples: "по-северному", "по-хорошему".

Particles Separated by a Hyphen. Though it is not clear if words with a particle separated by a hyphen (e.g. "смотри-ка" or "посмотрел-таки") should be handled as a single word or as two words, pymorphy2 supports parsing of such words. There are language-specific lists of common particles which can be attached, and if a word ends with one of these particles then it is parsed without the particle, and then the particle is re-attached to the result.

Compound Words with a Hyphen. The main challenge in the analysis of compound words whose parts are separated by a hyphen (like "человек-паук" and "Царь-пушка") is to figure out whether the left part should be inflected together with the right part, or whether it is a fixed prefix.

To do this, pymorphy2 parses the left and right parts separately (they don't have to be dictionary words).
Then it tries to find matching analyses. If there is a "left" analysis compatible with one of the "right" analyses, then a resulting analysis is built in which both word parts are inflected. After that, an analysis with a fixed left part is added to the result, regardless of whether a compatible "left" analysis was found or not. A similar method was used in [4].

Only words with a single hyphen are handled using the heuristics described above.
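
The compound-word heuristic can be sketched as below. This is only an illustration: tags are simplified to (POS, number, case) tuples, "compatible" is assumed here to mean equal number and case (the text does not define the compatibility test), and the toy parses and the names `toy_parse`, `compatible` and `parse_compound` are hypothetical.

```python
# Toy parses: word -> list of (POS, number, case) tags. Values are made up.
TOY_PARSES = {
    "человек": [("NOUN", "sing", "nomn")],
    "паук": [("NOUN", "sing", "nomn")],
    "интернет": [("NOUN", "sing", "nomn")],
    "магазина": [("NOUN", "sing", "gent")],
}


def toy_parse(word):
    return TOY_PARSES.get(word, [])


def compatible(left_tag, right_tag):
    # Assumption: parts agree when number and case coincide.
    return left_tag[1:] == right_tag[1:]


def parse_compound(word, parse=toy_parse):
    """Return (kind, tag) pairs for a word with exactly one hyphen."""
    if word.count("-") != 1:
        return []
    left, right = word.split("-")
    if not left or not right:
        return []
    left_tags, right_tags = parse(left), parse(right)
    results = []
    # Analyses where both parts are inflected together.
    for r_tag in right_tags:
        if any(compatible(l_tag, r_tag) for l_tag in left_tags):
            results.append(("inflected-left", r_tag))
    # Analyses with a fixed left part are always added.
    for r_tag in right_tags:
        results.append(("fixed-left", r_tag))
    return results
```

In this toy setup "человек-паук" yields both kinds of analysis, while "интернет-магазина" yields only the fixed-left one because the cases of the two parts differ.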
Words with multiple hyphens likely represent different phenomena in the Russian and Ukrainian languages; they could be interjections or phrases [13].

4.5 Other Tokens

An initial is an abbreviation of a person's first or patronymic name. In most cases an initial is a single upper-cased character (language-specific). pymorphy2 parses such characters as fixed singular nouns, with variants for all possible gender and case combinations. For first names (Name) two different lexemes are built, for male and female names.
For patronymic names (Patr) a single lexeme is returned. Unlike all other analyzer rules, detection of initials is case-sensitive. It is a way to decrease ambiguity.

The following tags are assigned to non-lexical tokens: PNCT for punctuation, LATN for tokens written in the Latin alphabet, NUMB,intg for integer numbers, NUMB,real for floating-point numbers, ROMN for Roman numerals.

When analyzing text, it is common to classify tokens during the tokenization step. The reason pymorphy2 handles non-lexical tokens during the morphological analysis step is that this allows users to use a simpler tokenizer when a classifying tokenizer is not available; also, it means that information about all tokens is available in a common format.

4.6 Morphological Generation of Out-of-Vocabulary Words

Inflection is fully supported for out-of-vocabulary words. To achieve this, pymorphy2 keeps track of the analyzer units (rules and their parameters) used to parse the word, requires each analyzer unit to provide a method for getting a lexeme, and calls this method for the last analyzer unit.
To compute the lexeme, an analyzer unit can look at the analysis result, and it can ask previous analyzer units for the lexeme.

For example, the Common Prefixes Removal analyzer removes the prefix from a word, then gets a lexeme from the previous analyzer, and then attaches the prefix to each word form in the lexeme to build the resulting lexeme.

5 Probability Estimation

A morphological analyzer may return multiple possible word parses. The problem of choosing the right analysis from a list of possible options is called disambiguation. Generally, to select the correct analysis it is necessary to take word context into account. A morphological analyzer takes individual words as input, so it can't disambiguate the result robustly.
However, it can provide an estimate of the conditional probability P(analysis|word). Such probability estimates can be used in the absence of a dedicated disambiguator to select the more probable analysis. In addition, these probabilities can be used at later stages of text analysis, for example by a disambiguator.

To estimate the conditional probability P(analysis|word) for Russian words, pymorphy2 uses the partially disambiguated OpenCorpora corpus [3] and assumes that P(analysis|word) = P(tag|word).
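
One way to turn disambiguated corpus counts into P(tag|word) is add-one (Laplace) smoothing over the tags the analyzer proposes for a word. The sketch below illustrates that idea for a single word. The function name, argument layout and toy counts are assumptions for illustration; the smoothing denominator adds B, taken here as the larger of the number of analyzer tags and the number of distinct corpus tags.

```python
from collections import Counter


def estimate_tag_probabilities(corpus_tags, analyzer_tags):
    """Add-one smoothed P(tag|word) for one word.

    corpus_tags: tags of the word's unambiguous corpus occurrences.
    analyzer_tags: distinct tags the analyzer proposes for the word.
    """
    counts = Counter(corpus_tags)
    total = len(corpus_tags)
    # B: number of outcomes the smoothing mass is spread over.
    b = max(len(set(analyzer_tags)), len(counts))
    return {tag: (counts[tag] + 1) / (total + b) for tag in analyzer_tags}


# Toy counts: 8 occurrences tagged NOUN, 1 tagged VERB, ADJF never seen.
probs = estimate_tag_probabilities(
    ["NOUN"] * 8 + ["VERB"], ["NOUN", "VERB", "ADJF"]
)
```

With these toy counts the unseen tag still receives a small non-zero probability (1/12), which is the point of the smoothing.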
The conditional probability is estimated for words which have multiple analyses according to pymorphy2, but have occurrences with a single remaining analysis in the OpenCorpora corpus; the estimate is a maximum-likelihood estimate with Laplace (add-one) smoothing.

$$W_{disambiguated} := \{word : |tags_{corpus}(word)| = 1,\ word \in corpus\}$$

$$W_{ambiguous} := \{word : |tags_{pymorphy2}(word)| > 1,\ word \in W_{disambiguated}\}$$

$$B(word) = \max(|tags_{pymorphy2}(word)|,\ |tags_{corpus}(word)|)$$

$$\forall word \in W_{ambiguous},\ \forall tag \in tags_{pymorphy2}(word):\quad P_{MLE}(tag|word) = \frac{count(word, tag) + 1}{count(word) + B(word)} \tag{1}$$

Counts are computed based on OpenCorpora corpus data; all words with a single remaining analysis are taken into account.

Once estimated, the result is stored on disk as a DAFSA; keys are

<word>:<tag><NULL><int(10^6 * P_MLE(tag|word))>

For words without P_MLE(tag|word) estimates, the probabilities are assigned uniformly during parsing.

For Ukrainian, probabilities are assigned uniformly because at the time of writing there is no freely available Ukrainian corpus similar to OpenCorpora.

6 Evaluation

Evaluating the analysis quality of different morphological analyzers for Russian is not straightforward because most analyzers (as well as annotated corpora) use their own incompatible tagsets.
And when a corpus and a dictionary have a compatible tagset, it usually means that the dictionary was enhanced from the corpus, which is a problem because quality numbers obtained on a corpus the dictionary was enhanced from shouldn't be relied on: they are too optimistic.

The analysis quality of pymorphy2 was compared to that of a well-known morphological analyzer (footnote 13), Mystem 3.0 [9]. The testing corpus consists of 100 randomly selected sentences (1405 tokens) from OpenCorpora (microcorpus, footnote 14) and 100 randomly selected sentences (1093 tokens) from ruscorpora.ru: 2498 manually disambiguated tokens in total.

Full details of this evaluation can be found online (footnote 15).

The OpenCorpora (pymorphy2) tagset is not the same as the ruscorpora.ru tagset, and the ruscorpora.ru tagset differs from the Mystem tagset.