pymorphy2 (1185429), page 2

This is not how pymorphy2 works. To do the task efficiently, pymorphy2 exploits the DAFSA [5] dictionary structure: the result is built by traversing the word character graph and trying to follow "ё" transitions in addition to "е" transitions (for Russian) and "ґ" transitions in addition to "г" transitions (for Ukrainian).

4 Analysis of Out-of-Vocabulary Words

It is not practical to try to incorporate all words into a lexicon: there is a long tail of rarely used words, new words keep appearing, there is morphological derivation and there are loanwords, and it is challenging to add all names, locations and special terms to the dictionary. Empirically, Zipf's Law seems to hold for natural languages [14]; one of the consequences is that even doubling the size of a lexicon could increase the coverage only slightly [6].

For languages without rich morphology it may be practical to assume that if a word is not in a dictionary then it can belong to any of the open word classes, and then to disambiguate the results at later processing stages, using e.g. a contextual POS tagger or a syntactic parser. For Slavic languages doing this at later stages is challenging because of large tagsets: for example, OpenCorpora [3] words have more than 4500 different tags. A morphological analyzer addresses this by limiting the number of possible analyses based on word shape.

pymorphy2 uses a set of rules (analyzer units) to handle unknown words. Some of the rules are described in the literature [8,9,10,6,4]; the resulting combination is novel.

The order in which the rules are applied is language-specific.

4.1 Common Prefixes Removal

There is a set of immutable prefixes which can be attached to words of open classes (nouns, verbs, adjectives, adverbs, participles, gerunds) without affecting the grammatical properties of the word. Examples of such prefixes for Russian: "не", "псевдо", "авиа"; pymorphy2 provides language-specific lists of these prefixes.

When a word starts with one of these prefixes, pymorphy2 removes the prefix, parses the remainder and re-attaches the prefix. A similar rule is described in [8].
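The prefix-removal step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not pymorphy2's implementation: the toy dictionary, the simplified tag strings, and the names `KNOWN_PREFIXES`, `parse_known` and `parse_with_prefix_removal` are all invented for the example, and prefixes are matched with a plain list scan.

```python
# Toy stand-ins: neither the data nor the names come from pymorphy2 itself.
KNOWN_PREFIXES = ["не", "псевдо", "авиа"]  # the example prefixes from the text

# word -> list of (lemma, tag); the tag strings are simplified for illustration
TOY_DICTIONARY = {
    "наука": [("наука", "NOUN,femn sing,nomn")],
    "билет": [
        ("билет", "NOUN,masc sing,nomn"),
        ("билет", "NOUN,masc sing,accs"),
    ],
}


def parse_known(word):
    """Look the word up in the toy dictionary; return a list of (lemma, tag)."""
    return list(TOY_DICTIONARY.get(word, []))


def parse_with_prefix_removal(word, parse_remainder=parse_known):
    """Strip a known prefix, parse the remainder, re-attach the prefix."""
    results = []
    for prefix in KNOWN_PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            remainder = word[len(prefix):]
            for lemma, tag in parse_remainder(remainder):
                # The prefix does not change the grammatical properties,
                # so the tag of the remainder is reused unchanged.
                results.append((prefix + lemma, tag))
    return results
```

Passing a different `parse_remainder` callable (a hypothetical hook) would let the remainder go through a fuller analysis instead of a plain dictionary lookup.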

Note that full analysis is performed on the remainder, so the remainder can be an out-of-vocabulary word itself. To speed up prefix matching, the built-in lists of prefixes are encoded as DAFSAs.

4.2 Words Ending with Other Dictionary Words

When all of the following conditions apply, pymorphy2 assumes the whole word can be parsed the same way as the "suffix" word:

– the word being analyzed has another word from a dictionary as a suffix;
– the length of this "suffix" word is greater than 3;
– the length of the word without the "suffix" is no greater than 5;
– the "suffix" word is of an open class (noun, verb, adjective, participle, gerund).

To search for suffixes pymorphy2 tries to consider the first letter as a prefix, then the first two letters as a prefix, etc., and looks up the remainder in a dictionary.

This rule is the same as described in [10]. A similar rule is described in [8], though its induction for concrete prefixes is different.

4.3 Endings Matching

In many languages, including Russian and Ukrainian, words with common endings often have the same grammatical form.

To exploit this, pymorphy2 first collects the information from the dictionary: for each word all endings of length 1 to 5 are extracted, and all possible analyses for these endings are stored.

Then this ending → {analyses} mapping is cleaned up:

– only the most frequent analyses for each POS tag are kept;
– analyses from non-productive paradigms (currently these are paradigms which produced fewer than 3 lexemes in a dictionary) are discarded;
– rare endings (currently the ones which occur once) are also discarded.

The resulting mapping is encoded as a DAFSA for fast lookups. The storage scheme is the following: <ending> SEP <analysisInfo>, where analysisInfo consists of three 2-byte numbers (frequency, paradigmId, formIndex): the analysis frequency (the number of times a word with this ending had this analysis), the ID of the analysis paradigm and the form index inside the paradigm.

At prediction time pymorphy2 checks word endings from length 5 to 1, stopping at the first ending for which some analyses are found.

To get possible analyses for a given ending pymorphy2 first follows all DAFSA transitions for the ending, then follows a separator, and then traverses the remaining subtree to get possible analysisInfo triples. The result is then sorted by analysis frequency.

Recall that a word and a (paradigmId, formIndex) pair is all that is needed to restore the lexeme and inflect the word. Lexemes are created on the fly, so it doesn't matter that the word is not from the vocabulary as long as we have a (paradigmId, formIndex) pair. It means morphological generation (lemmatization, inflection) works here.

Only analyses with open-class parts of speech (noun, verb, adjective, participle, gerund) are produced.
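The collect-then-predict flow of this rule can be illustrated with a small Python sketch. It is a simplification under stated assumptions: a plain dict replaces the DAFSA, the toy words and their (paradigmId, formIndex) values are invented, only the "rare endings" cleanup step is implemented, and the names `build_ending_table` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict

MAX_ENDING_LEN = 5

# Toy dictionary: word -> (paradigm_id, form_index). All values are invented.
TOY_WORDS = {
    "читает": (1, 3),
    "мечтает": (1, 3),
    "летает": (1, 3),
    "кровать": (2, 0),
    "гулять": (3, 0),
    "стрелять": (3, 0),
}


def build_ending_table(words, min_ending_count=2):
    """Map each ending of length 1..5 to a Counter of analyses."""
    table = defaultdict(Counter)
    for word, analysis in words.items():
        for n in range(1, min(MAX_ENDING_LEN, len(word)) + 1):
            table[word[-n:]][analysis] += 1
    # Cleanup: drop rare endings (here, the ones that occur only once).
    return {
        ending: counts
        for ending, counts in table.items()
        if sum(counts.values()) >= min_ending_count
    }


def predict(word, table):
    """Check endings from longest to shortest; stop at the first hit."""
    for n in range(min(MAX_ENDING_LEN, len(word)), 0, -1):
        counts = table.get(word[-n:])
        if counts:
            # most_common() returns analyses sorted by frequency, highest first.
            return counts.most_common()
    return []
```

With this toy data, an unseen word such as "болтает" falls through its 5-letter ending (seen only once, so discarded) and is matched on the 4-letter ending "тает".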

Special care is taken to handle the letter "ё" properly. Special care is also required to handle paradigm prefixes properly: in fact, several ending → {analyses} DAFSAs are built, one per paradigm prefix.

This rule is based on [10]; similar approaches are also used in [4] and [9]. [8] uses similar rules, but derives them differently.

4.4 Words with a Hyphen

Unlike some other morphological analyzers, pymorphy2 opts to handle words with a hyphen.

In [7] it is argued that in most cases the parts of compound words should be handled as separate words if they are joined using a hyphen.

A similar decision is made in the OpenCorpora tokenization module [2]; it considers words like "Жан-Поль" as three tokens which should be analyzed separately and joined back at later processing stages. In both cases the decisions are not motivated by linguistic considerations; it is the technical difficulty which prevents analyzing and processing such words as single entities.

Currently pymorphy2 handles adverbs with a hyphen, particles separated by a hyphen and compound words with left and right parts separated by a hyphen.

Adverbs with a Hyphen. Russian words are parsed as adverbs if they

– start with a "по-" prefix;
– have a total length greater than 5;
– can be parsed as a full singular adjective in the dative case when "по-" is removed.

Examples: "по-северному", "по-хорошему".

Particles Separated by a Hyphen. Though it is not clear whether words with a particle separated by a hyphen (e.g. "смотри-ка" or "посмотрел-таки") should be handled as a single word or as two words, pymorphy2 supports parsing of such words. There are language-specific lists of common particles which can be attached; if a word ends with one of these particles, it is parsed without the particle, and then the particle is re-attached to the result.

Compound Words with a Hyphen. The main challenge in the analysis of compound words whose parts are separated by a hyphen (like "человек-паук" and "Царь-пушка") is to figure out whether the left part should be inflected together with the right part, or whether it is a fixed prefix.

To do this, pymorphy2 parses the left and right parts separately (they don't have to be dictionary words).

Then it tries to find matching analyses. If there is a "left" analysis compatible with one of the "right" analyses, then a resulting analysis is built in which both word parts are inflected. After that, an analysis with a fixed left part is added to the result, regardless of whether a compatible "left" analysis was found or not. A similar method was used in [4].

Only words with a single hyphen are handled using the heuristics described above.
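The compound-word heuristic can be sketched as follows. This is an illustration under stated assumptions rather than pymorphy2's code: the `Analysis` tuple, the toy parses, and the helper names are hypothetical, and "compatible" is taken here to mean equal part of speech, number and case, which the text does not specify.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str
    pos: str
    number: str
    case: str


# Toy parses for individual word parts; all entries are invented.
TOY_PARSES = {
    "человеку": [Analysis("человек", "NOUN", "sing", "datv")],
    "пауку": [Analysis("паук", "NOUN", "sing", "datv")],
    "интернет": [Analysis("интернет", "NOUN", "sing", "nomn")],
    "магазину": [Analysis("магазин", "NOUN", "sing", "datv")],
}


def toy_parse(word):
    return list(TOY_PARSES.get(word, []))


def compatible(left, right):
    # Assumption: parts agree when POS, number and case all match.
    return (left.pos, left.number, left.case) == (right.pos, right.number, right.case)


def parse_compound(word, parse=toy_parse):
    """Parse a word with exactly one hyphen; other words yield no result."""
    if word.count("-") != 1:
        return []
    left, right = word.split("-")
    right_parses = parse(right)
    results = []
    # Variant 1: both parts are inflected together (compatible analyses).
    for l in parse(left):
        for r in right_parses:
            if compatible(l, r):
                results.append(Analysis(f"{l.lemma}-{r.lemma}", r.pos, r.number, r.case))
    # Variant 2: the left part is a fixed prefix; this is always added.
    for r in right_parses:
        results.append(Analysis(f"{left}-{r.lemma}", r.pos, r.number, r.case))
    return results
```

For "человеку-пауку" the sketch returns both variants, while for "интернет-магазину" only the fixed-left variant is produced, because the toy analysis of the left part is nominative and the right part is dative.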

Words with multiple hyphens are likely to represent different phenomena in the Russian and Ukrainian languages; they could be interjections or phrases [13].

4.5 Other Tokens

An initial is an abbreviation of a person's first name or patronymic. In most cases an initial is a single upper-cased character (language-specific). pymorphy2 parses such characters as fixed singular nouns, with variants for all possible gender and case combinations. For first names (Name) two different lexemes are built, one for male and one for female names.

For patronymic names (Patr) a single lexeme is returned. Unlike all other analyzer rules, detection of initials is case-sensitive, which is a way to decrease ambiguity.

The following tags are assigned to non-lexical tokens: PNCT for punctuation, LATN for tokens written in the Latin alphabet, NUMB,intg for integer numbers, NUMB,real for floating-point numbers, ROMN for Roman numerals.

When analyzing text, it is common to classify tokens during the tokenization step. The reason pymorphy2 handles non-lexical tokens during the morphological analysis step is that this allows users to use a simpler tokenizer when a classifying tokenizer is not available; it also means that information about all tokens is available in a common format.

4.6 Morphological Generation of Out-of-Vocabulary Words

Inflection is fully supported for out-of-vocabulary words. To achieve this, pymorphy2 keeps track of the analyzer units (rules and their parameters) used to parse the word, requires each analyzer unit to provide a method for getting a lexeme, and calls this method for the last analyzer unit.

To compute the lexeme, an analyzer unit can look at the analysis result, and it can ask previous analyzer units for the lexeme.

For example, the Common Prefixes Removal analyzer removes the prefix from a word, then gets a lexeme from the previous analyzer, and then attaches the prefix to each word form in that lexeme to build the resulting lexeme.

5 Probability Estimation

A morphological analyzer may return multiple possible word parses. The problem of choosing the right analysis from a list of possible options is called disambiguation. Generally, selecting the correct analysis requires taking the word context into account. A morphological analyzer takes individual words as input, so it can't disambiguate the result robustly.

However, it can provide an estimate of the conditional probability P(analysis|word). Such probability estimates can be used in the absence of a dedicated disambiguator to select the more probable analysis. In addition, these probabilities can be used at later stages of text analysis, for example by a disambiguator.

To estimate the conditional probability P(analysis|word) for Russian words, pymorphy2 uses the partially disambiguated OpenCorpora corpus [3] and assumes that P(analysis|word) = P(tag|word).

The conditional probability is estimated for words which have multiple analyses according to pymorphy2, but have occurrences with a single remaining analysis in the OpenCorpora corpus; the estimate is a maximum-likelihood estimate with Laplace (add-one) smoothing.

\[
\begin{aligned}
W_{\mathrm{disambiguated}} &:= \{\, word : |tags_{\mathrm{corpus}}(word)| = 1,\ word \in corpus \,\} \\
W_{\mathrm{ambiguous}} &:= \{\, word : |tags_{\mathrm{pymorphy2}}(word)| > 1,\ word \in W_{\mathrm{disambiguated}} \,\} \\
B(word) &= \max\bigl(|tags_{\mathrm{pymorphy2}}(word)|,\ |tags_{\mathrm{corpus}}(word)|\bigr)
\end{aligned}
\]

\[
\forall\, word \in W_{\mathrm{ambiguous}},\ \forall\, tag \in tags_{\mathrm{pymorphy2}}(word):\quad
P_{\mathrm{MLE}}(tag \mid word) = \frac{count(word, tag) + 1}{count(word) + B(word)} \tag{1}
\]

Counts are computed from OpenCorpora corpus data; all words with a single remaining analysis are taken into account.

Once estimated, the result is stored on disk as a DAFSA; keys are

<word>:<tag><NULL><int(10^6 * P_MLE(tag|word))>

For words without P_MLE(tag|word) estimates, probabilities are assigned uniformly during parsing.

For the Ukrainian language probabilities are assigned uniformly because at the time of writing there is no freely available Ukrainian corpus similar to OpenCorpora.

6 Evaluation

Evaluating the analysis quality of different morphological analyzers for Russian is not straightforward because most analyzers (as well as annotated corpora) use their own incompatible tagsets.

And when a corpus and a dictionary have a compatible tagset, it usually means that the dictionary was enhanced from the corpus. This is a problem because quality numbers obtained on a corpus the dictionary was enhanced from shouldn't be relied on: they are too optimistic.

pymorphy2 analysis quality was compared to the analysis quality of a well-known morphological analyzer, Mystem 3.0 [9]. The testing corpus consists of 100 randomly selected sentences (1405 tokens) from OpenCorpora (microcorpus) and 100 randomly selected sentences (1093 tokens) from ruscorpora.ru, 2498 manually disambiguated tokens in total.

Full details for this evaluation can be found online.

The OpenCorpora (pymorphy2) tagset is not the same as the ruscorpora.ru tagset, and the ruscorpora.ru tagset differs from the Mystem tagset.
