pymorphy2 (1185429)


Morphological Analyzer and Generator for Russian and Ukrainian Languages

Mikhail Korobov
ScrapingHub, Inc., Ekaterinburg, Russia
kmike84@gmail.com

1 Introduction

Morphological analysis is the analysis of the internal structure of words. For languages with rich morphology, such as Russian or Ukrainian, morphological analysis makes it possible to figure out whether a word can be a noun or a verb, or whether it can be singular or plural.

Morphological analysis is an important step of natural language processing pipelines for such languages.

Morphological generation is the process of building a word given its grammatical representation; this includes lemmatization, inflection and finding word lexemes.

pymorphy2 is a morphological analyzer and generator for the Russian and Ukrainian languages, widely used in industry and in academia. It has been in development since 2012; Ukrainian support is a recent addition. The development of its predecessor, pymorphy¹, started in 2009.

The package is available² under a permissive license (MIT), and it uses open source, permissively licensed dictionary data.

The rest of this paper is organized as follows. Section 2 describes the pymorphy2 software architecture and design principles. Section 3 explains how pymorphy2 uses lexicons and how analysis and morphological generation work for vocabulary words. Section 4 explains the methods used for out-of-vocabulary words and compares them with approaches used by other morphological analyzers. Section 5 is dedicated to the problem of selecting the correct analysis from all possible analyses, and to the role of a morphological analyzer in this task.

¹ https://bitbucket.com/kmike/pymorphy
² https://github.com/kmike/pymorphy2

Section 6 presents evaluation results. Section 7 outlines a roadmap for future pymorphy2 improvements.

2 Software Architecture

pymorphy2 is implemented as a cross-platform Python³ library, with a command-line utility and optional C++ extensions for faster analysis. Both Python 2.x and Python 3.x are supported. An extensive test suite (600+ unit tests) ensures code quality; test coverage is kept above 90%. Online documentation⁴ is available.

When the optional C++ extension is used (or when pymorphy2 is executed using the PyPy⁵ Python interpreter), the parsing speed is usually in the tens of thousands of words per second; in some specific cases it can exceed 100,000 words per second in a single thread.

Without the extension, the parsing speed is in the thousands of words per second. Memory consumption is about 15 MB, or about 30 MB if the Python interpreter itself is counted.

Users are provided with a simple API for working with words, their analyses and grammatical tags. There are methods to analyze words, inflect and lemmatize them, build word lexemes and make words agree with a number, as well as methods for working with tags, grammemes and dictionaries. The inherent complexity of working with natural languages is not hidden from the user.

For example, to lemmatize a word correctly it is necessary to choose the correct analysis from a list of possible analyses; pymorphy2 provides P(analysis|word) estimates and sorts the results accordingly, but requires the user to choose the analysis explicitly before normalizing the word.

Analysis of vocabulary words and out-of-vocabulary words is unified.

There is a configurable pipeline of "analyzer units"; it contains a unit for the analysis of vocabulary words and units (rules) for handling out-of-vocabulary words. Individual units can be customized or turned off; some rules are parametrized with language-specific data. Users can create their own analyzer units (rules).
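The idea of a pipeline of analyzer units can be illustrated with a minimal sketch. It is not pymorphy2's actual unit interface: the class names, the `parse` method, the toy suffix rule, the made-up out-of-vocabulary word and the "first unit that answers wins" policy are all assumptions made for illustration.

```python
class DictionaryUnit:
    """Hypothetical unit: analyzes words found in a lexicon (word -> tags)."""

    def __init__(self, lexicon):
        self.lexicon = lexicon

    def parse(self, word):
        return list(self.lexicon.get(word, []))


class SuffixRuleUnit:
    """Hypothetical out-of-vocabulary rule: guess a tag from the word ending."""

    def __init__(self, suffix_tags):
        # Language-specific data: ending -> guessed tag (toy values).
        self.suffix_tags = suffix_tags

    def parse(self, word):
        return [tag for suffix, tag in self.suffix_tags.items()
                if word.endswith(suffix)]


def analyze(word, units):
    """Try units in order; return the analyses of the first unit that has any.

    Stopping at the first successful unit is an assumption of this sketch.
    """
    for unit in units:
        analyses = unit.parse(word)
        if analyses:
            return analyses
    return []


lexicon = {"ёж": ["NOUN,anim,masc sing,nomn"]}
units = [DictionaryUnit(lexicon), SuffixRuleUnit({"ами": "NOUN plur,ablt"})]

print(analyze("ёж", units))          # answered by the dictionary unit
print(analyze("бутявками", units))   # made-up word, answered by the suffix rule
print(analyze("бутявками", units[:1]))  # rule turned off: no analyses
```

Because the pipeline is just an ordered list, turning a rule off or adding a user-defined one amounts to editing that list.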

All this makes it possible to perform morphological analysis experiments without changing the pymorphy2 source code, to develop domain-specific morphological analysis pipelines, and to adapt pymorphy2 to languages other than Russian. The latter point is validated by the introduction of experimental support for the Ukrainian language.

³ https://www.python.org/
⁴ http://pymorphy2.readthedocs.org
⁵ http://pypy.org/

3 Analysis of Vocabulary Words

pymorphy2 relies on large lexicons for the analysis of common words. For Russian it uses the OpenCorpora [3] dictionary (~5 × 10^6 word forms, ~0.39 × 10^6 lemmas), converted from the OpenCorpora XML⁶ format to a compact representation optimized for morphological analysis and generation tasks.

End users don't have to compile the dictionaries themselves; pymorphy2 ships with prebuilt, periodically updated dictionaries.

Any dictionary in OpenCorpora XML format can be used by pymorphy2. For Ukrainian there is such an experimental dictionary (~2.5 × 10^6 word forms), being developed⁷ by Andriy Rysin, Dmitry Chaplinsky, Mariana Romanyshyn and other contributors; it is based on LanguageTool⁸ data.

The source dictionary contains word forms with their tags, grouped by lexemes. For example, a lexeme for the lemma "ёж" (a hedgehog) looks like this:

ёж      NOUN,anim,masc sing,nomn
ежа     NOUN,anim,masc sing,gent
ежу     NOUN,anim,masc sing,datv
...
ежами   NOUN,anim,masc plur,ablt
ежах    NOUN,anim,masc plur,loct

In source dictionaries there can also be links between lexemes.

For example, the lexemes for the infinitive, verb, gerund and participle forms of the same lemma may be connected. Currently pymorphy2 joins connected lexemes into a single lexeme for most link types.

3.1 Morphological Analysis and Generation

Given a dictionary, to analyze a word means to find all possible grammatical tags for the word. Obtaining a normal form (lemmatization) means finding the first word form in the lexeme.

To inflect a word is to find another word form in the same lexeme with the requested grammemes.

As can be seen, all these tasks are simple. With an XML dictionary, analysis of known words can be performed just by running queries on the XML file.

The problem is that querying XML is O(N) with large constant factors, the raw data takes quite a lot of memory, and the source dictionary is not well suited for morphological analysis and generation of out-of-vocabulary words.

To create a compact representation and enable fast access, pymorphy2 encodes lexeme information: all words are stored in a DAFSA [5] using the dawgdic⁹ C++ library [11] via a Python wrapper¹⁰; information about word tags and lexemes is encoded as numbers.

⁶ http://opencorpora.org/?page=export
⁷ Conversion utilities: https://github.com/dchaplinsky/LT2OpenCorpora
⁸ https://languagetool.org/
⁹ https://code.google.com/p/dawgdic/
¹⁰ https://github.com/kmike/DAWG
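The dictionary-based definitions of analysis, lemmatization and inflection given in Section 3.1 can be illustrated with a naive sketch over the "ёж" lexeme. Only the five forms listed in the source are included (the elided ones are left out), and the function names and the linear scan are assumptions of this sketch, not pymorphy2 code. The scan is O(N) in the dictionary size, which is the cost the compact encoding is meant to avoid.

```python
# One lexeme: (word form, tag) pairs; the first form is the normal form.
YOZH = [
    ("ёж", "NOUN,anim,masc sing,nomn"),
    ("ежа", "NOUN,anim,masc sing,gent"),
    ("ежу", "NOUN,anim,masc sing,datv"),
    ("ежами", "NOUN,anim,masc plur,ablt"),
    ("ежах", "NOUN,anim,masc plur,loct"),
]
LEXEMES = [YOZH]


def grammemes(tag):
    """Split a tag such as 'NOUN,anim,masc sing,nomn' into a set of grammemes."""
    return set(tag.replace(" ", ",").split(","))


def analyze(word, lexemes=LEXEMES):
    """All grammatical tags recorded for the word."""
    return [tag for lexeme in lexemes for form, tag in lexeme if form == word]


def lemmatize(word, lexemes=LEXEMES):
    """First word form of every lexeme that contains the word."""
    return [lexeme[0][0] for lexeme in lexemes
            if any(form == word for form, _ in lexeme)]


def inflect(word, required, lexemes=LEXEMES):
    """Forms from the word's lexeme(s) whose tags contain the required grammemes."""
    return [form
            for lexeme in lexemes
            if any(f == word for f, _ in lexeme)
            for form, tag in lexeme
            if required <= grammemes(tag)]


print(analyze("ежу"))
print(lemmatize("ежами"))
print(inflect("ёж", {"plur", "loct"}))
```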

The storage scheme is close to the scheme described in aot.ru [10], but it is not quite the same.

Paradigms. A paradigm in pymorphy2 is an inflection pattern of a lexeme. It consists of (prefix_i, suffix_i, tag_i) triples, one for each word form in a lexeme, such that each word form i can be represented as prefix_i + stem + suffix_i, where stem is the same for all words in a lexeme.

This representation allows us to factorize a lexeme into a stem and a paradigm. Paradigm prefixes, suffixes and tags are encoded as numbers by pymorphy2; lexeme stems are discarded.
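A minimal sketch of this factorization follows. It makes two simplifying assumptions that are not taken from the paper: all prefixes are empty, and the stem is the longest common prefix of the word forms. The function names and the three-form lexemes are illustrative only. The sketch shows that two lexemes with different stems can share one paradigm, and that any form can be rebuilt from a word, its form index and the paradigm.

```python
import os

EZHIK = [
    ("ёжик", "NOUN,anim,masc sing,nomn"),
    ("ёжика", "NOUN,anim,masc sing,gent"),
    ("ёжику", "NOUN,anim,masc sing,datv"),
]
DVORNIK = [
    ("дворник", "NOUN,anim,masc sing,nomn"),
    ("дворника", "NOUN,anim,masc sing,gent"),
    ("дворнику", "NOUN,anim,masc sing,datv"),
]


def factorize(lexeme):
    """Split a lexeme into (stem, paradigm); the paradigm is a tuple of
    (prefix, suffix, tag) triples. Assumes empty prefixes."""
    stem = os.path.commonprefix([form for form, _ in lexeme])
    paradigm = tuple(("", form[len(stem):], tag) for form, tag in lexeme)
    return stem, paradigm


def restore(stem, paradigm):
    """Rebuild the lexeme as prefix_i + stem + suffix_i for every form i."""
    return [(prefix + stem + suffix, tag) for prefix, suffix, tag in paradigm]


def reinflect(word, paradigm, i, k):
    """Given a word known to be form i, return form k of the same lexeme."""
    prefix_i, suffix_i, _ = paradigm[i]
    stem = word[len(prefix_i):len(word) - len(suffix_i)]
    prefix_k, suffix_k, _ = paradigm[k]
    return prefix_k + stem + suffix_k


stem, paradigm = factorize(EZHIK)
print(stem, paradigm)
print(factorize(DVORNIK)[1] == paradigm)  # both lexemes share one paradigm
print(reinflect("ёжику", paradigm, 2, 0))  # back to the first (normal) form
```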

This means that a paradigm is stored as an array of numbers (prefix, suffix and tag IDs), and lexemes are not stored explicitly: they are reconstructed on demand from word and paradigm information.

There are no paradigms provided in the source dictionary; pymorphy2 infers them from the lexemes. For Russian there are about 3200 paradigms inferred from 390000 lexemes.

Word Storage. Word forms with their analysis information are stored in a DAFSA. Other storage schemes were tried, including a two-trie scheme similar to the one described in [9] (but using double-array tries), and succinct (MARISA¹¹) tries. For pymorphy2 data, the DAFSA provided the most compact representation, and at the same time it was the fastest and had the most flexible iteration support.

Fig. 1. DAFSA encoding example. Encoded (word, paradigmId, formIndex) triples: (двор, 103, 0); (ёж, 104, 0); (дворник, 101, 2); (дворник, 102, 2); (ёжик, 101, 2); (ёжик, 102, 2)

For each word form pymorphy2 stores a (word, paradigmId, formIndex) triple:

– word form, as text;
– ID of its paradigm;
– word form index in the lexeme.

A DAFSA doesn't support attaching values to leaves, so the information is encoded as follows: <word> SEP <paradigmId> <formIndex> (see the example in Fig. 1)¹².

¹¹ https://code.google.com/p/marisa-trie/
¹² pymorphy2 encodes words to UTF-8 before putting them into the DAFSA, so in practice there are more nodes than shown in Fig. 1. This is an implementation detail.

The storage is especially efficient because words with similar endings often have the same analyses, i.e. the same (paradigmId, formIndex) pairs; this allows the DAFSA to use fewer nodes/transitions to represent the data. The DAFSA for the Russian OpenCorpora dictionary (5 × 10^6 analyses, about 3 × 10^6 unique word forms) enables fast lookups (hundreds of thousands of lookups per second from Python) and takes less than 7 MB of RAM; the source XML file is about 400 MB on disk.

To get all analyses of a word, the DAFSA transitions for the word are followed, then the separator SEP is followed, and then the remaining subtree is traversed to get all possible (paradigmId, formIndex) pairs.

Given a (paradigmId, formIndex) pair, one can find the grammatical tag of a word: find the paradigm in the paradigms array by paradigmId, then get the (prefix_i, suffix_i, tag_i) triple from the paradigm using i := formIndex.
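The lookup procedure can be sketched with a plain trie built from nested dictionaries. This is a deliberate simplification: a real DAFSA also merges identical suffix subtrees, which is where the compactness comes from, and pymorphy2 uses the dawgdic library rather than anything like this. The separator value and function names are assumptions; the triples are the ones listed in the Fig. 1 caption.

```python
SEP = "\x01"  # hypothetical separator symbol that never occurs in words

TRIPLES = [
    ("двор", 103, 0), ("ёж", 104, 0),
    ("дворник", 101, 2), ("дворник", 102, 2),
    ("ёжик", 101, 2), ("ёжик", 102, 2),
]


def build_trie(triples):
    """Insert every key <word> SEP <paradigmId> <formIndex> into a trie."""
    root = {}
    for word, paradigm_id, form_index in triples:
        node = root
        for symbol in list(word) + [SEP, paradigm_id, form_index]:
            node = node.setdefault(symbol, {})
    return root


def analyses(trie, word):
    """Follow the word, then SEP, then collect all (paradigmId, formIndex)."""
    node = trie
    for symbol in list(word) + [SEP]:
        node = node.get(symbol)
        if node is None:
            return []  # the word (or its separator) is not in the trie
    return sorted((paradigm_id, form_index)
                  for paradigm_id, subtree in node.items()
                  for form_index in subtree)


trie = build_trie(TRIPLES)
print(analyses(trie, "дворник"))
print(analyses(trie, "двор"))
print(analyses(trie, "дворн"))  # a prefix of a word, but not a word itself
```

The separator is what distinguishes a complete word from a mere prefix: "дворн" has a path in the trie but no SEP transition, so it yields no analyses.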

Given a (paradigmId, formIndex) pair and the word itself, it is possible to restore the lexeme and to lemmatize or inflect the word: from the word, prefix_i and suffix_i we can get the stem, and given the stem and (prefix_k, suffix_k, tag_k) it is possible to restore the full word for the k-th word form.

3.2 Working with "ё" and "ґ" Characters Efficiently

The use of the letter "ё" is optional in Russian; in real texts it is often replaced with the letter "е". The rules for "ґ" / "г" substitution are different in Ukrainian, but in practice there are real-world texts with "ґ" letters replaced with "г".

The simplest way to handle this is to replace "ё" / "ґ" with "е" / "г" both in the input text and in the dictionary.

However, this is suboptimal because it discards useful information, makes the text less correct (in Ukrainian, "г" instead of "ґ" can be seen as a spelling error) and increases ambiguity: there are words whose analysis should depend on е/ё and ґ/г. For example, the word "все" should be parsed as plural, but the word "всё" shouldn't.

pymorphy2 assumes that "ё" / "ґ" usage is mandatory in the dictionary but optional in the input text. For example, if a Russian input word contains the letter "ё", then only analyses with this letter are returned; if there are "е" letters in the input word, then possible analyses for both "е" and "ё" are returned.

An easy way to implement this would be to check each combination of е/ё and ґ/г replacements for the input word.
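The easy approach mentioned above can be sketched as follows. The function names and the three-word dictionary are hypothetical, and only the Russian е/ё pair is handled by default. Note that the number of candidate spellings doubles with every "е" in the input word, so this brute-force check becomes costly for longer words.

```python
from itertools import product


def spelling_variants(word, plain="е", marked="ё"):
    """All spellings obtained by optionally replacing each `plain` letter
    with `marked`. Letters already written as `marked` are left untouched."""
    options = [(ch, marked) if ch == plain else (ch,) for ch in word]
    return ["".join(combo) for combo in product(*options)]


def lookup(word, dictionary):
    """Dictionary entries matching the input under optional е -> ё replacement."""
    return [variant for variant in spelling_variants(word)
            if variant in dictionary]


# Hypothetical dictionary in which "ё" is always spelled out.
DICTIONARY = {"ёж", "все", "всё"}

print(lookup("еж", DICTIONARY))   # input without "ё" still finds "ёж"
print(lookup("все", DICTIONARY))  # both readings are returned
print(lookup("всё", DICTIONARY))  # explicit "ё": only that reading
```

For Ukrainian the same function could be called with plain="г" and marked="ґ".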
