For evaluation purposes all tags were converted to Mystem format using a set of automatic rules. Quality was evaluated on full morphological tags, i.e. tags must match exactly to be considered correct, with a few exceptions related to tag conversion problems. All reported errors were checked manually to filter out false positives.

Table 1. Errors

              pymorphy2   Mystem 3.0
microcorpus   10          15
ruscorpora    9           8
total         19          23

Both pymorphy2 and Mystem had an error rate below 1% (without disambiguation, i.e. in fewer than 1% of cases the correct analysis was missing from the set of analyses returned by the analyzer). It should be noted that 9 out of 19 pymorphy2 errors and 14 out of 23 Mystem errors were related to abbreviation handling. Mystem handled first and last names better (1 mistake versus 4 for pymorphy2); pymorphy2 made fewer mistakes on "regular" words (4 versus 6 for Mystem). Mystem can’t parse many hyphenated words as a single token; such words were not considered. Punctuation, numbers and non-Russian words were also removed from the input.

It is hard to draw a quantitative conclusion because the corpus size is small. Both analyzers have similar analysis quality, and the resulting numbers depend on evaluation minutiae: whether abbreviations are considered, whether hyphenated words must be parsed, whether verb transitivity must be predicted correctly, whether it is important to distinguish adverbs from parenthetical words, etc.

¹³ https://tech.yandex.ru/mystem/
¹⁴ https://github.com/kmike/microcorpus
¹⁵ http://nbviewer.ipython.org/gist/kmike/52fb0a9b3ed627310bea

Several human annotation errors were found by parsing OpenCorpora data with Mystem (1 error) and ruscorpora data with pymorphy2 (6 errors).
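The "without disambiguation" criterion behind Table 1 can be illustrated with a minimal Python sketch: a token counts as an error only when its gold tag is absent from the set of analyses the analyzer returns. Everything named here is an assumption made for illustration (the function `count_errors`, the toy analyzer, the word list and the tag strings); none of it is taken from the evaluation code or from either analyzer's API.

```python
from typing import Callable, Iterable, List, Set, Tuple


def count_errors(
    gold: Iterable[Tuple[str, str]],
    analyze: Callable[[str], Set[str]],
) -> List[Tuple[str, str]]:
    """Return the (word, gold_tag) pairs whose gold tag is not among
    the analyses produced for that word (no disambiguation step)."""
    return [(word, tag) for word, tag in gold if tag not in analyze(word)]


# Hypothetical stand-in for a real analyzer: every tag it can produce per word.
# The tag strings are made up and do not follow any real tagset exactly.
TOY_ANALYSES = {
    "стали": {"V,pl,past", "S,gen,sg"},
    "мама": {"S,nom,sg"},
}


def toy_analyze(word: str) -> Set[str]:
    # Unknown words get an empty set of analyses in this sketch.
    return TOY_ANALYSES.get(word, set())


# Hypothetical gold annotations: one (word, tag) pair per token.
gold = [("стали", "S,gen,sg"), ("мама", "S,nom,sg"), ("мама", "S,acc,sg")]

errors = count_errors(gold, toy_analyze)
error_rate = len(errors) / len(gold)
```

An ambiguous word such as "стали" is counted as correct as long as one of its analyses matches the gold tag, which is why this measure says nothing about how well an analyzer ranks or disambiguates its output.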
OpenCorpora shares a dictionary with pymorphy2, and the ruscorpora annotation is related to Mystem; this shows the utility of using cross-corpora tools to check annotations.

The most sophisticated evaluation of Russian morphological parsers so far is [1]; it took place in 2010. The previous version of pymorphy2 (pymorphy) participated¹⁶ in the tracks without disambiguation; it finished 1st on Full Morphology Analysis, 3rd on Lemmatization, 3rd on POS tagging and 5th on the Rare Words track. pymorphy did not participate in the disambiguation tracks.

pymorphy used some pymorphy2 rules (not all) and a different dictionary (extracted from [10] instead of [3]).
Generally, pymorphy2 should work better than pymorphy because of an improved dictionary and rules, but this has not been measured quantitatively yet.

7 Conclusion and Future Plans

pymorphy2 is released under a permissive open-source license (MIT). All the dictionaries and corpora pymorphy2 depends on are also available under permissive open-source licenses. This encourages usage and contributions. There are volunteers working on Russian and Ukrainian dictionaries and corpora, related tools and pymorphy2 itself.

Development of pymorphy2 is by no means finished. There are word classes for which pymorphy2 analysis can be improved.
Among them are surnames and patronymics, foreign personal names, diminutive first names, locations, uppercase and other abbreviations, some classes of hyphenated words, and ordinal numbers (including ordinal numbers written in digit notation like "22-й"). According to [1], similar issues are common for Russian morphological analyzers.

Non-contextual P(tag|word) estimates can be improved by transferring some information about similar words and by improving the corpora.

A better comparison between pymorphy, pymorphy2, Mystem and other morphological analyzers would require a robust tagset conversion library.

The support for Ukrainian is experimental. The dictionary requires work, pymorphy2 needs more Ukrainian-specific rules for handling out-of-vocabulary words, and better P(tag|word) estimates require an annotated Ukrainian corpus: even a small corpus (or even a manually crafted frequency list) should fix a substantial number of "obvious" errors.

There are plans to add Belarusian language support to pymorphy2 based on the Belarusian N-korpus¹⁷ grammar database.

Although pymorphy2 is already fast enough for many use cases (tens of thousands of words per second in a single thread), there is room for further speed improvements.

¹⁶ Anonymized results: http://ru-eval.ru/tables_index.html
¹⁷ http://bnkorpus.info

References

1. Astaf’eva I., Bonch-Osmolovskaya A., Garejshina A., Grishina Ju., D’jachkov V., Ionov M., Koroleva A., Kudrinsky M., Lityagina A., Luchina E., Sidorova E., Toldova S., Lyashevskaya O., Savchuk S., Koval’ S.: NLP Evaluation: Russian Morphological Parsers. In: Kibrik A. (ed.) Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue”. Volume 1. (2010).
2. Bocharov, V.V., Granovsky, D.V., Surikov, A.V.: Probabilistic Tokenization Model in the OpenCorpora Project [Veroyatnastnaya model’ tokenizacii v proekte Otkritiy Korpus]. In: New Information Technology in Automated Systems: proceedings of the 15th seminar [Noviye informacionnie tehnologii v avtomatizirovannih sistemah: materiali pyatnadcatogo nauchno-prakticheskogo seminara]. M., 2012.
3. Bocharov, V.V., Alexeeva, S.V., Granovsky, D.V., Protopopova, E.V., Stepanova, M.E., Surikov, A.V.: Crowdsourcing morphological annotation. In: Selegey V. (ed.) Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue”. Volume 1. (2013).
4. Bolshakov, I. A., Bolshakova, E. I.: An Automatic Morphological Classifier of Noun Phrases in Russian. In: Kibrik A. (ed.) Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue”. Volume 1. (2012).
5. Daciuk, J., Watson, B.W., Mihov, S., Watson, R.E.: Incremental Construction of Minimal Acyclic Finite-State Automata. Computational Linguistics 26(1) (March 2000) 3-16.
6. Daciuk, J.: Treatment of Unknown Words. In: Proceedings of the Workshop on Implementing Automata WIA’99, Potsdam, Germany, 1999. Springer Verlag, LNCS Series Volume 2214, pp. 71-80, 2001.
7. Krylov, S. A., Starostin, S. A.: Current Morphological Analysis and Synthesis Challenges in the STARLING system [Aktualniye zadachi morfologicheskogo analiza i sinteza v integrirovannoy informacionnoy srede STARLING]. In: Proceedings of the International Conference “Dialog 2003” (2003).
8. Mikheev, A.: Automatic Rule Induction for Unknown Word Guessing. In: Computational Linguistics, Vol. 23(3) (1997) 405-423.
9. Segalovich, I.: A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine. In: Proc. of MLMTA-2003, Las Vegas (2003).
10. Sokirko, A.: Morphological Modules on the web-site www.aot.ru [Morphologicheskie Moduli na saite www.aot.ru]. Computational Linguistics and Intelligent Technologies: Proceedings of the International Conference “Dialog 2004” (2004).
11. Yata, S., Morita, K., Fuketa, M., Aoe, J.: Fast String Matching with Space-efficient Word Graphs. Innovations in Information Technology (Innovations ’08), Al Ain, United Arab Emirates (December 2008) 79-83.
12. Zaliznjak, A. A.: Grammaticeskij slovar’ russkogo jazyka. Moscow, Russia (1977).
13. Zanegina, N. N.: Improvised-temporary-compounds as a new expressive mean in Russian. In: Kibrik A. (ed.) Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue”. Volume 1. (2012).
14. Zipf, G.K.: Selected Studies of the Principle of Relative Frequency in Language. Cambridge, MA: Harvard University Press (1932).