employ bilingual dictionaries for building the Turkish WordNet using existing WordNets. Multilingual resources represent the next stage in WordNet history. EuroWordNet, described by Vossen (1998), was built for Dutch, Italian, Spanish, German, French, Czech, Estonian and English. Tufis et al.
(2004) explain the methods used to create BalkaNet for Bulgarian, Greek, Romanian, Serbian and Turkish. These projects developed monolingual WordNets for a group of languages and aligned them to the structure of Princeton WordNet by means of the Inter-Lingual Index. Following the creation of Princeton WordNet in 1998, similar lexical ontologies with a similar set of node types and relations found their use in NLP. They are generally called WordNets and have been developed for many languages. Methods for building new WordNets range from those based on human labor to mostly automated ones.
Typically, automated approaches involve either extracting relations from machine-readable dictionaries or translating an existing WordNet, although other approaches have been attempted as well. While translation-based approaches are the simplest, they presume that the ontological structure of different languages is similar, which is questionable, especially for languages from different families. Several attempts have been made to create a Russian WordNet.
Azarova et al. (2002) attempted to create a Russian WordNet from scratch using the merge approach: first the authors created the core of the Base Concepts by combining the most frequent Russian words with the so-called “core of the national mental lexicon” extracted from the Russian Word Association Thesaurus, and then proceeded with linking the structure of RussNet to EuroWordNet. The result, according to the project’s site, contains more than 5500 synsets, which have not been published for general use. The group of Balkova et al. (2004) started a large project based on bilingual and monolingual dictionaries and manual lexicographic work. As of 2004, the project was reported to have nearly 145 000 synsets (Balkova et al.
2004), but no website is available (Loukachevitch and Dobrov, 2014). Gelfenbeyn et al. (2003) used direct machine translation without any manual intervention or proofreading to create a resource for a Russian WordNet. The RuThes project by Loukachevitch and Dobrov (2014), which differs in structure from the canonical Princeton WordNet, is a linguistically motivated ontology and contains 158 000 words and 53 500 concepts at the moment of writing. The YARN (Yet Another RussNet) project, described by Ustalov (2014), takes a crowdsourcing approach to creating a WordNet-like machine-readable open online thesaurus; at the time of writing it contains more than 46 500 synsets and more than 119 500 words, but lacks any relations between synsets. According to the view accepted within this thesis, the most efficient method for building an electronic thesaurus is extracting relations from explanatory dictionaries, which makes it possible to develop a large part of a thesaurus with the least contribution from an expert’s side.
Thus, the previously mentioned attempts to create a WordNet for the Russian language did not lead to a freely available, complete lexical ontology conforming to the WordNet definition, and the niche for a Russian resource of this kind remains vacant. The aim of the present work is to create a methodology that results in a corpus of thesaurus relations which can eventually be used for compiling a full-fledged thesaurus. Another question we seek to answer is whether limited resources – both expert and electronic – are enough to obtain reliable results.
A tool enabling the development of a corpus of thesaurus relations on the basis of limited resources would considerably broaden the range of languages supported by electronic thesauri. The accepted approach serves as a basis for investigating the typology of taxonomic relations between the basic concepts among speakers of different languages and includes the following steps:
● building up a corpus of definitions;
● extracting triples «meaning – relation – lexeme» from the definitions;
● word sense disambiguation.
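The intermediate result of these steps can be pictured with a short Python sketch of a «meaning – relation – lexeme» triple and of the chains assembled from such triples; the class and field names are invented for illustration and do not reproduce the actual implementation described in the thesis:

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Triple:
    """One «meaning – relation – lexeme» record extracted from a definition."""
    meaning: str   # identifier of the defined word sense
    relation: str  # thesaurus relation type, e.g. "hypernym"
    lexeme: str    # the related word as it occurs in the definition

def build_chains(triples: List[Triple]) -> List[List[str]]:
    """Follow hypernym triples upwards to produce chains of word senses."""
    by_meaning: Dict[str, Triple] = {
        t.meaning: t for t in triples if t.relation == "hypernym"
    }
    chains = []
    for start in by_meaning:
        chain, current = [start], start
        # after word sense disambiguation the lexeme can be resolved to a sense
        # identifier, so the chain links word meanings rather than surface forms
        while current in by_meaning and by_meaning[current].lexeme not in chain:
            current = by_meaning[current].lexeme
            chain.append(current)
        chains.append(chain)
    return chains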
This process generates chains of word meanings linked by thesaurus relations. Although these chains might require verification, they significantly simplify the manual contribution to the process of thesaurus development. They may also be used as an independent linguistic tool for the numerous tasks that any thesaurus is apt for, as well as a means of verification and enrichment of existing resources.
***
Chapter II «Monolingual dictionaries: a semi-structured resource» presents a brief overview of the explanatory dictionaries available for the Russian language; it also gives a thorough description of how the definition corpus is developed. With the development of lexicography, the structure of dictionary entries has become more and more homogeneous. Thus, some contemporary dictionaries define a closed set of words that can be used in definition entries, enumerate all meanings and ascribe corresponding indices to words (LDOCE).
Lexicography today faces a number of challenges: dictionaries are to be electronically based, and their layout should be separated from their purpose and markup. Such dictionaries are referred to as machine-readable (e.g. Den Danske Ordbog). Contributing new words to a dictionary is another challenge. An example of a dictionary that is constantly enriched with new words is Wiktionary; however, due to its nature, the uniformity of its entries’ structure is relatively low. The present thesis focuses on data extraction from dictionary definitions.
This choice is motivated, among other reasons, by the fact that the language of dictionary entries is a limited subset of the natural language. As the data source for the present work we took the Big Russian Explanatory Dictionary (BRED) by S.A. Kuznetsov. Although BRED describes its word meanings as being presented on three levels, in the course of our work we will show that it actually offers a more extended hierarchy of meanings. The present research thoroughly describes the typical stages necessary for developing a definition corpus on the basis of BRED, the most important of which are building the hierarchical structure of a definition entry and extracting the necessary information from the structured entry.
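As a rough illustration of what such a structured entry may look like, the following Python sketch models a BRED entry as a headword with a hierarchy of senses; the class and field names are hypothetical and serve only to make the hierarchy explicit:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Sense:
    """A single meaning; BRED meanings form a hierarchy (meaning, shade of meaning, ...)."""
    index: str                  # e.g. "1" or "1.2" for a nested shade of meaning
    definition: str             # the definition text itself
    subsenses: List["Sense"] = field(default_factory=list)

@dataclass
class Entry:
    """A dictionary entry: the title word plus its hierarchy of senses."""
    headword: str
    pos: Optional[str] = None   # part of speech assigned during preprocessing
    senses: List[Sense] = field(default_factory=list)

def iter_definitions(entry: Entry):
    """Flatten the hierarchy: yield (headword, sense index, definition text) triples."""
    stack = list(entry.senses)
    while stack:
        sense = stack.pop()
        yield entry.headword, sense.index, sense.definition
        stack.extend(sense.subsenses)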
Chapter II also gives an account of the difficulties connected with assigning morphological properties on the basis of Kuznetsov’s dictionary. Thus, we conducted an experiment with the Mystem tagging tool to evaluate its precision in POS annotation of the title words of the dictionary. The results show that on a subset of 1000 words the precision of the POS markup is 98.0%. Finally, we obtained a corpus of noun definitions that is used in all the experiments set within the framework of the present study.
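The POS check itself is easy to reproduce with the publicly available Mystem analyzer. The sketch below assumes the pymystem3 wrapper and a small hand-annotated sample of title words; it is not the exact evaluation script used in the study:

from pymystem3 import Mystem   # pip install pymystem3; downloads the mystem binary on first use

mystem = Mystem()

def mystem_pos(word: str) -> str:
    """Return the part-of-speech tag Mystem assigns to a single word ('S' = noun)."""
    for token in mystem.analyze(word):
        analyses = token.get("analysis")
        if analyses:
            # 'gr' looks like 'S,жен,неод=им,ед'; the POS is the first comma-separated field
            return analyses[0]["gr"].split("=")[0].split(",")[0]
    return "UNKNOWN"

def pos_precision(sample) -> float:
    """sample: list of (headword, gold_pos) pairs annotated by hand."""
    hits = sum(1 for word, gold in sample if mystem_pos(word) == gold)
    return hits / len(sample)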
***
Chapter III «Relation extraction» describes the experiments on annotating, in the definitions, the words that hold a thesaurus relation with the defined term. It gives a review of various methods of relation extraction from dictionary definitions that are based on a limited set of lexical and grammatical rules. The main problem associated with this approach stems from the limited discriminating power of such rules. The chapter briefly describes two experiments carried out within the framework of the present study. Both experiments are based on the corpus of noun definitions described in the previous chapter. The first experiment shows that the extraction of hypernym relations based on a single rule is not universally applicable. The rule designed within the experiment is: the first noun in the nominative case in the definition is a hypernym of the defined word. A test corpus was tagged to check the rule.
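A minimal implementation of this rule might look as follows; the sketch again assumes the pymystem3 wrapper, and the way the nominative case is detected from the 'gr' string is a simplification rather than the exact procedure used in the experiment:

from typing import Optional
from pymystem3 import Mystem

mystem = Mystem()

def first_nominative_noun(definition: str) -> Optional[str]:
    """Return the lemma of the first noun in the nominative case: the candidate hypernym."""
    for token in mystem.analyze(definition):
        for parse in token.get("analysis") or []:
            gr = parse["gr"]                      # e.g. 'S,жен,неод=им,ед'
            pos_part, _, case_part = gr.partition("=")
            if pos_part.split(",")[0] == "S" and "им" in case_part:
                return parse["lex"]
    return None

# first_nominative_noun("дерево или кустарник семейства розоцветных")
# is expected to return "дерево" as the hypernym candidate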
It is shown that the accuracy of extracting hypernym relations with this rule is 0.5. The second experiment focused on the possibility of improving the results with the help of preliminary clustering. The following three steps are carried out:
● cluster the word sense definitions,
● have a human expert annotate each cluster as a whole,
● summarize the annotation results.
The aim of the clustering step is to reduce the amount of work for the human annotators. Thus the clustering should produce as few clusters as possible, while each cluster should be as syntactically and semantically coherent as possible. Given a cluster of word sense definitions, an annotator has to answer three questions:
● is it possible to extract a WordNet relation from most definitions in the cluster, and if so, which relation is it,
● what morphosyntactic rule can extract the relation,
● for what fraction of definitions in the cluster does this rule give the correct answer?
To measure the rule quality, the expert assesses the result of applying the rule to 25 cases in each cluster (or to the whole cluster, if it contains fewer than 25 word senses). The annotation guideline strongly suggests that the expert annotate exactly one rule per cluster.
This means that it is more harmful to merge dissimilar clusters at the clustering step than to split similar definitions into several clusters. The aim of clustering in this work is to group together definitions that have the same presentation style and are therefore likely to be parsed by the same morphosyntactic rule.
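As an illustration only, such style-based clustering could be sketched as follows, assuming scikit-learn, bag-of-words features over the first few tokens of each definition and an arbitrary number of clusters; the assumptions and the richer feature set actually used in the work are listed below:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import AgglomerativeClustering

def prefix(definition: str, n_tokens: int = 3) -> str:
    """Keep only the opening of a definition: it usually reveals the wording style."""
    return " ".join(definition.lower().split()[:n_tokens])

def cluster_definitions(definitions, n_clusters: int = 50):
    """Group definitions whose openings look alike; returns one cluster label per definition."""
    vectors = CountVectorizer(ngram_range=(1, 2)).fit_transform(
        [prefix(d) for d in definitions]
    )
    model = AgglomerativeClustering(n_clusters=n_clusters)  # Ward linkage over Euclidean distances
    return model.fit_predict(vectors.toarray())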
The author used the following assumptions about definition structure:
● the first few words are usually enough to guess the entry style;
● the style manifests itself in the syntactic structure;
● some styles manifest themselves in the presence of specific genus terms and have a specific coordination structure for them;
● dictionary authors strive to use a few standardized wording styles, and hence the features defining each style are frequent within the dictionary.
This was accomplished by using the following set of features:
● lexical unigrams: word-form, lemma;
● morphological unigrams: part of speech, every morphological feature as a tag;
● compound morphological unigrams: full morphological parse (gr), immutable morphological features (immutable_gr, e.g.