ulanovav (Abstracts)

PDF file ulanovav (Abstracts), located in the "miscellaneous" category of the subject "English language", tenth semester. СтудИзба, 2020-08-25.

File description

The file "ulanovav" is located in the following folders inside the archive: Аннотации, 2. It is a PDF file from the archive "Аннотации" (Abstracts), which is in the "miscellaneous" category. All of this belongs to the subject "English language" of the tenth semester and can be found in the file archive of Lomonosov Moscow State University. Despite this archive's direct connection to Lomonosov Moscow State University, it can also be found in other sections.


Context-dependent opinion lexicon translation with the use of a parallel corpus

Ulanov A. V. (alexander.ulanov@hp.com), Sapozhnikov G. A. (gsapozhnikov@gmail.com)
Hewlett-Packard Labs Russia, St. Petersburg State University, St. Petersburg, Russia

Keywords: opinion mining, sentiment analysis, opinion words, machine translation, classification

1. Introduction

Sentiment analysis is one of the most popular information extraction tasks from both business and research perspectives. It has numerous business applications, such as evaluating how a product or company is perceived in social media. From the standpoint of research, sentiment analysis relies on the methods developed for natural language processing and information extraction.

One of the key components of sentiment analysis is the opinion word lexicon. Opinion words are words that carry opinion. Positive words refer to some desired state, while negative words refer to some undesired one. For example, “good” and “beautiful” are positive opinion words, while “bad” and “evil” are negative. Opinion phrases and idioms exist as well. Many opinion words depend on context, like the word “large”. Some opinion phrases are comparative rather than opinionated, for example “better than”.

Auxiliary words such as negations can change the sentiment orientation of a word.

Opinion words are used in a number of sentiment analysis tasks. These include document and sentence sentiment classification, product feature extraction, subjectivity detection, etc. [12]. Opinion words are used as features in sentiment classification. The sentiment orientation of a product feature is usually computed from the sentiment orientation of the opinion words nearby. Product features can be extracted with the help of phrase or dependency patterns that include opinion words and placeholders for the product features themselves. Subjectivity detection relies heavily on opinion word lists as well, because many opinionated phrases are subjective [14].
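To make the role of opinion words and negation concrete, here is a minimal sketch of lexicon-based sentence scoring. The positive and negative words are the examples given in the text; the negation list and the rule "flip the sign when a negation directly precedes the opinion word" are simplifying assumptions made for illustration, not the method of any of the cited works.

```python
# Toy lexicon: the opinion words are the examples from the text.
POSITIVE = {"good", "beautiful"}
NEGATIVE = {"bad", "evil"}
# Assumed negation list, for illustration only.
NEGATIONS = {"not", "no", "never"}


def sentence_polarity(tokens):
    """Sum +1/-1 over opinion words; a directly preceding negation flips the sign."""
    score = 0
    for i, tok in enumerate(tokens):
        if tok in POSITIVE:
            value = 1
        elif tok in NEGATIVE:
            value = -1
        else:
            continue  # not an opinion word
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score
```

For instance, `sentence_polarity("the food was not good".split())` yields a negative score because "not" immediately precedes "good".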

Thus, opinion lexicon generation is an important sentiment analysis task. Detection of opinion word sentiment orientation is an accompanying task.

The opinion lexicon generation task can be solved in several ways. The authors of [12] point out three approaches: manual, dictionary-based and corpus-based.

The manual approach is precise but time-consuming. The dictionary-based approach relies on dictionaries such as WordNet. One starts from a small collection of opinion words and looks for their synonyms and antonyms in a dictionary [10]. The drawback of this approach is that the dictionary coverage is limited and it is hard to create a domain-specific opinion word list. Corpus-based approaches rely on mining a review corpus and use methods employed in information extraction. The approach proposed in [9] is based on a seed list of opinion words. These words are used together with some linguistic constraints like “AND” or “OR” to mine additional opinion words. Clustering is performed to label the mined words in the list as positive or negative. Part-of-speech patterns are used to populate the opinion word dictionary in [21], and Internet search statistics are used to detect the semantic orientation of a word.
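The dictionary-based idea of growing a lexicon from seeds can be sketched as a breadth-first walk over synonym and antonym links. The `THESAURUS` below is a hypothetical toy dictionary invented for this example (it is not WordNet and not its API), and the propagation rule "synonyms keep polarity, antonyms flip it" is an assumption about how such an expansion is usually organized.

```python
from collections import deque

# Hypothetical toy thesaurus: word -> (synonyms, antonyms).
THESAURUS = {
    "good": ({"fine", "nice"}, {"bad"}),
    "bad": ({"poor"}, {"good"}),
    "nice": ({"pleasant"}, set()),
    "poor": (set(), {"fine"}),
}


def expand_seeds(seeds, thesaurus):
    """Propagate polarity (+1/-1) from seed words through the thesaurus."""
    polarity = dict(seeds)
    queue = deque(seeds)  # iterates over the seed words
    while queue:
        word = queue.popleft()
        synonyms, antonyms = thesaurus.get(word, (set(), set()))
        links = [(w, 1) for w in sorted(synonyms)]
        links += [(w, -1) for w in sorted(antonyms)]
        for neighbour, sign in links:
            if neighbour not in polarity:  # first label wins
                polarity[neighbour] = sign * polarity[word]
                queue.append(neighbour)
    return polarity


lexicon = expand_seeds({"good": 1}, THESAURUS)
```

Words absent from the thesaurus are never reached, which mirrors the limited-coverage drawback noted above.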

Work [7] extends the mentioned approaches and introduces a method for extracting context-based opinion words together with their orientation. Classification techniques are used in [2] to filter out opinion words from text. The approaches described were applied to English. There are some works that deal with Russian. For example, paper [4] proposes to use classification. Various features, such as word frequency, weirdness, and TF-IDF, are used there.

Most of the research done in the field of sentiment analysis relies on the presence of annotated resources for a given language. However, there are methods which automatically generate resources for a target language, given that there are tools and resources available in the source language.

Different approaches to multilingual subjectivity analysis are studied in [14] and [1] and are summarized in [3]. In one of them, a subjectivity lexicon in the source language is translated with the use of a dictionary and employed for subjectivity classification. This approach delivers mediocre precision due to the use of the first translation option and due to word lemmatization.

Another approach suggests translating the corpus. This can be done in three different ways: translating an annotated corpus in the source language and projecting its labels; automatically annotating the corpus, translating it and projecting the labels; or translating a target-language corpus into the source language, annotating it automatically and projecting the labels back.

Language Weaver¹ machine translation was used on English-Romanian and English-Spanish data [3]. Classification experiments with the produced corpora showed similar results. They are close to the case when test data is translated and annotated automatically.

This shows that machine translation systems are good enough for translating opinionated datasets. This is also confirmed by the authors of [19], who used Google Translate², Microsoft Bing Translator³ and Moses⁴.

Multilingual opinion lexicon generation is considered in the recent paper [19], which presents a semi-automatic approach with the use of triangulation. The authors use high-quality lexicons in two different languages and then translate them automatically into a third language with Google Translate.

The words that are found in both translations are supposed to have good precision. This was confirmed for several languages, including Russian, by a manual check of the resulting lists. The same authors collect and examine entity-centered sentiment-annotated parallel corpora [20].

In this paper we develop the idea of multilingual sentiment analysis. We propose a method for projecting an opinion lexicon from a source language to a target language with the use of a parallel corpus. We apply it to the English-Russian language pair, using a collection of parallel and pseudo-parallel review corpora.

The method is evaluated against a baseline, which is a translation of the opinion word lexicon with Google Translate. Sentiment classification experiments are conducted to evaluate the quality of the lexicons. The advantages of our method are the following. It captures the context of opinion words, thus producing correct translations. It doesn’t require a machine translation tool, as in [19], or a bilingual dictionary, as in [14]. However, a machine translation tool may be employed in the absence of a parallel corpus or for better recall. The opinion lexicon is needed in only one language, unlike in work [19], where two lexicons are required.

¹ http://www.sdl.com/products/automated-translation/
² http://translate.google.com/
³ http://www.bing.com/translator
⁴ http://www.statmt.org/moses/

2. Approach

The idea of our approach is to use a parallel corpus to construct an opinion lexicon in a target language, given that there is an opinion lexicon in a source language. A parallel corpus is a text with its translation into the target language. We suppose that it contains opinionated sentences. An opinion lexicon is a set of words carrying opinion. It is not necessarily divided into positive/negative or other groups.

The opinion lexicon for the target language is extracted from the parallel corpus by translating the words from the opinion lexicon in the source language. The algorithm of the method is as follows:
1. Collect a corpus of parallel reviews, align sentences
2. Compute word lexical translation probabilities
3. Collect opinion words translations and normalize them

Let us consider the mentioned steps in greater detail. The task of parallel corpus acquisition and preparation is a well-studied area of research [8]. One collects or crawls data that is available in different languages.

Parallel documents are determined by some identifier, e.g. name, time, or specific number. Documents are split into sentences by a sentence splitter, and paragraph boundaries are preserved. The resulting text is processed by a sentence aligner. A parallel corpus with opinionated texts can be obtained from sites that post reviews in different languages (manually translated). Usually, such reviews are editorial.
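The corpus preparation step can be illustrated with a small sketch. It pairs documents by a shared identifier, splits them with a naive regular-expression sentence splitter, and accepts a document pair only when both sides have the same number of sentences. All function names are hypothetical, and the equal-count rule is only a crude stand-in for a real sentence aligner, which the text does not specify.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def pair_documents(source_docs, target_docs):
    """Pair documents that share the same identifier (the dict key)."""
    shared = sorted(source_docs.keys() & target_docs.keys())
    return [(source_docs[k], target_docs[k]) for k in shared]


def align_one_to_one(src_text, tgt_text):
    """Stand-in aligner: zip sentences only if both sides have equal counts."""
    src, tgt = split_sentences(src_text), split_sentences(tgt_text)
    if len(src) != len(tgt):
        return []  # a real aligner would handle merges and splits
    return list(zip(src, tgt))


english = {"r1": "Great phone. The screen is large.", "r2": "Bad."}
russian = {"r1": "Отличный телефон. Экран большой.", "r3": "Плохо."}
pairs = pair_documents(english, russian)
aligned = [align_one_to_one(s, t) for s, t in pairs]
```

Only `r1` exists on both sides, so it is the only document pair that reaches the alignment step.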

They contain opinionated text; however, opinion words there tend to be more polite than in forums or user reviews. The size of the corpus is less important than the coverage of words from the source opinion lexicon. In the absence of a natural parallel corpus, a pseudo-parallel corpus can be used [20], which is a text along with its translation done by an automatic translation system.

Lexical translation probabilities of words are computed on the aligned corpus: $p(s \to t)$ and $p(t \to s)$, where $t$ is a word in the target language and $s$ is a word in the source language. Lexical translation is a translation of a word in isolation. To compute it, one has to count how many times a certain word was translated into different options within the aligned sentences:

$$p(s \to t) = \frac{\mathrm{count}(s \to t)}{\sum_{t'} \mathrm{count}(s \to t')}$$
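The counting described above amounts to a relative-frequency estimate, which can be sketched as follows. The sketch assumes that word-level alignment links inside the aligned sentences are already available as `(source_word, target_word)` pairs; how those links are produced is not covered here, and the function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def lexical_translation_probs(aligned_word_pairs):
    """Estimate p(s -> t) as count(s -> t) divided by the total count for s."""
    pair_counts = Counter(aligned_word_pairs)
    source_totals = Counter(s for s, _ in aligned_word_pairs)
    probs = defaultdict(dict)
    for (s, t), count in pair_counts.items():
        probs[s][t] = count / source_totals[s]
    return dict(probs)


# Hypothetical alignment links collected from aligned sentence pairs.
links = [
    ("good", "хороший"),
    ("good", "хороший"),
    ("good", "добрый"),
    ("large", "большой"),
]
p_s_to_t = lexical_translation_probs(links)
# The reverse direction uses the same counts with the pairs swapped.
p_t_to_s = lexical_translation_probs([(t, s) for s, t in links])
```

Each source word's translation options sum to one, so the most frequent option for an opinion word can be read off directly.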
