Koltsova O. Yu., Alexeeva S. V., Kolcov S. N.

A Sentiment Lexicon and a Training Collection for Sentiment Analysis of Socio-Political Texts

Koltsova O. Yu. (firstname.lastname@example.org) 1, Alexeeva S. V. (email@example.com) 1, 2, Kolcov S. N. (firstname.lastname@example.org) 1

1 National Research University Higher School of Economics, St. Petersburg, Russia
2 St. Petersburg State University, St. Petersburg, Russia

Keywords: sentiment lexicon, web interface, crowdsourcing of sentiment annotation, Russian blogosphere, LiveJournal, annotated collection, topic modeling

1. Introduction

Sentiment analysis (SA) in Russia has so far been focused on polarity detection in customer reviews: this, for instance, can be clearly seen from the content of the Russian Information Retrieval Seminar (ROMIP) competition on SA (Chetviorkin et al, 2012; Chetviorkin, Loukachevitch, 2013; Loukachevitch et al, 2015). However, marketing professionals are not the only potential "consumers" of automatic sentiment analysis techniques.
Social scientists are increasingly interested in "online public opinion" on various social and political issues or events, as well as in predicting public reaction to those events with online sentiment data. At the moment, no Russian-language sentiment lexicons or machine learning instruments are publicly available (exception: the Chetviorkin-Loukachevitch dictionary of sentiment-bearing words with undefined polarity for consumer reviews in three domains). As a result, researchers in Russia can only rely on commercial services whose methodologies are never completely disclosed. This is often unacceptable for academic users.

This work seeks to make a first step in the development of freely available resources for Russian-language SA. We develop a domain-specific sentiment lexicon and check its quality against a marked-up collection of political and social post fragments written by top bloggers on LiveJournal, the most popular Russian blog platform. Our sentiment analysis task here is reduced to a relatively simple classification of texts into those with prevailing negative emotions and those with prevailing positive emotions, irrespective of the object of these sentiments; that is, we do not solve a political support/oppose classification task.

The rest of the paper is organized as follows.
We first review the previous literature. Next, we explain our data collection, sentiment lexicon formation, and the markup procedure. Then, we report word and text assessment results and analyze the quality of the lexicon. Finally, we close the paper with a conclusion.

An Opinion Word Lexicon and a Training Dataset for Russian Sentiment Analysis of Social Media

2. Related work

SA can be conventionally divided into two main approaches (Pang, Lee, 2008; Medhat et al, 2014):

(1) Lexicon-based approach (Taboada et al, 2011). It browses texts for certain words or phrases whose polarity has been predefined, often in relation to the domain of interest.
Such thesauri are often supplemented with a set of rules concerning the use of negation or booster words. Some of the well-known limitations of this approach are domain sensitivity and initial lexical insufficiency, while its simplicity is one of its main advantages.

(2) Machine learning approach.
It uses marked-up text collections (training datasets), as well as feature lists, as information which a mathematical algorithm relies on while classifying other marked-up collections (test sets). Most such algorithms optimize until the best possible fit with the test set markup is reached. After that, these algorithms are applied to non-marked-up (real-world) collections. This more sophisticated approach most often yields better results; however, it is vulnerable to overfitting and requires large marked-up corpora to produce high quality.

In addition, these two approaches work differently for different tasks. For instance, the SVM method for the task of binary classification of English-language movie reviews has yielded precision of 86.4% (Pang, Lee, 2004), which is particularly high. At the same time, a lexicon-based approach has been successfully used for sentiment analysis of English-language social media with the SentiStrength system (Thelwall et al, 2010) (for more details see section 4).
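The lexicon-based approach in item (1) above, with its negation and booster rules, can be illustrated with a minimal Python sketch. The word lists, weights, and function names here are invented for illustration; they are not taken from any lexicon or system cited in this paper, and the rule scope (a modifier applies only to the next word) is a simplifying assumption.

```python
# Toy resources: every entry below is an illustrative assumption,
# not part of any published sentiment lexicon.
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}
NEGATIONS = {"not", "never", "no"}
BOOSTERS = {"very": 2.0, "slightly": 0.5}


def score_text(text):
    """Sum word polarities, applying negation and booster rules."""
    total = 0.0
    negate = False
    boost = 1.0
    for token in text.lower().split():
        word = token.strip(".,!?;:\"'()")
        if word in NEGATIONS:
            negate = True
        elif word in BOOSTERS:
            boost *= BOOSTERS[word]
        elif word in LEXICON:
            value = LEXICON[word] * boost
            total += -value if negate else value
            negate, boost = False, 1.0
        else:
            # Any other word ends the scope of pending modifiers.
            negate, boost = False, 1.0
    return total


def classify(text):
    """Label a text by the sign of its summed score."""
    score = score_text(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `classify("not bad at all")` flips the polarity of "bad" and returns "positive". Because modifiers are reset by any intervening word, a phrase such as "not a good idea" is scored as positive, which shows how sensitive such rule sets are to their scope definition.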
For the Russian language, during the ROMIP SA competition in 2012, the best results in consumer review classification tasks were obtained by machine learning approaches; however, in political news classification, lexicon-based approaches took the lead (Chetviorkin, Loukachevitch, 2013). The competition organizers attribute this latter success to the great diversity of topics (subdomains) occurring in the news and to the absence of a sufficient training set. These two conditions are met by user-generated social and political content from blogs and social media, the object of our interest, which is why we have chosen the lexicon-based approach as a first step.

Two main methods of sentiment lexicon generation, manual and semi-automatic, are usually described in the literature (Mohammad, Turney, 2013; Taboada et al, 2011).
The manual method is a human markup of words into sentiment classes, which can be very reliable when qualified experts are used. Among the limitations of this approach are its labor-intensive character (although not more intensive than the creation of marked-up text collections) and the above-mentioned initial insufficiency. The latter means that initially it is hard to think of all potentially sentiment-bearing words without additional methods of their extraction.

This problem is addressed by semi-automatic methods of lexicon generation, notably by bootstrapping techniques (Thelen, Riloff, 2002; Godbole et al, 2007).
They start with small lists of words with pre-defined polarity (seed words) and automatically extend them with a number of linguistic instruments. Those include measurement of semantic association between words (Turney, 2002), synonym/antonym dictionaries (Hu, Liu, 2004), or general dictionaries or various pre-existing taxonomies (Esuli, Sebastiani, 2005).
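The seed-word expansion just described can be sketched in Python for the synonym/antonym dictionary case. This is a minimal sketch under stated assumptions, not the procedure of any cited work: the toy thesaurus, the function name `expand_seeds`, and the `max_steps` limit are all hypothetical.

```python
from collections import deque

# Hypothetical toy thesaurus; a real system would load a full
# synonym/antonym dictionary instead.
SYNONYMS = {
    "good": ["fine", "nice"],
    "nice": ["pleasant"],
    "bad": ["poor", "awful"],
}
ANTONYMS = {
    "good": ["bad"],
    "pleasant": ["unpleasant"],
}


def expand_seeds(seeds, synonyms, antonyms, max_steps=3):
    """Propagate seed polarity (+1 / -1) through thesaurus links.

    Synonyms inherit the polarity of the source word, antonyms get the
    opposite one. A word keeps the first polarity assigned to it.
    """
    polarity = dict(seeds)
    queue = deque((word, 0) for word in seeds)
    while queue:
        word, depth = queue.popleft()
        if depth >= max_steps:
            continue  # do not expand beyond the allowed number of steps
        links = [(w, polarity[word]) for w in synonyms.get(word, [])]
        links += [(w, -polarity[word]) for w in antonyms.get(word, [])]
        for neighbour, value in links:
            if neighbour not in polarity:
                polarity[neighbour] = value
                queue.append((neighbour, depth + 1))
    return polarity
```

Starting from the single seed `{"good": 1}`, the sketch labels "pleasant" as positive and "unpleasant", "bad", "poor", and "awful" as negative. Limiting the number of steps is one simple way to contain the error accumulation that longer chains of dictionary links tend to produce.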
Sentiment lexicons developed for other languages are also applied (Mihalcea et al, 2007), although in our experience their usefulness is limited. Sentiment-bearing adverbs may be automatically derived from the respective seed adjectives (Taboada et al, 2011), which is a technique we have borrowed. Chetviorkin and Loukachevitch (2012) offer a methodology for detecting sentiment-bearing words (but not their polarity) in Russian-language customer reviews: having manually annotated 18,362 words, they then train a classifier to detect more sentiment-bearing words and report good quality.

Thus, semi-automatic approaches may solve the problem of labor-intensiveness in manual lexicon construction only partially, while marked-up collections can be a solution only when they emerge without researchers' effort (e.g. consumer reviews). Classification of other types of content in resource-scarce languages faces a cold start problem. In recent years, it has been increasingly often addressed with crowdsourcing, both in SA (Hong et al, 2013) and in other linguistic tasks (Mohammad, Turney, 2011). Crowdsourcing, as a technique relying on cheap or free labor of a large number of lay persons, brings about its own problems, notably the issue of insufficient quality resulting from the lack of qualification or motivation. Approaches to coping with this are still in their infancy.
While Hong et al (2013) suggest motivating volunteers through gamification, Hsueh et al (2009) develop a number of methods to detect and discard low-quality assessments. The gold standard in both cases is, however, expert opinion, which itself is prone to individual biases when it comes to the polarity of political texts. In addition, resource-scarce languages may be resource-scarce precisely because crowdsourcing services are unavailable for either technical or financial reasons.
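One simple way to detect and discard low-quality crowdsourced assessments can be sketched in Python. This is not the method of Hsueh et al (2009) or of any other cited work: it assumes that no expert gold standard is available and substitutes the per-item majority vote for it. The data layout, function names, and the 0.5 threshold are hypothetical.

```python
from collections import Counter, defaultdict


def majority_labels(assessments):
    """Map each item to its most frequent label.

    `assessments` is a list of (annotator, item, label) tuples.
    Ties are resolved in favour of the label seen first.
    """
    by_item = defaultdict(list)
    for _, item, label in assessments:
        by_item[item].append(label)
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in by_item.items()}


def annotator_agreement(assessments):
    """Share of each annotator's labels that match the majority label."""
    majority = majority_labels(assessments)
    hits = Counter()
    totals = Counter()
    for annotator, item, label in assessments:
        totals[annotator] += 1
        if label == majority[item]:
            hits[annotator] += 1
    return {a: hits[a] / totals[a] for a in totals}


def filter_assessments(assessments, min_agreement=0.5):
    """Drop all assessments made by annotators below the threshold."""
    agreement = annotator_agreement(assessments)
    return [row for row in assessments
            if agreement[row[0]] >= min_agreement]


# Invented sample data: three annotators, three text fragments.
SAMPLE = [
    ("ann1", "t1", "neg"), ("ann2", "t1", "neg"), ("ann3", "t1", "pos"),
    ("ann1", "t2", "pos"), ("ann2", "t2", "pos"), ("ann3", "t2", "neg"),
    ("ann1", "t3", "neg"), ("ann2", "t3", "neg"), ("ann3", "t3", "neg"),
]
```

On `SAMPLE`, the third annotator agrees with the majority on one item out of three and is filtered out at the default threshold. A majority vote inherits the shared biases of the crowd, so this sketch does not remove the bias problem noted above for politically charged texts.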