linis (1185431)

File No. 1185431: linis (Annotations), 2020-08-25, StudIzba


Text from the file

Koltsova O. Yu., Alexeeva S. V., Kolcov S. N.

A Sentiment Lexicon and a Training Collection for Sentiment Analysis of Socio-Political Texts

Koltsova O. Yu. (ekoltsova@hse.ru) 1, Alexeeva S. V. (salexeeva@hse.ru) 1, 2, Kolcov S. N. (skoltsov@hse.ru) 1

1 NRU HSE, St. Petersburg, Russia
2 SPbSU, St. Petersburg, Russia

Keywords: sentiment lexicon, web interface, crowdsourcing of sentiment markup, Russian blogosphere, LiveJournal, marked-up collection, topic modeling

1. Introduction

Sentiment analysis (SA) in Russia has so far focused on polarity detection in customer reviews, as can be clearly seen from the content of the Russian Information Retrieval Seminar (ROMIP) competition on SA (Chetviorkin et al, 2012; Chetviorkin, Loukachevitch, 2013; Loukachevitch et al, 2015). However, marketing professionals are not the only potential “consumers” of automatic sentiment analysis techniques.

Social scientists are increasingly interested in “online public opinion” on various social and political issues or events, as well as in predicting public reaction to those events with online sentiment data. At the moment, no Russian-language sentiment lexicons or machine learning instruments are publicly available (exception: the Chetviorkin-Loukachevitch dictionary of sentiment-bearing words with undefined polarity for consumer reviews in three domains). As a result, researchers in Russia can only rely on commercial services whose methodologies are never completely disclosed. This is often unacceptable for academic users.

This work seeks to make a first step in the development of freely available resources for Russian-language SA. We develop a domain-specific sentiment lexicon and check its quality against a marked-up collection of political and social post fragments written by top bloggers on LiveJournal, the most popular Russian blog platform. Our sentiment analysis task here is reduced to a relatively simple classification of texts into those with prevailing negative emotions and those with prevailing positive emotions, irrespective of the object of these sentiments; that is, we do not solve a political support/oppose classification task.

The rest of the paper is organized as follows.

We first review the previous literature. Next, we explain our data collection, sentiment lexicon formation, and the markup procedure. Then, we report word and text assessment results and analyze the quality of the lexicon. Finally, we close the paper with a conclusion.

An Opinion Word Lexicon and a Training Dataset for Russian Sentiment Analysis of Social Media

2. Related work

SA can be conventionally divided into two main approaches (Pang, Lee, 2008; Medhat et al, 2014):

(1) Lexicon-based approach (Taboada et al, 2011). It searches texts for certain words or phrases whose polarity has been predefined, often in relation to the domain of interest. Such thesauri are often supplemented with a set of rules concerning the use of negation or booster words. Well-known limitations of this approach are domain sensitivity and initial lexical insufficiency, while simplicity is one of its main advantages.

(2) Machine learning approach. It uses marked-up text collections (training datasets), as well as feature lists, as information on which a mathematical algorithm relies while classifying other marked-up collections (test sets). Most such algorithms optimize until the best possible fit with the test set markup is reached. After that, these algorithms are applied to non-marked-up (real-world) collections. This more sophisticated approach most often yields better results; however, it is vulnerable to overfitting and requires large marked-up corpora to achieve high quality.

In addition, these two approaches work differently for different tasks. For instance, the SVM method applied to binary classification of English-language movie reviews has yielded precision of 86.4% (Pang, Lee, 2004), which is particularly high. At the same time, a lexicon-based approach has been successfully used for sentiment analysis of English-language social media with the SentiStrength system (Thelwall et al, 2010) (for more details see section 4).
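As an illustration of the lexicon-based approach with negation and booster rules, a minimal sketch might look as follows. The lexicon entries, the weights, the negation and booster lists, and the function names `score_text` and `classify` are all hypothetical and chosen only for the example; they are not taken from the lexicon developed in this paper or from any cited system.

```python
import re

# Hypothetical toy lexicon: word -> polarity weight (illustrative values only).
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}
# Hypothetical rule lists: negations flip the sign, boosters scale the weight.
NEGATIONS = {"not", "never", "no"}
BOOSTERS = {"very": 1.5, "extremely": 2.0}


def score_text(text):
    """Sum lexicon weights over the text, applying negation and booster rules.

    Assumption: a modifier affects only the next non-modifier word.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    negate = False
    boost = 1.0
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True
            continue
        if tok in BOOSTERS:
            boost *= BOOSTERS[tok]
            continue
        if tok in LEXICON:
            value = LEXICON[tok] * boost
            total += -value if negate else value
        # Any non-modifier word consumes the pending modifiers.
        negate = False
        boost = 1.0
    return total


def classify(text):
    """Label a text by the sign of its score; a zero score is left as 'neutral'."""
    score = score_text(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

The sketch also makes the limitations named above concrete: any sentiment-bearing word missing from `LEXICON` contributes nothing to the score, and the weights would need revisiting for each new domain.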

For the Russian language, during the ROMIP SA competition in 2012, the best results in consumer review classification tasks were obtained by machine learning approaches; however, in political news classification, lexicon-based approaches took the lead (Chetviorkin, Loukachevitch, 2013). The competition organizers attribute this latter success to the great diversity of topics (subdomains) occurring in the news and to the absence of a sufficient training set. These two conditions are met by user-generated social and political content from blogs and social media, the object of our interest, which is why we have chosen the lexicon-based approach as a first step.

Two main methods of sentiment lexicon generation, manual and semi-automatic, are usually described in the literature (Mohammad, Turney, 2013; Taboada et al, 2011).

The manual method is a human markup of words into sentiment classes, which can be very reliable when qualified experts are used. Among the limitations of this approach are its labor-intensive character (although it is no more labor-intensive than the creation of marked-up text collections) and the above-mentioned initial insufficiency. The latter means that initially it is hard to think of all potentially sentiment-bearing words without additional methods of their extraction.

This problem is addressed by semi-automatic methods of lexicon generation, notably by bootstrapping techniques (Thelen, Riloff, 2002; Godbole et al, 2007).

They start with small lists of words with pre-defined polarity (seed words) and automatically extend them with a number of linguistic instruments. Those include measurement of semantic association between words (Turney, 2002), synonym/antonym dictionaries (Hu, Liu, 2004), general dictionaries, or various pre-existing taxonomies (Esuli, Sebastiani, 2005).
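The seed-expansion idea can be sketched with a synonym/antonym dictionary: synonyms inherit the polarity of a known word, and antonyms receive the opposite polarity. This is a simplified illustration and not the exact procedure of any cited work. The dictionaries `SYNONYMS` and `ANTONYMS`, the function name `expand_seeds`, and the `max_depth` limit are assumptions made for the example.

```python
from collections import deque

# Hypothetical toy dictionaries; a real system would load a full resource.
SYNONYMS = {"good": ["fine", "nice"], "nice": ["pleasant"], "bad": ["poor"]}
ANTONYMS = {"good": ["bad"], "pleasant": ["nasty"]}


def expand_seeds(seeds, synonyms, antonyms, max_depth=2):
    """Breadth-first expansion of a seed lexicon.

    seeds maps word -> polarity (+1 or -1). Synonyms copy the polarity,
    antonyms get the opposite sign. A word keeps the first label it receives,
    and max_depth bounds how far labels spread from the seeds.
    """
    polarity = dict(seeds)
    queue = deque((word, 0) for word in seeds)
    while queue:
        word, depth = queue.popleft()
        if depth >= max_depth:
            continue
        candidates = [(s, polarity[word]) for s in synonyms.get(word, [])]
        candidates += [(a, -polarity[word]) for a in antonyms.get(word, [])]
        for cand, pol in candidates:
            if cand not in polarity:
                polarity[cand] = pol
                queue.append((cand, depth + 1))
    return polarity
```

Bounding the depth is a design choice for the sketch: labels propagated through long synonym chains tend to drift in meaning, so a manual check of the expanded list would still be needed.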

Sentiment lexicons developed for other languages are also applied (Mihalcea et al, 2007), although in our experience their usefulness is limited. Sentiment-bearing adverbs may be automatically derived from the respective seed adjectives (Taboada et al, 2011), which is a technique we have borrowed. Chetviorkin and Loukachevitch (2012) offer a methodology for detecting sentiment-bearing words (but not their polarity) in Russian-language customer reviews: having manually annotated 18,362 words, they then train a classifier to detect more sentiment-bearing words and report good quality.

Thus, semi-automatic approaches may solve the problem of labor-intensiveness in manual lexicon construction only partially, while marked-up collections can be a solution only when they emerge without researchers’ effort (e.g. consumer reviews). Classification of other types of content in resource-scarce languages faces a cold start problem. In recent years, it has increasingly been addressed with crowdsourcing, both in SA (Hong et al, 2013) and in other linguistic tasks (Mohammad, Turney, 2011). Crowdsourcing, as a technique relying on cheap or free labor of a large number of lay persons, brings about its own problems, notably the issue of insufficient quality resulting from the lack of qualification or motivation. Approaches to coping with this are still in their infancy.

While Hong et al (2013) suggest motivating volunteers through gamification, Hsueh et al (2009) develop a number of methods to detect and discard low-quality assessments. The gold standard in both cases is, however, expert opinion, which itself is prone to individual biases when it comes to the polarity of political texts. In addition, resource-scarce languages may be resource-scarce precisely because crowdsourcing services are unavailable for either technical or financial reasons.
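One simple way to detect and discard low-quality assessments can be sketched as follows. This is not the method of Hsueh et al (2009). It assumes that no expert gold standard is available and uses the per-item majority label as a stand-in, which is itself a strong assumption. The data layout (annotator, item, label), the threshold `min_agreement`, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sample: (annotator, item, label) triples.
SAMPLE = [
    ("a", 1, "pos"), ("a", 2, "neg"), ("a", 3, "pos"),
    ("b", 1, "pos"), ("b", 2, "neg"), ("b", 3, "pos"),
    ("c", 1, "neg"), ("c", 2, "pos"), ("c", 3, "neg"),
]


def majority_labels(assessments):
    """Return the most frequent label for each item."""
    votes = defaultdict(Counter)
    for _annotator, item, label in assessments:
        votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


def annotator_agreement(assessments):
    """Share of each annotator's labels that match the per-item majority."""
    majority = majority_labels(assessments)
    hits = Counter()
    totals = Counter()
    for annotator, item, label in assessments:
        totals[annotator] += 1
        if label == majority[item]:
            hits[annotator] += 1
    return {annotator: hits[annotator] / totals[annotator] for annotator in totals}


def filter_assessments(assessments, min_agreement=0.6):
    """Drop all assessments from annotators below the agreement threshold."""
    agreement = annotator_agreement(assessments)
    return [row for row in assessments if agreement[row[0]] >= min_agreement]
```

A majority-based filter of this kind can also penalize a competent annotator who holds a minority view, which is close to the individual-bias concern raised above for political texts.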

