Automation of Lexical-Typological Research: Methods and Tools

Summary (file no. 1137501), page 3, dated 2019-05-20


Vector processing: vector weighting (none vs. PPMI vs. PLMI vs. PLOG vs. EPMI) and dimensionality reduction (none vs. SVD to 300 dimensions)

6. Type of vector representation of phrases: the observed vector (when the focal word combination is regarded as a single lexical unit) vs. the composed vector (when the vector of the phrase is computed from the vectors of its constituent words according to one of the following models of composition: additive, weighted additive, multiplicative, dilation, lexical function, practical lexical function, and PLF)

Thus, two vector representations were obtained for each micro-frame: the typological and the distributional vectors.
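To make two of the options listed above concrete, the following minimal Python sketch applies PPMI weighting to a toy co-occurrence matrix and composes a phrase vector with the additive and the multiplicative models. It is an illustration only, not the implementation used in the study: the counts are invented, the base-2 logarithm is an assumption, and the function names are hypothetical. The PLF model is not shown, since it requires trained function-word matrices.

```python
import math

def ppmi(counts):
    """Apply positive PMI weighting to a word-by-context count matrix.

    `counts` is a list of rows (one row per target word). The base-2
    logarithm is an assumption; the summary does not state the base.
    """
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    weighted = []
    for i, row in enumerate(counts):
        out = []
        for j, c in enumerate(row):
            if c == 0:
                out.append(0.0)  # unseen pairs get zero weight
            else:
                pmi = math.log2(c * total / (row_sums[i] * col_sums[j]))
                out.append(max(0.0, pmi))  # negative PMI is clipped to zero
        weighted.append(out)
    return weighted

def compose_additive(u, v):
    """Additive composition: element-wise sum of two word vectors."""
    return [a + b for a, b in zip(u, v)]

def compose_multiplicative(u, v):
    """Multiplicative composition: element-wise product of two word vectors."""
    return [a * b for a, b in zip(u, v)]

# Invented toy counts: two target words (rows) by three contexts (columns).
counts = [[4, 0, 1],
          [1, 3, 0]]
vectors = ppmi(counts)

# Composed vectors for a hypothetical two-word phrase built from both rows.
phrase_add = compose_additive(vectors[0], vectors[1])
phrase_mult = compose_multiplicative(vectors[0], vectors[1])
```

The observed vector of a phrase would instead be a row of the count matrix built for the phrase treated as a single unit; the composed variants only need the vectors of the individual words.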

After that, we calculated the typological and the distributional distances for all the possible pairs of micro-frames within each field and computed the Pearson correlation coefficient between the two metrics.

Both of the fields under study demonstrated very high correlation coefficients (0.766 for the field ‘sharp’ and 0.946 for the field ‘smooth’). As some of the parameters of the distributional models were run with various settings, it is important to note that the best results for both fields were achieved with the same set of settings: the main subcorpus of the RNC as the training corpus, vector weighting with PPMI, dimensionality reduction to 300 dimensions, and the PLF (practical lexical function, see Paperno et al. 2014) model to vectorize phrases as a composition of the vectors of their constituent words.
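The evaluation step described above (pairwise distances in each space, then a Pearson correlation between the two lists of distances) can be sketched as follows. This is a minimal illustration under stated assumptions: cosine distance is assumed because the summary does not name the distance measure, the vectors are invented toy data, and all function names are hypothetical.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def pairwise_distances(vectors):
    """Distances for all unordered pairs of items, in a fixed order."""
    return [cosine_distance(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented toy data: four micro-frames, each with one typological vector
# and one distributional vector, listed in the same order in both lists.
typological = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
distributional = [[0.9, 0.8, 0.1], [1.0, 0.2, 0.1],
                  [0.1, 0.7, 0.9], [0.0, 0.1, 1.0]]

typ_dist = pairwise_distances(typological)
dis_dist = pairwise_distances(distributional)
r = pearson(typ_dist, dis_dist)
```

Because both distance lists are built over the same ordered pairs of micro-frames, the coefficient `r` measures how closely the distributional space reproduces the typological one.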

It should also be noted that the best results were obtained for the direct meanings of the adjectives; adding metaphoric frames significantly decreased the overall performance: the Pearson correlation coefficient in this case was 0.462 for the field ‘sharp’ and 0.604 for the field ‘smooth’. A possible explanation is that direct meanings possess a more distinct and, importantly, a more predictable frame structure than figurative meanings; although figurative meanings are motivated by direct ones, the distribution of figurative meanings in the language is less even.

Thus, despite the commonly held opinion (see, e.g., Bullinaria and Levy 2012), the quality of the model does not increase in proportion to the size of the training corpus; in our case, the small yet well-balanced main subcorpus of the RNC yields a higher result than the joint corpus of approximately 1.44bn tokens comprising the main and the newspaper subcorpora of the RNC and the ruWaC corpus (cf. a similar observation in Kutuzov, Kuzmenko 2015).

The main subcorpus of the RNC proved to be sufficient for obtaining high-quality vector representations of individual lemmas.

At the same time, even the joint training corpus did not appear to be large enough to produce reliable combinability profiles of two-word phrases; the quality increases significantly (compared with the observed co-occurrence vectors) when we apply any of the composition models.

Evidence from only two semantic fields does not allow us to make definitive conclusions with a high degree of confidence; therefore, we performed two further experiments. In the first experiment we implemented the best parameters of the distributional models from the two previous experiments with adjectives, but this time we applied them to a verbal semantic field (‘oscillation’)¹. In the second experiment we trained the models on a new corpus, the English ukWaC, and tested them on the field of ‘sharp’. Thus, the invariable typological distances were compared against the distributional distances between the respective English word combinations, e.g. sharp needle, sharp spear, sharp arrow, etc. Both of the experiments yielded high Pearson correlation coefficients: 0.7 for ‘oscillation’ and 0.668 for ‘sharp’. These results offer further support for the hypothesis that the frame structure of a field can be roughly outlined using data from only one language, and it is irrelevant which language provides the data.

¹ We thank Maria Shapiro, who provided her typological data for this experiment.

Finally, the correspondences between the typological and the distributional spaces are convincingly demonstrated by their visualizations. We used multidimensional scaling to project the two spaces of each field onto planes, where each frame is coded with its specific colour. Figs. 1-3 show the plots for the typological and the distributional spaces of the field ‘sharp’. Green markers in all the plots depict the frame ‘instrument with a sharp functional edge’; the frame ‘instrument with a sharp functional end-point’ is coded with blue; yellow represents the frame ‘elongated object’, and red stands for the frame ‘object with prickly surface’.

It should be emphasised that these clusters were not based on the obtained maps; they were initially determined on the basis of the typological studies carried out by the MLexT group. From here on, we will refer to the ‘green’, ‘blue’, ‘yellow’, and ‘red’ clusters as the respective four frames of the field ‘sharp’.

Figs. 1-3 demonstrate a remarkable effect. Visualization of the typological space (Fig. 1) accurately reflects the frame structure of the field². Visualization of the distributional space, by contrast, captures only the juxtapositions that are lexicalised in the given language. For example, the map in Fig. 2 was generated from the Russian data, and the contexts for the adjective koljučij (‘prickly’) stand out from the rest, while the frames of the adjective ostryj (‘sharp’) form a smooth continuum. Fig. 3 depicts the visualization of the distributional space for the field ‘sharp’ built on the basis of the French corpus (for visual clarity, it contains only the frames that are not differentiated in Russian). In the French data, the frame for ‘instrument with a sharp functional edge’ is distinctly juxtaposed to the frames ‘instrument with a sharp functional end-point’ and ‘elongated object’ because the first of these frames is denoted by the adjective tranchant while the other two correspond to pointu, and this juxtaposition is lexicalised in French.

² Note that multidimensional scaling was successfully used in typology for automated generation of semantic maps (see Croft and Poole 2008, Wälchli and Cysouw 2012, and others). A more detailed discussion of multidimensional scaling will follow in Chapter 5.

Fig. 1. Visualization of the typological space of the field ‘sharp’

Fig. 2. Visualization of the distributional space of the field ‘sharp’ generated on the basis of the Russian corpus

Fig. 3. Visualization of the distributional space of the field ‘sharp’ generated on the basis of the French corpus (without the frame ‘object with surface that pricks’)

However, the picture changes dramatically if we do not indiscriminately plot all the objects in a vector space; instead, we determine the nucleus of each frame, and plot only these nuclei. We computed the means of each dimension in every “cluster” to define the centre of each frame, and mapped these new reduced spaces onto a plane.
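The nucleus computation described above, taking the mean of each dimension within every frame cluster, can be sketched in a few lines. This is a minimal illustration with invented vectors and hypothetical frame labels; the subsequent projection onto a plane (multidimensional scaling) is not shown.

```python
from collections import defaultdict

def frame_centroids(vectors, labels):
    """Return the per-dimension mean vector (centroid) of each frame.

    `vectors` and `labels` are parallel lists: labels[i] names the frame
    to which vectors[i] was assigned in the typological study.
    """
    groups = defaultdict(list)
    for vec, label in zip(vectors, labels):
        groups[label].append(vec)
    return {
        label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for label, vecs in groups.items()
    }

# Invented toy data: three context vectors assigned to two frames.
vectors = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
labels = ["edge", "edge", "point"]

centroids = frame_centroids(vectors, labels)
```

Each frame is thereby reduced to a single point, so the reduced space contains as many objects as there are frames, and only these points are passed on to the projection step.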

As demonstrated in Fig. 5, the maps produced with this method from the data of only one language are identical to the conventional discrete semantic maps (Fig. 4).

Fig. 4. Semantic map of the field ‘sharp’ compiled manually from typological data

Fig. 5. Automatically generated semantic map of the field ‘sharp’: the mapping of the vector space constituted by the central objects of each frame cluster

There are three principal conclusions that follow from our findings:

(1) Significant correlation is observed in all four experiments between the typological and the distributional spaces; therefore, accurate manually collected typological data can be used to evaluate the quality of distributional models. Such a metric has several advantages over the other existing methods (e.g. comparing distributional distances against spontaneous judgements elicited from speakers of the language, or against the length of the path in the tree of a thesaurus); the key advantage of our metric is its objectivity. Its major drawback is primarily associated with the lack of reliable typological data; however, we expect that advancements in algorithms for automated data collection will help to resolve this problem in the short term.

(2) Our results offer further evidence that corroborates the presence of a linguistically grounded semantic reality behind the concept of frame. Nevertheless, it would be an oversimplification to regard frames as points in the semantic space; this approach was motivated by manual processing of the data. In fact, the frame structure of a semantic field seems to be continuous rather than discrete; however, this continuum of meanings does have distinct focal points (cf. Кибрик 2012); these are the frames that in most cases define the principles of lexicalization of the field.

(3) Distributional semantic methods and data from one language are sufficient to produce a rough outline of a semantic field; moreover, the experiment with the English corpus suggests that the choice of the initial language does not affect the final result.

We have provided additional evidence that supports the concept of frame and demonstrates that it is well-grounded both in theory and practice; now we move on to the discussion of the possible methods that can be used to automate the stages of a research project in frame-based lexical typology.

Observation (3) from the previous chapter allowed us to develop a method for generating lexical typological questionnaires from the data of one language.

The method is described in Chapter 3, “Automated development of questionnaires with distributional semantic models”.

The suggested algorithm produces draft questionnaires for typological studies of adjectives and other one-place predicates, such as the verbs of motion, sound, or state.

The algorithm consists of the following steps:

1. Compile the list of nouns that co-occur with the adjectives / verbs under examination (in the main subcorpus of the RNC);
2. Obtain the vector of co-occurrences for each word combination;
3.


Document type: Candidate of Sciences dissertation

Subject: Philology

Institution: HSE University (НИУ ВШЭ)
