Summary_Скоринкин_14.12.2018 (1137486), page 2
We demonstrate that the view of fictional character as a model dates back to the ancient world. In the XX century a major split occurred between formal and psychological approaches to the study of fictional character. The formal approaches developed by formalism and later structuralism viewed a fictional character as a set of ‘text spans’, a reductionist idea which nevertheless turned out to be very relevant for digital analysis. Psychological approaches, on the other hand, considered character to be a more or less accurate model of personality.
One of their particular foci was the direct speech of fictional characters, which was sometimes praised as the most ‘straightforward’ way for a writer to describe a character. In the second half of the XX century, hybrid approaches started to emerge.

The second section of the first chapter describes modern computational approaches to the modeling of fictional characters and character systems. In this section we show how ideas of non-digital scholarship (described in the first section) re-emerge in the digital environment. For instance, structuralist approaches influenced the contemporary field of literary network analysis, while some ideas of the psychologically oriented approaches affected current practices of researching character speech.
However, all modern computational approaches to the study of character systems face the challenge of reliable data extraction from the text. Only the latest papers tend to use semantic XML-based markup to address the issue, but such markup is mostly produced for dramatic texts.

The third section of the chapter describes the markup we created for the text of War and Peace. The markup is consistent with TEI/XML, an international standard for encoding texts in the humanities and digital preservation sphere. TEI/XML has been used to mark up two layers of semantic data.
The first layer consists of character mentions, merged into coreference links through unique character identifiers. To produce this layer automatically, we used ABBYY Compreno named entity recognition and information extraction tools, and then added our own list of names to help coreference resolution. To test the quality of the resulting markup we used standard measures from natural language processing and information retrieval, such as precision, recall, and F-measure.
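For reference, the balanced F-measure is the harmonic mean of precision and recall. A minimal Python sketch follows; the function name and the mention counts in the example are illustrative assumptions, not data from the study.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from raw counts of extracted mentions."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 94 correct mentions, 6 spurious, 46 missed.
p, r, f = precision_recall_f1(94, 6, 46)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

With a precision of 0.94 and a recall of 0.67, the harmonic mean 2 · 0.94 · 0.67 / (0.94 + 0.67) comes to roughly 0.78, which suggests that the F-measure reported in this summary is the balanced (F1) variant.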
The evaluation of the resulting markup demonstrated 78.2% overall F-measure on the task of character extraction and identification, with precision reaching 94% and recall being around 67%. The second layer of the markup consists of direct speech annotation. The speaker was extracted partly automatically (in cases where s/he was mentioned explicitly, as in ‘said Natasha’), but most speakers and all addressees of direct speech were later marked up by hand.

The second chapter describes the experiment in which a character space was built through quantitative analysis of direct speech. Having extracted direct speech from the markup, we then applied two different methods of quantitative analysis to demonstrate the difference between them. The first method used was Delta [Burrows, 2002], a baseline stylometric tool for authorship attribution and other tasks in the field of computational stylistics.
The method relies on the frequency distribution of the most frequent words (lemmatized in our case) in the characters' speech. The size of the word list was established in an experiment fundamentally similar to experiments on authorship attribution. Each character’s speech was treated as a corpus of works by one author. The ‘corpus’ of each character was randomly divided into two comparable collections, our training and test sets.
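A rough sketch of how a Delta distance of this kind can be computed: relative frequencies of the most frequent lemmas are converted to z-scores across all texts, and the distance between two texts is the mean absolute difference of their z-score profiles. The code is plain Python written for illustration under those assumptions. The function names and the toy English data are hypothetical, the population standard deviation is an arbitrary choice, and this is not the implementation used in the study.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(tokens, vocabulary):
    """Relative frequency of each vocabulary item in a non-empty list of lemmas."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocabulary]


def delta_matrix(texts, n_mfw=130):
    """Pairwise Burrows-style Delta between texts given as {name: [lemmas]}."""
    # The most frequent lemmas are chosen over the pooled corpus.
    pooled = Counter(t for tokens in texts.values() for t in tokens)
    vocabulary = [w for w, _ in pooled.most_common(n_mfw)]
    freqs = {name: relative_freqs(toks, vocabulary) for name, toks in texts.items()}

    # z-score each lemma's frequency across all texts.
    columns = list(zip(*freqs.values()))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero
    z = {
        name: [(f - mu) / sd for f, mu, sd in zip(row, mus, sigmas)]
        for name, row in freqs.items()
    }

    # Delta is the mean absolute difference between two z-score profiles.
    names = list(texts)
    return {
        (a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
        for a in names
        for b in names
    }


# Toy "characters" (hypothetical data, already tokenized).
toy = {
    "A": "i do not know what you mean my dear".split(),
    "B": "oh what do you mean i do not know".split(),
    "C": "the regiment will march at dawn and the guns will follow".split(),
}
d = delta_matrix(toy, n_mfw=5)
print(d[("A", "B")], d[("A", "C")])
```

A nearest-profile classifier can be built on top of such a matrix by assigning each test sample to the training profile with the smallest Delta.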
We then classified the documents in the test collection using a Delta classifier trained on the training set with different numbers of lemmas. The most successful classification (13 out of 14 samples identified correctly) occurred when we used the 130 most frequent lemmas. In further experiments we consistently used the 130 most frequent lemmas to calculate Delta distances.

As stylometric tools are sensitive to the size of texts, we only used those characters who speak at least 1000 words over the course of the novel. This initially gave us a list of 16 characters. We then removed two characters speaking predominantly in French, and performed all subsequent experiments with the remaining 14 characters. These were Andrey Bolkonsky, Natasha Rostova, Nikolai Rostov, Pierre Bezukhov, Marya Bolkonskaya, Vasily Kuragin, old prince Nikolai Bolkonsky, countess Natalya Rostova (mother), count Ilya Rostov, Dolokhov, Denisov, Kutuzov, Anna Mikhaylovna Drubetskaya, and Anna Pavlovna Scherer.

We performed stylometric analysis of character speech using the ‘stylo’ package for R, the most widely used Delta implementation.
To visualize Delta distances and character groupings we used multidimensional scaling, principal components analysis and hierarchical clustering.

Burrows J. ‘Delta’: a Measure of Stylistic Difference and a Guide to Likely Authorship. Literary and Linguistic Computing. 2002. v. 17. no 3. pp. 267–287.

Fig. 1. The 14 top speakers of War and Peace in the stylometric space reduced to two dimensions with the help of MDS.

The most obvious division we identified is between the high-society group of Vasily Kuragin, Anna Scherer and A. M. Drubetskaya, and the rest of the characters. This group is visible through the different kinds of multivariate analysis that we applied.

Fig. 2. Hierarchical clustering of 14 characters with Delta distance.

Other characters seem to cluster into the group of main characters and the group of non-evil secondary characters (with the possible exception of Dolokhov, whose overall image and role in the novel is quite complex).

The second method we used was our own homebrew approach specifically designed to capture different features of character speech. The features here, unlike in stylometry, were mostly not connected to the lexical content of the speech.
These were:
1. The share of exclamatory sentences
2. The share of interrogative sentences
3. Punctuation marks to speech ratio
4. Discourse marker frequency (the only lexicon-related feature in this set)
5. Readability, as measured by http://ru.readability.io/

Thus, each character in the second experiment was represented as a 5-dimensional vector. We then used similar methods of multivariate statistics to visualize and compare the results.

Fig. 3. PCA of 14 characters with a set of alternative speech features.

Here we again observed the distinction between the speech of V. Kuragin, A. P. Scherer and A. M. Drubetskaya and that of the rest of the characters. However, in this case the features are interpretable. These characters tend to have speech with low readability and a small share of exclamations and questions.
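As a rough illustration of how feature vectors of this kind can be derived from raw speech, the sketch below computes four of the five features for a toy English example. The sentence splitter, the punctuation-per-word definition and the discourse-marker list are assumptions made for illustration. The readability score is omitted because the study obtained it from an external service.

```python
import re

# Hypothetical discourse-marker list; the study's own list is not given here.
DISCOURSE_MARKERS = {"well", "oh", "ah", "so", "now"}


def speech_features(utterances):
    """Map a character's utterances (list of str) to a small feature vector."""
    text = " ".join(utterances)
    # Naive sentence split on terminal punctuation; the terminator is kept.
    sentences = re.findall(r"[^.!?]+[.!?]+", text)
    words = re.findall(r"\w+", text.lower())
    punctuation = re.findall(r"[^\w\s]", text)
    n_sent = len(sentences) or 1
    n_words = len(words) or 1
    return {
        "exclamatory_share": sum(s.endswith("!") for s in sentences) / n_sent,
        "question_share": sum(s.endswith("?") for s in sentences) / n_sent,
        "punctuation_per_word": len(punctuation) / n_words,
        "discourse_marker_freq": sum(w in DISCOURSE_MARKERS for w in words) / n_words,
    }


features = speech_features(["Oh, how lovely!", "Well, are you coming?", "I shall wait."])
print(features)
```

Because the features live on different scales, they would normally be standardized before PCA or clustering is applied to the resulting vectors.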
Their complete opposite is Natasha Rostova, a character whose speech is highly readable, abundant with discourse markers, punctuation marks, exclamations and questions.

We can also notice that the alternative method produces a rather different and more fine-grained division of characters into groups, as compared to stylometry.
This is also visible in the results of hierarchical clustering (Fig. 4).

Fig. 4. Hierarchical clustering of 14 characters with a set of alternative speech features.

Fig. 5. A combination of PCA and (higher-level) hierarchical clustering for 14 characters with a set of alternative speech features.

This kind of clustering seems to capture a different sort of similarity between characters. For instance, Ilya Rostov and Natasha Rostova represent the extreme of Rostov flamboyance and expressiveness.
Denisov, towards the end of the book, becomes in many ways similar to the old prince Bolkonsky, with whom he clusters: a retired general who blames the officials for his career misfortunes. But the most significant difference seems to be the separation of Andrey Bolkonsky from the rest of the character space.
His speech is much less ‘readable’ (which essentially means longer and more formalized), less expressive and less abrupt (low punctuation ratio) than that of the other protagonists. This result seems very telling, as Tolstoy himself highlights several times that the young prince Bolkonsky speaks ‘dryly’ to people and is ‘reserved’. His obvious dissimilarity with Natasha in our visualization might actually reflect the very difference between the two that caused countess Rostova to fear that ‘Natásha had too much of something, and that because of this she would not be happy’ with Bolkonsky.

The third chapter describes the experiment of character space modeling through markup-based network analysis. As our research in chapter 1 showed, current approaches to literary network extraction fall into two major groups: co-occurrence-based approaches and conversational (dialogue-based) approaches.
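The difference between the two groups can be sketched in a few lines of Python. In this illustration, co-occurrence edges are undirected counts of characters mentioned in the same text segment, while conversational edges are directed counts from speaker to addressee. The segment granularity, the function names and the toy identifiers are assumptions, not details of the actual experiment.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(segments):
    """Count undirected pairs of characters mentioned in the same segment."""
    edges = Counter()
    for mentioned in segments:
        for a, b in combinations(sorted(set(mentioned)), 2):
            edges[(a, b)] += 1
    return edges


def conversation_edges(utterances):
    """Count directed (speaker, addressee) pairs from attributed speech."""
    return Counter((speaker, addressee) for speaker, addressee in utterances)


# Toy data with hypothetical identifiers, not taken from the actual markup.
segments = [
    ["pierre", "andrey", "natasha"],
    ["andrey", "natasha"],
    ["pierre", "kutuzov"],
]
utterances = [("andrey", "natasha"), ("natasha", "andrey"), ("andrey", "natasha")]

print(cooccurrence_edges(segments))
print(conversation_edges(utterances))
```

The co-occurrence network needs only character mentions, whereas the conversational network depends on knowing who speaks to whom, which is the kind of information an explicit speaker and addressee markup provides.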