National Research University Higher School of Economics

as a manuscript

Daniil Skorinkin

SEMANTIC MARKUP OF LITERARY TEXTS FOR QUANTITATIVE SCHOLARSHIP IN PHILOLOGY (ON THE BASIS OF LEO TOLSTOY’S WAR AND PEACE)

PhD Thesis Summary
for the purpose of obtaining academic degree Doctor of Philosophy in Philology and Linguistics HSE

Academic Supervisor: Candidate of Sciences Bonch-Osmolovskaya A.A.

Moscow 2018

Overview

Digital literary studies are a major part of contemporary literary research.

The growing availability of text in digital form and novel methods of electronic text analysis open new frontiers in the study of literary heritage. Aside from extracting grammatical information, contemporary natural language processing tools allow for semantic as well as pragmatic analysis of texts.

Unlike many other branches of today’s digital humanities, computational literary studies have a strong tradition dating back to the non-digital era: since at least the late XIX century scholars have applied quantitative methods to authorship attribution, to establishing the creation dates of texts, and to other forms of what would later be known as ‘computational criticism’. The beginning of the XX century saw the emergence of Russian formalism, with its positivistic tendencies and its desire to use ‘scientific’ methods and to study literary works as formal objects. In the 1950s a wave of structuralism (together with semiotic literary criticism) came into the humanities, following the lead of the structuralist tradition in linguistics but also revisiting Russian formalism.

Both periods saw some remarkable formalized, sometimes even computational, research (e.g. by B. Yarkho and Y. Lotman), despite the scarcity or outright absence of actual computers.

Today our capabilities for working with data have increased manifold. With the development of computational tools, many research operations (especially those involving various kinds of corpus analysis and linguistic statistics) take seconds instead of months. However, an analysis of current research in digital literary studies shows that there are still considerable hindrances that severely restrict the development of this promising field.

For instance, it is still a challenge to extract clean structured data directly from the text.

Various kinds of textual elements relevant for literary scholars differ in how available they are for computational analysis. It is fairly easy, for instance, to count word or n-gram frequencies in a text, and this might be enough for some applications in computational stylistics. At the other extreme, it is not yet possible to produce universal automatic extraction tools (or even a consistent formal model) for the elements of the plot.

Fictional characters are positioned somewhere in between. On the one hand, a character in literature can almost always be traced to a concrete sequence of words (names and name phrases, pronouns, etc.).

On the other hand, measuring and modeling literary characters is much harder than counting word frequencies. Even counting the number of occurrences of a single character in a big text might be a considerable challenge: one needs to account for different names and aliases, anaphoric mentions and so on. Things get even more complicated if one is interested in capturing the actions of a character: speech acts, interactions with other characters and so on. For this reason, a lot of research that attempts computational modeling and analysis of character systems tends to focus on dramatic texts, which are a much easier target for computational processing due to their specific structure.

One could hope for the development of natural language processing algorithms sophisticated enough to extract the necessary information.

A more sustainable and realistic solution, however, would be to use standardized semantic markup. Textual markup adds a machine-readable semantic annotation layer to the text, e.g. all mentions of a character in the text, each identified with a unique ID, or all instances of direct speech. This layer can be automatically converted into structured data, e.g. a table containing all instances of direct speech, with each row associated with the speaker character.
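As a rough illustration of such a conversion, the sketch below turns a small TEI-like fragment into a table of speech instances and a per-character mention count. It is only a sketch under stated assumptions: the element and attribute names (`rs` with `ref`, `said` with `who` and `toWhom`), the character IDs and the sample sentences are invented for the example and are not necessarily the encoding used in the thesis. Real TEI files are also namespaced, which is omitted here for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical TEI-like fragment. Tag names, attributes, IDs and sentences
# are illustrative assumptions, not the actual markup of the thesis.
SAMPLE = """
<div>
  <p><rs ref="#anna">Anna Pavlovna</rs> greeted <rs ref="#vasily">Prince Vasily</rs>.
     <said who="#anna" toWhom="#vasily">Good evening, Prince.</said></p>
  <p><said who="#vasily" toWhom="#anna">Good evening to you.</said>
     replied <rs ref="#vasily">the prince</rs>.</p>
</div>
"""


def speech_table(xml_text):
    """Return one row (dict) per speech instance: speaker, addressee, text."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for said in root.iter("said"):
        rows.append({
            "speaker": said.get("who", "").lstrip("#"),
            "addressee": said.get("toWhom", "").lstrip("#"),
            # Collapse internal whitespace so each utterance is a single line.
            "text": " ".join("".join(said.itertext()).split()),
        })
    return rows


def mention_counts(xml_text):
    """Count mentions per character ID, whatever name or alias was used."""
    root = ET.fromstring(xml_text.strip())
    return Counter(rs.get("ref", "").lstrip("#") for rs in root.iter("rs"))


def dialogue_edges(rows):
    """Count (speaker, addressee) pairs, e.g. as weights for a network."""
    return Counter((r["speaker"], r["addressee"]) for r in rows)


if __name__ == "__main__":
    table = speech_table(SAMPLE)
    for row in table:
        print(row["speaker"], "->", row["addressee"], ":", row["text"])
    print(mention_counts(SAMPLE))
    print(dialogue_edges(table))
```

Because every mention carries the same ID regardless of the surface form ("Prince Vasily", "the prince"), counting occurrences of a character reduces to counting attribute values, and the speaker/addressee pairs can serve directly as weighted edges for network analysis.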

This allows for easy and reproducible quantitative research of character systems and character spaces.

This thesis is dedicated to the creation of a semantic markup layer for Leo Tolstoy’s War and Peace. We then use the produced markup to test several methods of character space modeling and analysis. Thus, the goal of the research is to develop and test a markup-based method of character space analysis in a large work of prose that has a well-developed character system. To reach this goal, we had to fulfil the following objectives:

1. Analyze related work in both non-computational and computational literary scholarship on the modeling and formalization of character.
2. Produce markup for character mentions in Leo Tolstoy’s War and Peace. Connect mentions of a single character to a unique ID.
3. Produce markup for speech instances in Leo Tolstoy’s War and Peace. Connect each speech instance with the speaker and the addressee(s).
4. On the basis of this markup:
   a. Perform quantitative analysis of character idiolects.
   b. Perform analysis of character interactions through network analysis. Compare existing methods of network analysis to demonstrate the difference on well-known material (War and Peace).

The scientific novelty of the work is, firstly, in testing different methods of character space modeling on a single work (a standard practice in computational linguistics that has not been used in this particular field of digital literary studies); secondly, in applying state-of-the-art natural language processing tools to automate a large share of the markup procedure; and thirdly, in introducing new parametric features for fictional characters which became the objects of the study.

The theoretical significance of the thesis consists in comparing various methods of quantitative analysis and modeling of the character space.

This comparison is carried out on well-known material, and the semantic markup, which allows easy reproduction, is freely available to other researchers. The results of the comparison enable us to demonstrate, for each method, the particular feature of the character system that it highlights or ignores. We show some limitations that were not previously taken into account or reported by researchers.

The practical significance of the thesis is, first and foremost, in the creation of freely available semantic markup. As the markup is based on an international standard (TEI/XML), it allows researchers from all over the world to reproduce the work, adjust it to their own needs and build further research upon it. In addition, the network data and visualizations created in the course of this research proved successful as teaching material.

They were used by the author of the thesis and other teachers at the Higher School of Economics lyceum (2017/2018 academic year), during the April crash course in Digital Humanities at Helsinki University (2018), and in the lectures organized by the Higher School of Economics Centre for Digital Humanities.

Public demonstrations of the results.

The main findings of the research were presented at:

● International conference for young philology scholars in Tartu (twice, in 2015 and 2017),
● Dialogue: International conference on computational linguistics and intellectual technologies (twice, in 2015 and 2017),
● Digital Humanities 2015: Annual Conference of the Alliance of Digital Humanities Organizations (Sydney, July 2015),
● Digital Humanities 2016: Annual Conference of the Alliance of Digital Humanities Organizations (Krakow, July 2016),
● TEI Conference and Members’ Meeting 2016 (Vienna, September 2016),
● 6th AIUCD Conference 2017 (Rome, January 2017),
● DH Russia 2017 conference (Krasnoyarsk, September 2017),
● Natural Science Methods in the Digital Humanitarian Environment conference (Perm, May 2018).

The following propositions are submitted for the defense:

1. Modern natural language processing tools are suitable for extracting meaningful information about the character system and storing it in the form of semantic markup.
2. The produced markup allows the analysis of the character system using quantitative methods (frequency analysis, multivariate statistical analysis, correlation analysis, network analysis).
3. The choice of a specific method for analyzing the data obtained from the markup defines the exact properties of the character system that will be reflected in the resulting model.

This thesis consists of an introduction, a main part with three chapters, a conclusion, a bibliography and supplementary materials.

Summary of the main body of the thesis

The first chapter is dedicated to theoretical aspects of building a formal model of character and character system. This chapter includes an analysis of related work and a description of the markup procedure for War and Peace. In the first section of the chapter we describe approaches to the formalization of characters that were developed in the pre-digital era.

