Cluster the resulting vector space by means of hierarchical clustering.
4. Select the three central objects in each cluster and remove the clusters that contain fewer than three objects.

Steps 2-4 are always applied irrespective of the part of speech. Step 1, however, depends on the format of the minimal diagnostic test required for the reference word. We presumed that in order to establish the meaning of adjectives and other one-place predicates it is sufficient to look at the noun that occupies the position of the word’s only actant. Therefore, in the case of adjectives we retrieved the nouns that occur to the right of the reference word in the main subcorpus of the RNC; for verbs we retrieved the nouns in the nominative case that occur either to the left or to the right of the reference word.

The algorithm was developed and tested on four fields of qualitative features (‘sharp’, ‘smooth’, ‘straight’, and ‘thick’) and one verbal field (‘oscillation’).
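The clustering and selection steps can be illustrated with a minimal Python sketch. The summary does not specify the distance measure, the linkage criterion, the cut-off, or how a "central" object is defined, so cosine distance, average linkage, a fixed threshold, and "smallest mean distance to the other members of the cluster" are all assumptions made for illustration. The toy vectors and all function names are hypothetical as well.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def agglomerate(vectors, threshold):
    """Average-linkage agglomerative clustering with a distance cut-off.

    Both the linkage criterion and the cut-off are assumptions; the source
    only says that hierarchical clustering is used.
    """
    clusters = [[name] for name in vectors]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break  # the closest pair is already too far apart
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


def central_objects(cluster, vectors, k=3):
    """The k members with the smallest mean distance to the other members."""
    def mean_distance(name):
        others = [m for m in cluster if m != name]
        return sum(cosine_distance(vectors[name], vectors[m]) for m in others) / len(others)

    return sorted(cluster, key=mean_distance)[:k]


def build_questionnaire(vectors, threshold=0.5, min_size=3):
    """Cluster, drop clusters with fewer than min_size objects, keep three central ones."""
    clusters = agglomerate(vectors, threshold)
    return [central_objects(c, vectors) for c in clusters if len(c) >= min_size]


# Hypothetical context vectors for nouns co-occurring with 'sharp'.
toy_vectors = {
    "knife": (1.0, 0.1, 0.0), "sword": (0.9, 0.2, 0.0),
    "razor": (1.0, 0.0, 0.0), "saw": (0.95, 0.15, 0.0),
    "needle": (0.1, 1.0, 0.0), "awl": (0.2, 0.9, 0.0), "spear": (0.0, 1.0, 0.0),
    "joke": (0.0, 0.0, 1.0),  # ends up alone in its cluster and is discarded
}

questionnaire = build_questionnaire(toy_vectors)
for items in questionnaire:
    print(items)
```

On this toy input the singleton cluster is removed, and each remaining cluster contributes three illustrations. With real corpus vectors the threshold would have to be tuned for each field.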
For each of the fields we evaluated the recall and the precision of the resulting questionnaire. The recall was measured as the proportion of frames that are represented in the questionnaire by at least one illustration; the precision was estimated as the purity of the clusters.

Field            Recall   Precision   F-measure
‘sharp’          0.733    0.827       0.777
‘straight’       1        0.817       0.899
‘smooth’         0.8      0.675       0.732
‘thick’          1        0.884       0.938
‘oscillation’    0.882    0.762       0.818

Table 1. Quantitative evaluation of the algorithm’s performance

Table 1 displays the measures of the algorithm’s performance in each of the tested fields. As can be seen, the overall performance is fairly good, but some of the fields yield much better results than the others.
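The measures in Table 1 can be reproduced with a short Python sketch. The recall function follows the definition given in the text. The purity function uses the standard majority-label definition of cluster purity, which is an assumption, since the summary does not spell out the formula. The F-measure values in the table are consistent with the harmonic mean of recall and precision, which is what the sketch computes. Function names are hypothetical.

```python
from collections import Counter


def recall(frames_in_field, frames_illustrated):
    """Share of the field's frames illustrated by at least one questionnaire item."""
    field = set(frames_in_field)
    return len(field & set(frames_illustrated)) / len(field)


def purity(clusters):
    """Cluster purity: the share of items that carry their cluster's majority label.

    Each cluster is a list of gold frame labels (assumed standard definition).
    """
    total = sum(len(cluster) for cluster in clusters)
    majority = sum(Counter(cluster).most_common(1)[0][1] for cluster in clusters)
    return majority / total


def f_measure(r, p):
    """Harmonic mean of recall and precision."""
    return 2 * r * p / (r + p)


# (recall, precision, reported F-measure) as given in Table 1.
table_1 = {
    "sharp": (0.733, 0.827, 0.777),
    "straight": (1.0, 0.817, 0.899),
    "smooth": (0.8, 0.675, 0.732),
    "thick": (1.0, 0.884, 0.938),
    "oscillation": (0.882, 0.762, 0.818),
}

for field, (r, p, reported) in table_1.items():
    print(f"{field}: computed {f_measure(r, p):.3f}, reported {reported}")
```

The computed values agree with the reported ones to three decimal places.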
This discrepancy can be accounted for by a number of factors.

The first factor is the frequency of the reference lexeme; higher frequency (i.e. a greater number of occurrences of the lemma in the corpus) results in better performance. The low F-measure for the field of oscillation can be attributed to the low frequency of its lexemes and the resulting inadequacy of the vector representations and clustering.

Secondly, the quality of performance depends on the number of frames in the field. Fields are better clustered into semantically homogeneous groups when they contain fewer frames. This explains the high quality of clustering of the field ‘straight’: it contains seven frames, and each frame is represented by a large number of contexts.

Thirdly, the quality of a questionnaire is affected by the nature of the oppositions that organise the semantic structure of the field.
The suggested method for automated questionnaire generation groups contexts according to their taxonomic classes. For example, the Russian words potomok (‘descendant’) and predšestvennik (‘predecessor’) from the class of human beings are placed into one cluster of the field ‘straight’, while alleja (‘parkway’) and dorožka (‘pathway’), belonging to the class of extended areas, are assigned to another cluster. In most cases this leads to the desired partitioning of contexts into frames. However, it is not always the case that frames in a field are juxtaposed to each other in accordance with the taxonomic classification of nouns; in some cases it is the topology of the object that matters. For example, the two frames of the field ‘sharp’ – ‘instrument with a sharp functional edge’ (e.g.
a knife or a sword) and ‘instrument with a sharp functional end-point’ (e.g. a needle or a bradawl) – belong to the same taxonomic class of nouns (instruments), but differ in the topological characteristics of the objects they describe: the first one is characterised by a linear segment, while the second one is characterised by a point-like segment. Differences of this kind are captured by the algorithm much less reliably.

The factors enumerated above are not equally relevant. For example, the algorithm delivered the best performance for the field ‘thick’ despite the fact that this field is essentially structured according to the topological classification of objects. This effect may be due to the high frequency of the adjectives of this field and the low number of frames in its semantic structure.
Besides, the topological and the taxonomic classifications of nouns often correlate with each other, thus contributing to the purity of clustering. For instance, many body parts belong to the topological class of elongated objects (e.g. thick fingers, arms, or legs), while pieces of clothing often belong to the topological class of flexible layers (e.g. a thick jacket, coat, or sweater).

Chapter 4 “Methods for automated data collection” describes the approaches that facilitate collection of data; two tasks are addressed at this stage: (1) translating the minimal contexts from the questionnaire; and (2) filling in the questionnaire with data from the relevant languages.
We experimented with the fields of qualitative features (‘sharp’, ‘smooth’, ‘thick’, and ‘thin’); therefore, task (1) consisted in translating the list of adjectives belonging to the field along with the list of nouns that may potentially co-occur with them.

Translation of adjectives is no trivial matter. Traditionally, the task of translation (including machine translation) is regarded as either matching a context with the most suitable lexeme, or finding the most frequent / accurate translation equivalent of a lexeme, or generating the best equivalents for each of the meanings of the original word. Our task is different from those above; we need to obtain the adjectives that translate the original words only in the contexts that correspond to their direct meanings.
For example, among the English translation equivalents of the Russian word ostryj we would like to see the adjectives sharp and pointed but not critical or urgent (cf. ostraja nexvatka (‘critical shortage’), or ostryj vopros (‘urgent matter’)).

We tested several algorithms (which are described in detail in the main text of the thesis) and opted for the method based on the machine-readable dictionaries of the FreeDict group. The advantage of these dictionaries is that translations are aligned with word meanings; our algorithm picks the translation equivalents of only the first meanings and then double-checks them by means of back-translation, i.e. by translating the candidate back into the source language.
The candidate is added to the final list only if the adjective that corresponds to its first meaning is contained in the initial list. Nouns are translated in a similar manner, with a slight modification: if a noun is not present in the FreeDict dictionary, it is translated with a machine-readable dictionary provided by Yandex.

After that the questionnaire is converted into a tabular format where columns are headed by adjectives and rows are headed by nouns. The table is filled with data from the available corpora: if an adjective co-occurs with a noun in a corpus, we compute the mutual information for this pair. Combinations with a negative mutual information value are considered random and are excluded from the final questionnaire.

Chapter 5 “Automated generation of semantic maps with formal concept lattices” describes the methods used to automate the final stage of the analysis.
Special focus is placed on the theory of formal concept analysis (Ganter, Wille 1999), which introduces a special kind of diagrams known as formal concept lattices (FCLs). We maintain that such diagrams can be used in linguistic research as a new type of semantic maps.

FCLs are based on the so-called formal contexts. A formal context K = (G, M, I) consists of a set of objects (G), a set of attributes (M), and a binary relation (I) between the objects and the attributes.
A formal concept is a pair (A, B) where A is a subset of G and B is a subset of M such that B contains all the attributes that characterise the objects in A, and A contains all the objects that feature the attributes from B within a given formal context. FCLs represent data as a hierarchy of formal concepts where concepts are ordered from more generic to less generic (those that cover a smaller number of objects).

In our case, the objects are represented by lexemes, and attributes are represented by frames.
A lexeme and a frame form an incident pair if the lexeme covers the frame. We experimented with ten fields of qualitative features (‘sharp’, ‘soft’, ‘smooth’, ‘rough’, ‘hard’, ‘empty’, ‘thick’, ‘thin’, ‘high’ and ‘low’) and with the verbal field of falling.

To the best of our knowledge, this method has rarely been used in linguistics before (Priss 2005 is one of the few exceptions); our experiments demarcated the limits of its applicability to lexical typological research. We established that FCLs can be used as is, without any modifications, to represent fields with a linear structure, i.e. those that form a chain-like configuration in conventional semantic maps (Frame 1 - Frame 2 - Frame 3). In this case, the arrangement of the nodes in a lattice corresponds to the arrangement of frames in a conventional semantic map; however, lattices are generated automatically while conventional semantic maps are designed manually.

FCLs go beyond automating the design of maps: they open up new opportunities for linguistic research.
Firstly, the hierarchical arrangement of nodes in a lattice makes it possible to show in one diagram all of the lexicalization strategies that are available for a given field (in contrast to the traditional semantic mapping technique where convergences of frames are depicted separately for each language), see Fig. 6. This, in turn, considerably simplifies the typological analysis: it appears that some of the combinations that are considered admissible in a conventional semantic map are never or very rarely realised, while others are highly frequent. For instance, the FCL for the field ‘sharp’ indicates the two most frequent strategies in our sample: the dominant, in which all the major frames are covered by one lexeme, and the binary, in which one lexeme denotes instruments with a sharp functional edge (knives, saws, razors, etc.) and the other lexeme covers instruments with a sharp functional end-point (spears, arrows, etc.) and elongated objects (a nose, the toes of a boot, etc.).
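The formal concepts that make up such a lattice can be enumerated directly from the definition of a formal context. The Python sketch below does so by brute force for a toy context in which objects are lexemes and attributes are frames of the field ‘sharp’. The three lexemes are hypothetical placeholders that mirror the dominant and the binary strategies described above; they are not data from the thesis sample. The function names are likewise illustrative, and the exhaustive enumeration is only practical for small contexts.

```python
from itertools import combinations


def common_attributes(objects, context, all_attributes):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    shared = set(all_attributes)
    for obj in objects:
        shared &= context[obj]
    return frozenset(shared)


def objects_having(attributes, context):
    """Objects that feature every attribute in `attributes`."""
    return frozenset(obj for obj, attrs in context.items() if attributes <= attrs)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every subset of objects.

    Brute force: exponential in the number of objects, fine for a toy example.
    """
    all_attributes = set().union(*context.values())
    objects = list(context)
    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(subset, context, all_attributes)
            extent = objects_having(intent, context)
            concepts.add((extent, intent))
    return concepts


# Hypothetical context: lexeme -> set of frames it covers.
toy_context = {
    "lexeme_A": {"edge", "end-point", "elongated"},  # dominant strategy
    "lexeme_B": {"edge"},                            # binary strategy, first lexeme
    "lexeme_C": {"end-point", "elongated"},          # binary strategy, second lexeme
}

concepts = formal_concepts(toy_context)
# List concepts from more generic (larger extent) to less generic.
for extent, intent in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
```

Each printed pair corresponds to one node of the lattice. The ordering by extent size follows the "more generic to less generic" arrangement mentioned in the text. Drawing the lattice itself would additionally require computing the cover relation between concepts, which is omitted here.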