Corpus Linguistics

SynTagRus (Syntactically Tagged Russian text corpus) is a deeply annotated corpus of Russian texts that has been under development at the Laboratory for a number of years and forms an important autonomous part of the Russian National Corpus. As of the beginning of 2023, it contains over 1.5 million words (around 107 thousand sentences). The corpus is a collection of texts by different authors and in different genres, in which each sentence is assigned a detailed syntactic structure in the form of a dependency tree. The corpus also contains other types of annotation: lexical-semantic (for ambiguous words, the meaning realized in the text is specified), lexical-functional (expressions that can be interpreted in terms of lexical functions are identified), anaphoric (antecedents of pronouns are marked), microsyntactic (syntactically sensitive phraseological units are identified), and temporal (words and expressions with temporal meaning are marked).
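
To make the notion of a dependency tree concrete, here is a minimal Python sketch of how one annotated sentence might be represented. It is an illustration only: the `Token` class, the helper functions, the English example sentence, and the relation labels are all hypothetical and are not taken from SynTagRus or its actual file format.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int     # 1-based position of the word in the sentence
    form: str      # word form as it appears in the text
    head: int      # index of the governing word; 0 marks the root
    relation: str  # dependency label (illustrative, not the corpus's own set)


def children(tokens, head_index):
    """Return the tokens directly governed by the token at head_index."""
    return [t for t in tokens if t.head == head_index]


def is_tree(tokens):
    """Check that there is exactly one root and every token reaches it."""
    by_index = {t.index: t for t in tokens}
    roots = [t for t in tokens if t.head == 0]
    if len(roots) != 1:
        return False
    for t in tokens:
        seen = set()
        current = t
        # Walk up the chain of heads; a repeat or a missing head breaks the tree.
        while current.head != 0:
            if current.index in seen or current.head not in by_index:
                return False
            seen.add(current.index)
            current = by_index[current.head]
    return True


# Hypothetical example sentence: "The corpus contains texts"
sentence = [
    Token(1, "The", 2, "det"),
    Token(2, "corpus", 3, "subject"),
    Token(3, "contains", 0, "root"),
    Token(4, "texts", 3, "object"),
]

print(is_tree(sentence))
print([t.form for t in children(sentence, 3)])
```

Each word stores a single pointer to its head, so the whole structure is a flat list that can still be traversed as a tree. The other annotation layers described above could, under the same assumptions, be attached as extra fields on the tokens or as spans over token indices.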