The Russian National Corpus is a representative collection of Russian texts containing more than 2 billion tokens, equipped with linguistic annotation and search tools.
Search in corpora
News
The Syntactic Corpus has enhanced its search functionality for microsyntactic constructions. When the cursor is placed in the search field, a complete list of over 3,200 units appears. By entering a letter or sequence of letters, users can retrieve all units that contain them.
In the browser version of the site, users can view examples of constructions containing variables: these are displayed when hovering the mouse cursor over the name of a microsyntactic construction.
Search across multiple microsyntactic units is available using the logical OR operator (|). After the first microsyntactic construction has been selected, placing the cursor back in the search field automatically adds the disjunction sign to the query. The logical AND operator (&) is not supported in this field. To find words that belong to several constructions at once, set conditions on several consecutive words with a distance of 0 between them. The asterisk operator (*) still works and finds words that belong to any microsyntactic construction.
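The search-field behavior described above can be illustrated with a minimal Python sketch. It is not the Corpus's actual implementation: the function names, the four sample units, and the matching logic are all assumptions made for illustration, and the real list contains over 3,200 units.

```python
# Hypothetical sample of unit names; the real list has over 3,200 entries.
UNITS = ["в течение", "в силу", "по мере", "так как"]


def suggest(fragment: str, units: list[str] = UNITS) -> list[str]:
    """Return every unit whose name contains the typed letters (case-insensitive)."""
    needle = fragment.casefold()
    return [unit for unit in units if needle in unit.casefold()]


def build_query(selected: list[str]) -> str:
    """Join the selected units with the OR sign; AND (&) is not supported here."""
    return "|".join(selected)


def matches(query: str, word_units: set[str]) -> bool:
    """Check one word against a query.

    word_units is the set of constructions the word belongs to.
    '*' matches a word that belongs to any construction at all.
    """
    if query == "*":
        return bool(word_units)
    return any(alternative in word_units for alternative in query.split("|"))


query = build_query(["в течение", "в силу"])
print(suggest("си"))                 # units containing the letters "си"
print(query)                         # the disjunctive query string
print(matches(query, {"в силу"}))    # word belongs to one of the alternatives
print(matches("*", set()))           # word belongs to no construction
```

In this sketch a disjunctive query succeeds as soon as the word belongs to one of the listed constructions, which mirrors the OR semantics of the search field.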
The Russian National Corpus team will present two papers at ACL 2025, the largest computational linguistics conference!
In the main conference track, we will present the paper “BERT-like Models for Slavic Morpheme Segmentation”. In this work, we fine-tuned BERT-like models to perform morpheme segmentation for three Slavic languages: Russian, Belarusian, and Czech. Our proposed algorithm outperformed existing approaches for Russian and Czech: the number of annotation errors fell by a factor of 1.5 to 2, especially for roots not seen in the training data. By the way, the updated morphological annotations available in the Main Corpus were generated using this very algorithm!
At the Slavic NLP 2025 workshop, we will present a study focused on improving our lemmatization model. Despite the already high accuracy of automatically assigned lemmas (98.8% correct analyses on the test set), we continue working to eliminate the remaining errors. After all, given the current size of the RNC, even a 0.1% error rate results in several million incorrect lemmatizations. By combining the Rubic model with a fine-tuned BART model in an ensemble, we were able to further improve quality and surpass 99% lemma accuracy, with notable improvements in the lemmatization of proper names and certain abbreviations.
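To make the scale argument concrete, here is a small back-of-the-envelope calculation in Python. It assumes a corpus of exactly 2 billion tokens (the lower bound stated for the RNC) and, as a simplification, that error rates measured on the test set carry over to the whole corpus; the function name is illustrative only.

```python
def expected_errors(tokens: int, errors_per_10k: int) -> int:
    """Expected number of wrong lemmas; integer arithmetic avoids float rounding."""
    return tokens * errors_per_10k // 10_000


# Assumption: lower bound taken from "more than 2 billion tokens".
CORPUS_TOKENS = 2_000_000_000

print(expected_errors(CORPUS_TOKENS, 10))   # 0.1% error rate
print(expected_errors(CORPUS_TOKENS, 120))  # 1.2% error rate (98.8% accuracy)
print(expected_errors(CORPUS_TOKENS, 100))  # 1.0% error rate (99% accuracy)
```

Under these assumptions, a 0.1% error rate corresponds to 2 million tokens, and moving from 98.8% to 99% accuracy removes about 4 million incorrect lemmas.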
We are actively developing methods for linguistic annotation of texts. Most of the models we’ve developed are available on the corresponding page of the Corpus website.
In February, we introduced a new “Definitions” widget in Word at a Glance for 5,500 words. Automatically generated definitions are now available in the Word at a Glance feature of the Main Corpus for approximately 96,000 words, which makes the RNC a much more powerful reference tool. Definitions are provided for nouns, adjectives, verbs, and adverbs represented in the corpus.
Definitions cover both commonly used words and neologisms, such as кидалт 'kidult' and байопик 'biopic', both recently borrowed into Russian. In creating the definitions, we followed four key principles: accuracy, accessibility for middle school students, grammaticality, and conciseness.
Experiments in automatic definition generation were conducted with the support of the Yandex Cloud Center for Social Technologies. The feature is currently available in beta to logged-in users. We invite you to leave feedback using the “Rate” button; this will help us improve the quality of the definitions.