Romanian Language Resources

Disclaimer: This is not an exhaustive list of Romanian language resources; on the contrary, it is a very small selection. Users looking for more resources are strongly advised to consult international repositories such as ELRC-SHARE and the European Language Grid (ELG). Direct queries on these repositories for Romanian resources can be accessed here:

Romanian Named Entity Recognition in the Legal domain (LegalNERo) [ Download ]

LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time expressions and legal resources mentioned in legal documents. Additionally, it offers GEONAMES codes for the named entities annotated as locations (where a link could be established). The LegalNERo corpus is available in three formats: span-based, token-based and RDF. The Linguistic Linked Open Data (LLOD) version is provided in RDF-Turtle format.
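To illustrate how a span-based annotation relates to a token-based one, here is a minimal sketch that converts character-offset spans into per-token BIO tags. The regex tokenizer, the label names (ORG, LOC), the sample sentence and the helper names are assumptions made for illustration; they do not reproduce the actual file formats or label set of the corpus.

```python
import re

def span_of(text, phrase, label):
    """Build a (start, end, label) character span for the first match of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

def spans_to_bio(text, spans):
    """Convert character spans to (token, BIO tag) pairs.

    Tokens are found with a simple regex: runs of word characters or
    single punctuation marks (an assumed tokenization).
    """
    tagged = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            if match.start() >= start and match.end() <= end:
                # The first token of a span gets B-, later tokens get I-
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged

# Hypothetical sentence and annotations, for illustration only
text = "Guvernul României se întrunește la București."
spans = [
    span_of(text, "Guvernul României", "ORG"),
    span_of(text, "București", "LOC"),
]
tags = spans_to_bio(text, spans)
for token, tag in tags:
    print(f"{token}\t{tag}")
```

The span-based view keeps the original text untouched, while the token-based view is what most sequence taggers consume directly.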

ROBIN Technical Acquisition Speech Corpus (RTASC) [ Download ]

The ROBIN Technical Acquisition Speech Corpus (ROBINTASC) was developed within the ROBIN project. Its main purpose was to improve the behaviour of a conversational agent that allows human-machine interaction in the context of purchasing technical equipment. It contains over 6 hours of read speech in Romanian. We provide text files, associated speech files (WAV, 44.1 kHz, 16-bit, single channel), and annotated text files in CoNLL-U format.
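As a quick sanity check on audio files, the sketch below reads a WAV header with Python's standard `wave` module and compares it with the stated parameters (44.1 kHz, 16-bit, single channel). The function names are hypothetical, and the demo generates one second of silence in memory rather than reading a real corpus file.

```python
import io
import wave

def describe_wav(source):
    """Return basic header information for a WAV file (path or file object)."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": wav.getnframes() / wav.getframerate(),
        }

def matches_stated_format(info):
    """Check for single channel, 44.1 kHz, 16-bit audio."""
    return (
        info["channels"] == 1
        and info["sample_rate_hz"] == 44100
        and info["bits_per_sample"] == 16
    )

# Demo: build one second of silent audio in memory (not corpus data)
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)        # single channel
    out.setsampwidth(2)        # 2 bytes = 16 bits per sample
    out.setframerate(44100)    # 44.1 kHz
    out.writeframes(b"\x00\x00" * 44100)
buffer.seek(0)

info = describe_wav(buffer)
print(info, matches_stated_format(info))
```

For a file on disk, passing its path to `describe_wav` works the same way.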

ROTBL [ Download ]

ROTBL is a manually curated lexicon of written forms of Romanian. Each word form is paired with its possible POS tags (called Morpho-Syntactic Descriptors, see for reference) and, for each possible POS tag, its lemma. The lexicon therefore consists of triplets of the form "word form<TAB>lemma<TAB>POS tag", where the fields are separated by the TAB character. It is currently maintained with the latest text processing platform for Romanian, RODNA, and it is available here:
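The triplet layout lends itself to a simple lookup table. The following is a minimal sketch, assuming one "word form<TAB>lemma<TAB>POS tag" triplet per line as described above; the function name is hypothetical and the sample entries are made up for illustration, so they may not match the actual lexicon contents.

```python
from collections import defaultdict

def load_lexicon(lines):
    """Map each word form to a list of (lemma, POS tag) analyses."""
    lexicon = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        word_form, lemma, pos_tag = line.split("\t")
        lexicon[word_form].append((lemma, pos_tag))
    return lexicon

# Illustrative entries only; the tags are examples, not checked against ROTBL
sample = [
    "copii\tcopil\tNcmp-n\n",
    "copii\tcopie\tNcfp-n\n",
    "casa\tcasă\tNcfsry\n",
]

lexicon = load_lexicon(sample)
for lemma, pos_tag in lexicon["copii"]:
    print(lemma, pos_tag)
```

With a real file, the same function can be fed an open file handle, for example `load_lexicon(open(path, encoding="utf-8"))`, where the path and encoding are assumptions to verify against the distribution.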

SemCor EN-RO

SemCor (Miller et al., 1993) is a word sense disambiguation (WSD) corpus created from the Brown corpus (Kucera and Francis, 1967), in which every content word (noun, verb, adjective or adverb) is annotated with its sense label from Princeton WordNet 1.6. The parallel English-Romanian corpus was developed by Ion (2007). It covers a subset of the original SemCor that was professionally translated into Romanian specifically for WSD experiments, with the sense annotations updated to Princeton WordNet 2.0. The sense annotations were automatically transferred into Romanian by means of English-Romanian word alignment. Ion (2007) provides a detailed account of this operation.
Radu Ion. 2007. Word sense disambiguation methods applied to English and Romanian. PhD Thesis (in Romanian), Romanian Academy, Bucharest, May 2007.
Henry Kucera and Nelson W. Francis. 1967. Computational analysis of present-day American English. Brown University Press, Providence, Rhode Island.
George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the 3rd DARPA Workshop on Human Language Technology, pages 303-308, Plainsboro, New Jersey.
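
The general idea of transferring sense annotations through word alignment can be pictured with the minimal sketch below. It is not the procedure of Ion (2007), which is documented in the thesis; the function name, the alignment representation as index pairs, and the placeholder sense labels are all assumptions for illustration.

```python
def transfer_senses(en_senses, alignment, ro_tokens):
    """Project sense labels from English tokens onto aligned Romanian tokens.

    en_senses: dict mapping an English token index to its sense label
    alignment: list of (english_index, romanian_index) pairs
    ro_tokens: list of Romanian tokens
    """
    ro_senses = [None] * len(ro_tokens)
    for en_idx, ro_idx in alignment:
        sense = en_senses.get(en_idx)
        # Only annotated (content) words carry a label; keep the first one found
        if sense is not None and ro_senses[ro_idx] is None:
            ro_senses[ro_idx] = sense
    return list(zip(ro_tokens, ro_senses))

# Toy sentence pair; the labels are placeholders, not real WordNet identifiers
en_tokens = ["The", "dog", "barks"]
ro_tokens = ["Câinele", "latră"]
en_senses = {1: "SENSE_dog", 2: "SENSE_bark"}
alignment = [(0, 0), (1, 0), (2, 1)]

projected = transfer_senses(en_senses, alignment, ro_tokens)
print(projected)
```

Here the function word "The" carries no sense label, so the Romanian definite noun receives only the label of "dog". Real alignments are noisier, which is one reason a detailed account of the operation matters.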

Newspapers, Orwell's "1984" and Plato's "Republica"

These three corpora were developed for, and used in, the tiered tagging experiments on Romanian POS tagging with the detailed MSD tagset (Morpho-Syntactic Descriptors, see for reference). They contain MSD-annotated tokens (words and punctuation) and have been manually checked for annotation errors. Thus, they can be used to train POS taggers for Romanian. The Romanian part of the parallel corpus Orwell's "1984" was developed during the MULTEXT-East project. A description of the corpus can be found here:
The corpora have been described (including statistics on the corpora size) in the following paper: Dan Tufiș. 2000. Using a Large Set of EAGLES-compliant Morpho-Syntactic Descriptors as a Tagset for Probabilistic Tagging. In Proceedings of LREC 2000. Available online at