Pre-Trained Language Models
Annotation models for lemmatization, UPOS, XPOS and dependency parsing (where supported), trained on RRT UD 2.7. These models were evaluated in: Păiș, Vasile; Ion, Radu; Avram, Andrei-Marius; Mitrofan, Maria; Tufiș, Dan. "In-depth evaluation of Romanian natural language processing pipelines." Romanian Journal of Information Science and Technology (ROMJIST), vol. 24, no. 4, pp. 384-401, 2021. The article can be accessed here.
- Stanza Download (756 MB)
- RNNTagger Download (668 MB)
- NLP-Cube Download (345 MB)
- UDPipe Download (13 MB)
- TreeTagger Download (1.4 MB)
- The scripts used to train and evaluate the models are available in our GitHub repository here.
- A working version of the TTL tool is available in the TEPROLIN service repository.
- To download the corpus, visit the Universal Dependencies website or download the UD 2.7 treebanks directly from http://hdl.handle.net/11234/1-3424.
- PyEuroVoc - Classification of legal documents using EuroVoc descriptors, based on BERT models, for 22 languages (Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Spanish, Slovak, Slovene, Swedish). A GitHub repo with scripts and example usage is available here. The related paper is: Avram, Andrei-Marius; Păiș, Vasile; Tufiș, Dan. "PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors." In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 2021.
- FastText EuroVoc classification models, based on Common Crawl FastText embeddings for most languages and on CoRoLa embeddings for Romanian. Models for multiple languages can be downloaded here. A modified FastText application that allows the models to be queried online is available here.
- Romanian BERT: Two models are available, bert-base-romanian-cased-v1 and bert-base-romanian-uncased-v1. A GitHub repo with useful scripts is available here. The related paper is: Dumitrescu, Stefan; Avram, Andrei-Marius; Pyysalo, Sampo. "The birth of Romanian BERT." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 4324-4328, 2020.
- Romanian DistilBERT: Distilled from the bert-base-romanian-cased-v1 model, it is available on HuggingFace as distilbert-base-romanian-cased. A GitHub repo is available here.
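The annotation pipelines above all produce output in the CoNLL-U format used by the UD treebanks, with one token per line and ten tab-separated columns. As a minimal sketch of how to read the lemma, UPOS, XPOS and head columns from such output (the sample sentence, its annotations, and the helper function are ours for illustration, not the output of any specific tool):

```python
# Toy CoNLL-U fragment for "Pisica doarme." ("The cat sleeps.").
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
SAMPLE = (
    "1\tPisica\tpisică\tNOUN\tNcfsry\t_\t2\tnsubj\t_\t_\n"
    "2\tdoarme\tdormi\tVERB\tVmip3s\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\tPERIOD\t_\t2\tpunct\t_\t_\n"
)

def parse_conllu(block):
    """Yield (form, lemma, upos, xpos, head) tuples from one sentence block."""
    for line in block.splitlines():
        if not line or line.startswith("#"):  # skip comments and blank lines
            continue
        cols = line.split("\t")
        yield cols[1], cols[2], cols[3], cols[4], int(cols[6])

tokens = list(parse_conllu(SAMPLE))
print(tokens[0])  # ('Pisica', 'pisică', 'NOUN', 'Ncfsry', 2)
```

The same reader works on the output of any of the pipelines, since CoNLL-U fixes the column order; only the FEATS and MISC columns (elided here as `_`) vary in richness between tools.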
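EuroVoc classification, as in PyEuroVoc, is a multi-label problem: a document can receive several descriptors at once, so each descriptor is typically scored independently and kept if its sigmoid probability clears a threshold. The sketch below shows only this generic selection step; the descriptor names, logit values and threshold are invented and do not reflect PyEuroVoc's actual API or vocabulary:

```python
import math

# Invented per-descriptor logits, as a classifier head might produce them.
LOGITS = {"taxation": 2.1, "customs": -0.4, "agriculture": -3.0}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def assign(logits, threshold=0.5):
    """Keep every descriptor whose independent sigmoid score >= threshold."""
    return sorted(d for d, z in logits.items() if sigmoid(z) >= threshold)

print(assign(LOGITS))  # ['taxation']
```

Lowering the threshold trades precision for recall, which matters in legal indexing where missing a relevant descriptor is often costlier than proposing an extra one.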
Word Embeddings from the CoRoLa project
- All word embeddings from the CoRoLa project can be downloaded and used interactively here.
- The recommended model, according to a number of experiments, can be downloaded directly from here.
Biomedical Word Embeddings
These word embeddings were trained on the BioRo corpus (Mitrofan, Maria; Tufiș, Dan. "BioRo: The Biomedical Corpus for the Romanian Language." In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pp. 1192-1196, 2018). The models have dimension 300 and were trained using FastText. Three variants are available, keeping only words occurring at least 1, 5, or 20 times, respectively.
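FastText writes embeddings in the standard word2vec text format (`.vec`): a header line with the vocabulary size and dimension, then one word per line followed by its vector components. A minimal loader and cosine-similarity query could look like the sketch below; the tiny inline "file" and its 3-dimensional vectors are made up for illustration (the real models are 300-dimensional):

```python
import io
import math

# Toy stand-in for a .vec file: "<vocab_size> <dim>" header, then word + vector.
VEC_TEXT = (
    "3 3\n"
    "celulă 0.1 0.2 0.3\n"
    "țesut 0.1 0.25 0.28\n"
    "masă 0.9 -0.1 0.0\n"
)

def load_vec(fh):
    """Read word2vec-text embeddings into a {word: [float, ...]} dict."""
    n_words, dim = map(int, fh.readline().split())
    vecs = {}
    for line in fh:
        word, *vals = line.rstrip("\n").split(" ")
        vecs[word] = [float(v) for v in vals]
    return vecs

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

vecs = load_vec(io.StringIO(VEC_TEXT))
# 'celulă' (cell) should be closer to 'țesut' (tissue) than to 'masă' (table).
print(cosine(vecs["celulă"], vecs["țesut"]) > cosine(vecs["celulă"], vecs["masă"]))  # True
```

For the real files, replace the `io.StringIO` wrapper with `open(path, encoding="utf-8")`; the higher minimum-count variants trade vocabulary coverage for less noisy vectors of rare words.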