Pre-Trained Language Models

Annotation models for lemmatization, UPOS, XPOS and dependency parsing (where supported), trained on the RRT UD 2.7 treebank.
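UD treebanks such as RRT distribute annotations in the tab-separated CoNLL-U format, so tool output for lemma, UPOS, XPOS and dependencies is typically consumed one token line at a time. A minimal reading sketch (the annotated Romanian token below is an illustrative example, not output from these models):

```python
# Minimal sketch: reading one CoNLL-U token line, the format used by UD
# treebanks such as RRT. Field order per the CoNLL-U specification:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]

def parse_conllu_line(line: str) -> dict:
    """Split a tab-separated CoNLL-U token line into named fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != 10:
        raise ValueError(f"expected 10 tab-separated fields, got {len(values)}")
    return dict(zip(FIELDS, values))

# Illustrative token line for the Romanian word "cartea" ("the book"):
token = parse_conllu_line("2\tcartea\tcarte\tNOUN\tNcfsry\tDefinite=Def\t1\tobj\t_\t_")
print(token["lemma"], token["upos"], token["deprel"])  # carte NOUN obj
```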



Classification models

  • PyEuroVoc - Classification of legal documents using EuroVoc descriptors, based on BERT models, for 22 languages (Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Spanish, Slovak, Slovene, Swedish). A GitHub repo with scripts and example usage is available here. The related paper is Avram Andrei-Marius, Vasile Păiș, and Dan Tufiș. "PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors." In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 2021.
  • FastText EuroVoc classification models, based on Common Crawl FastText embeddings for most languages and on CoRoLa embeddings for Romanian. Models for multiple languages can be downloaded here. A modified FastText application that allows the models to be queried online is available here.
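EuroVoc classification is multi-label: one legal document usually receives several descriptors. A hedged sketch of the final selection step, assuming a classifier (BERT- or FastText-based) has already produced per-descriptor scores; the descriptor IDs, labels and probabilities below are invented for illustration:

```python
# Sketch of the label-selection step in multi-label EuroVoc classification:
# given per-descriptor probabilities from a classifier, keep every descriptor
# above a confidence threshold. The IDs and scores here are illustrative.

def select_descriptors(scores: dict, threshold: float = 0.5, top_k: int = 6) -> list:
    """Return up to top_k descriptor IDs scoring at or above the threshold,
    ordered from most to least confident."""
    kept = [(label, s) for label, s in scores.items() if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in kept[:top_k]]

scores = {"1086 (taxation)": 0.91, "889 (state aid)": 0.67, "2711 (fisheries)": 0.12}
print(select_descriptors(scores))  # ['1086 (taxation)', '889 (state aid)']
```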


Contextualized embeddings

  • RoBERT: Two models are available: bert-base-romanian-cased-v1 and bert-base-romanian-uncased-v1. A GitHub repo with useful scripts is available here. The related paper is Dumitrescu Stefan, Andrei-Marius Avram, and Sampo Pyysalo. "The birth of Romanian BERT." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 4324-4328, 2020.
  • Romanian DistilBERT: Distilled from the bert-base-romanian-cased-v1 model, it is available on HuggingFace as distilbert-base-romanian-cased. A GitHub repo is available here.
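BERT-style models such as RoBERT and Romanian DistilBERT return one contextual vector per subword token; a common way to obtain a single sentence vector is mean pooling over those token vectors. A minimal sketch with toy 4-dimensional vectors standing in for the real 768-dimensional model outputs:

```python
# Sketch: mean pooling over per-token contextual embeddings to obtain one
# sentence vector, a common way to use BERT-style models such as RoBERT.
# The 4-dimensional toy vectors below stand in for real 768-dim outputs.

def mean_pool(token_vectors: list) -> list:
    """Average a list of equal-length token vectors component-wise."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

tokens = [[1.0, 0.0, 2.0, 0.0],
          [3.0, 2.0, 0.0, 4.0]]
print(mean_pool(tokens))  # [2.0, 1.0, 1.0, 2.0]
```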


Word Embeddings from the CoRoLa project

  • All word embeddings from the CoRoLa project can be downloaded and used interactively here.
  • The recommended model, according to a number of experiments, can be downloaded directly from here.
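Static word embeddings such as the CoRoLa vectors are typically queried by cosine similarity, ranking the words whose vectors lie closest to a query word's vector. A self-contained sketch with tiny invented 3-dimensional vectors (real CoRoLa vectors have far higher dimensionality):

```python
import math

# Sketch: ranking words by cosine similarity to a query word, the typical way
# static embeddings such as the CoRoLa vectors are used. The tiny 3-dim
# vectors below are invented for illustration.
EMBEDDINGS = {
    "carte":  [0.9, 0.1, 0.0],   # "book"
    "volum":  [0.8, 0.2, 0.1],   # "volume"
    "masina": [0.0, 0.9, 0.4],   # "car"
}

def cosine(u: list, v: list) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(word: str, embeddings: dict, top_n: int = 2) -> list:
    """Words ranked by decreasing cosine similarity to the query word."""
    query = embeddings[word]
    others = [(w, cosine(query, v)) for w, v in embeddings.items() if w != word]
    others.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in others[:top_n]]

print(most_similar("carte", EMBEDDINGS))  # ['volum', 'masina']
```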