Pre-Trained Language Models
Annotation models for lemmatization, UPOS, XPOS and dependency parsing (where supported), trained on RRT UD 2.7. These models were evaluated in: Păiș, Vasile; Ion, Radu; Avram, Andrei-Marius; Mitrofan, Maria; Tufiș, Dan. "In-depth evaluation of Romanian natural language processing pipelines." Romanian Journal of Information Science and Technology (ROMJIST), vol. 24, no. 4, pp. 384-401, 2021. The article can be accessed here.
- Stanza Download (756 MB)
- RNNTagger Download (668 MB)
- NLP-Cube Download (345 MB)
- UDPipe Download (13 MB)
- TreeTagger Download (1.4 MB)
- The scripts used to train and evaluate the models are available in our GitHub repository here.
- A working version of the TTL tool is available in the TEPROLIN service repository.
- To download the corpus, visit the Universal Dependencies website or download the UD 2.7 treebanks directly from http://hdl.handle.net/11234/1-3424.
- PyEuroVoc - Classification of legal documents using EuroVoc descriptors, based on BERT models, for 22 languages (Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Spanish, Slovak, Slovene, Swedish). A GitHub repo with scripts and example usage is available here. The related paper is: Avram, Andrei-Marius; Păiș, Vasile; Tufiș, Dan. "PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors." In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 2021.
- FastText EuroVoc classification models, based on Common Crawl FastText embeddings for most languages and on CoRoLa embeddings for Romanian. Models for multiple languages can be downloaded here. A modified FastText application that allows the models to be queried online is available here.
- Romanian BERT: Two models are available, bert-base-romanian-cased-v1 and bert-base-romanian-uncased-v1. A GitHub repo with useful scripts is available here. The related paper is: Dumitrescu, Stefan; Avram, Andrei-Marius; Pyysalo, Sampo. "The birth of Romanian BERT." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 4324-4328, 2020.
- Romanian DistilBERT: Distilled from the bert-base-romanian-cased-v1 model, it is available on HuggingFace as distilbert-base-romanian-cased. A GitHub repo is available here.
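The annotation pipelines above all produce output in the CoNLL-U format used by the UD treebanks, with one token per line and ten tab-separated columns. As a minimal sketch of how to read the lemma, UPOS, XPOS and head columns from such output (the sample sentence, its annotations, and the helper function are ours for illustration, not the output of any specific tool):

```python
# Toy CoNLL-U fragment for "Pisica doarme." ("The cat sleeps.").
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
SAMPLE = (
    "1\tPisica\tpisică\tNOUN\tNcfsry\t_\t2\tnsubj\t_\t_\n"
    "2\tdoarme\tdormi\tVERB\tVmip3s\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\tPERIOD\t_\t2\tpunct\t_\t_\n"
)

def parse_conllu(block):
    """Yield (form, lemma, upos, xpos, head) tuples from one sentence block."""
    for line in block.splitlines():
        if not line or line.startswith("#"):  # skip comments and blank lines
            continue
        cols = line.split("\t")
        yield cols[1], cols[2], cols[3], cols[4], int(cols[6])

tokens = list(parse_conllu(SAMPLE))
print(tokens[0])  # ('Pisica', 'pisică', 'NOUN', 'Ncfsry', 2)
```

The same reader works on the output of any of the pipelines, since CoNLL-U fixes the column order; only the FEATS and MISC columns (elided here as `_`) vary in richness between tools.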
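EuroVoc classification, as in PyEuroVoc, is a multi-label problem: a document can receive several descriptors at once, so each descriptor is typically scored independently and kept if its sigmoid probability clears a threshold. The sketch below shows only this generic selection step; the descriptor names, logit values and threshold are invented and do not reflect PyEuroVoc's actual API or vocabulary:

```python
import math

# Invented per-descriptor logits, as a classifier head might produce them.
LOGITS = {"taxation": 2.1, "customs": -0.4, "agriculture": -3.0}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def assign(logits, threshold=0.5):
    """Keep every descriptor whose independent sigmoid score >= threshold."""
    return sorted(d for d, z in logits.items() if sigmoid(z) >= threshold)

print(assign(LOGITS))  # ['taxation']
```

Lowering the threshold trades precision for recall, which matters in legal indexing where missing a relevant descriptor is often costlier than proposing an extra one.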
Word Embeddings from the CoRoLa project
- All word embeddings from the CoRoLa project can be downloaded and used interactively here.
- The recommended model, according to a number of experiments, can be downloaded directly from here.
Biomedical Word Embeddings
These word embeddings were trained on the BioRo corpus (Mitrofan, Maria; Tufiș, Dan. "BioRo: The Biomedical Corpus for the Romanian Language." In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pp. 1192-1196, 2018). The models have dimension 300 and were trained using FastText. Three variants are available, keeping only words occurring at least 1, 5, or 20 times, respectively.
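FastText writes embeddings in the standard word2vec text format (`.vec`): a header line with the vocabulary size and dimension, then one word per line followed by its vector components. A minimal loader and cosine-similarity query could look like the sketch below; the tiny inline "file" and its 3-dimensional vectors are made up for illustration (the real models are 300-dimensional):

```python
import io
import math

# Toy stand-in for a .vec file: "<vocab_size> <dim>" header, then word + vector.
VEC_TEXT = (
    "3 3\n"
    "celulă 0.1 0.2 0.3\n"
    "țesut 0.1 0.25 0.28\n"
    "masă 0.9 -0.1 0.0\n"
)

def load_vec(fh):
    """Read word2vec-text embeddings into a {word: [float, ...]} dict."""
    n_words, dim = map(int, fh.readline().split())
    vecs = {}
    for line in fh:
        word, *vals = line.rstrip("\n").split(" ")
        vecs[word] = [float(v) for v in vals]
    return vecs

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

vecs = load_vec(io.StringIO(VEC_TEXT))
# 'celulă' (cell) should be closer to 'țesut' (tissue) than to 'masă' (table).
print(cosine(vecs["celulă"], vecs["țesut"]) > cosine(vecs["celulă"], vecs["masă"]))  # True
```

For the real files, replace the `io.StringIO` wrapper with `open(path, encoding="utf-8")`; the higher minimum-count variants trade vocabulary coverage for less noisy vectors of rare words.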