Pre-Trained Language Models
Annotation models for lemma, UPOS, XPOS and dependency parsing (where supported) trained on RRT UD 2.7.
- Stanza Download (756 MB)
- RNNTagger Download (668 MB)
- NLP-Cube Download (345 MB)
- UDPipe Download (13 MB)
- TreeTagger Download (1.4 MB)
- The scripts used to train and evaluate the models are available on our GitHub here.
- A working version of the TTL tool is available in the TEPROLIN service repository.
- To download the corpus, visit the Universal Dependencies website, or download the UD 2.7 treebanks directly from http://hdl.handle.net/11234/1-3424.
- PyEuroVoc - Classification of legal documents using EuroVoc descriptors, based on BERT models, for 22 languages (Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Spanish, Slovak, Slovene, Swedish). A GitHub repo with scripts and example usage is available here. The related paper is Avram, Andrei-Marius, Vasile Păiș, and Dan Tufiș. "PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors." In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 2021.
- FastText EuroVoc classification models, based on Common Crawl FastText embeddings for most languages and on CoRoLa embeddings for Romanian. The models, available for multiple languages, can be downloaded here. A modified FastText application that allows the models to be queried online is available here.
- RoBERT: Two models are available: bert-base-romanian-cased-v1 and bert-base-romanian-uncased-v1. A GitHub repo with useful scripts is available here. The related paper is Dumitrescu, Stefan, Andrei-Marius Avram, and Sampo Pyysalo. "The birth of Romanian BERT." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 4324-4328, 2020.
- Romanian DistilBERT: Built on the bert-base-romanian-cased-v1 model and available on HuggingFace as distilbert-base-romanian-cased. A GitHub repo is available here.
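The RoBERT and Romanian DistilBERT checkpoints listed above can typically be loaded through the HuggingFace transformers library. The following is a minimal sketch, not an official usage guide: the organisation prefixes in the hub identifiers are assumptions (the list gives only the bare model names), and the helper names `MODEL_IDS`, `load_model` and `embed` are made up for illustration.

```python
# Hub identifiers are ASSUMPTIONS: only the bare model names appear in the
# list above; the organisation prefixes are not confirmed by this page.
MODEL_IDS = {
    "robert-cased": "dumitrescustefan/bert-base-romanian-cased-v1",
    "robert-uncased": "dumitrescustefan/bert-base-romanian-uncased-v1",
    "distilbert-cased": "racai/distilbert-base-romanian-cased",
}


def load_model(key):
    """Return (tokenizer, model) for one of the keys in MODEL_IDS."""
    # Imported lazily so the identifiers can be inspected without
    # transformers being installed.
    from transformers import AutoModel, AutoTokenizer

    model_id = MODEL_IDS[key]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


def embed(text, key="robert-cased"):
    """Encode text and return the model's last hidden states."""
    import torch

    tokenizer, model = load_model(key)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    # Shape: (1, number_of_tokens, hidden_size)
    return outputs.last_hidden_state
```

Calling `embed("Acesta este un test.")` downloads the weights on first use and returns one vector per token; switching `key` to `"distilbert-cased"` selects the smaller model through the same interface.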
Word Embeddings from the CoRoLa project