RELATE

EUROVOC Classification

The PyEuroVoc classification model is based on contextualized word embeddings and was trained using bert-base-romanian-cased-v1. A GitHub repo with scripts and example usage is available here. The related paper is: Avram Andrei-Marius, Vasile Păiș, and Dan Tufiș. "PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors." In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), 2021. The model achieves F1=80.90 on EuroVoc IDs; after conversion to MT (microthesaurus) labels the score is F1=86.12, and for top-level domains P=88.40.
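To illustrate why scores rise when EuroVoc IDs are converted to MT labels, here is a minimal sketch with invented toy data. The label names (`d1`, `mtA`, ...), the `ID_TO_MT` mapping, and the use of micro-averaging are all assumptions made for illustration; they are not taken from the paper or from the EuroVoc thesaurus.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall and F1 over per-document label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def coarsen(label_sets, mapping):
    """Map each fine-grained label to its coarser parent label."""
    return [{mapping[label] for label in labels} for labels in label_sets]


# Hypothetical descriptor-to-MT mapping (not real EuroVoc codes).
ID_TO_MT = {"d1": "mtA", "d2": "mtA", "d3": "mtB", "d4": "mtB"}

# Toy gold and predicted descriptor sets for two documents.
gold_ids = [{"d1", "d3"}, {"d4"}]
pred_ids = [{"d2", "d3"}, {"d4", "d1"}]

id_scores = micro_prf(gold_ids, pred_ids)
mt_scores = micro_prf(coarsen(gold_ids, ID_TO_MT), coarsen(pred_ids, ID_TO_MT))

print("ID level: P=%.3f R=%.3f F1=%.3f" % id_scores)
print("MT level: P=%.3f R=%.3f F1=%.3f" % mt_scores)
```

In this toy case, predicting `d2` where `d1` was expected counts as an error at the ID level but as a match at the MT level, because both descriptors fall under the same hypothetical MT label.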

The FastText EuroVoc classification model for Romanian was trained using FastText with CoRoLa-based word embeddings. It is served using a modified version of FastText that supports serving trained models; this modified version can be downloaded from our GitHub: https://github.com/racai-ai/ServerFastText. The model achieves P=50.93, R=56.40, F1=53.53 on EuroVoc IDs; after conversion to MT labels the scores are P=56.05, R=68.95, F1=61.83, and for top-level domains P=64.9, R=77.89, F1=70.80. The model can be downloaded here: BIN or VEC.
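The reported F1 values follow from the precision and recall figures as their harmonic mean, F1 = 2PR / (P + R). The short check below recomputes them; the function name `f1_score` and the dictionary layout are illustrative choices, not part of any released tooling.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) for the FastText model, as stated in the text above.
reported = {
    "EuroVoc IDs": (50.93, 56.40, 53.53),
    "MT labels": (56.05, 68.95, 61.83),
    "top-level domains": (64.9, 77.89, 70.80),
}

for level, (p, r, f1) in reported.items():
    print(f"{level}: computed F1={f1_score(p, r):.2f}, reported F1={f1:.2f}")
```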