Named Entity Recognition
LegalNERo is a manually annotated corpus for Romanian named entity recognition (NER) in the legal domain, available for download here: https://doi.org/10.5281/zenodo.4772094. If you use this corpus, please cite it as: Păiș, Vasile, Mitrofan, Maria, Gasan, Carol Luca, Ianov, Alexandru, Ghiță, Corvin, Coneschi, Vlad Silviu, & Onuț, Andrei. (2021). Romanian Named Entity Recognition in the Legal domain (LegalNERo) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4772094
A brief annotation guidelines document is available here.
The NER system using LegalNERo can be cited as: Păiș, Vasile, Mitrofan, Maria, Gasan, Carol Luca, Coneschi, Vlad, & Ianov, Alexandru. (2021). Named Entity Recognition in the Romanian Legal Domain. In Proceedings of the Natural Legal Language Processing Workshop 2021, pp. 9–18. Association for Computational Linguistics, Punta Cana, Dominican Republic, November 2021.
Model Evaluation and Download
Models were built using pre-trained word embeddings and on-the-fly character embeddings fed into a BiLSTM layer. Some models also used gazetteer resources. We used 80% of the data for training, 10% for validation during training, and 10% for testing. The splits we used can be downloaded here: all classes and 4 classes (without the LEGAL class).
For word embeddings, we used the CoRoLa-based embeddings described in Vasile Păiș and Dan Tufiș. “Computing distributed representations of words using the CoRoLa corpus”. In: Proceedings of the Romanian Academy Series A - Mathematics Physics Technical Sciences Information Science 19.2 (2018), pp. 185–191 (available for download here). Additional embeddings trained on the MARCELL legislative corpus are available here.
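The 80/10/10 partition described above can be sketched as follows. This is only an illustration under stated assumptions: the function name split_corpus, the fixed seed, and the placeholder file names are hypothetical, and the sketch does not reproduce the actual published splits, which should be downloaded instead when comparing results.

```python
import random


def split_corpus(documents, train=0.8, valid=0.1, seed=42):
    """Shuffle documents and split them into train/validation/test parts.

    Hypothetical helper: the real LegalNERo splits are distributed with the
    corpus; this only illustrates an 80/10/10 partition at document level.
    """
    if not (0 < train < 1 and 0 <= valid < 1 and train + valid < 1):
        raise ValueError("train and valid must leave a non-empty test share")
    docs = list(documents)
    # A seeded generator keeps the shuffle reproducible across runs.
    random.Random(seed).shuffle(docs)
    n_train = round(len(docs) * train)
    n_valid = round(len(docs) * valid)
    return (
        docs[:n_train],
        docs[n_train:n_train + n_valid],
        docs[n_train + n_valid:],
    )


# Placeholder document names, not the real corpus files.
example_docs = [f"doc_{i}.txt" for i in range(100)]
train_docs, valid_docs, test_docs = split_corpus(example_docs)
print(len(train_docs), len(valid_docs), len(test_docs))
```

Splitting at document level (rather than sentence level) is a design choice made here for simplicity; the source does not state which unit was used.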
Experiments with other models not available in the demo can be seen here: Report1 and Report2.
Model trained using PharmaCoNER-Tagger (NeuroNER variant) with MARCELL embeddings and a gazetteer (all classes). Overall macro F1 = 85.34 on the test set. Individual F1 scores:
LEGAL: 86.98, LOC: 75.94, ORG: 80.60, PER: 98.48, TIME: 84.09
Model trained using PharmaCoNER-Tagger (NeuroNER variant) with a combination of CoRoLa and MARCELL embeddings, a gazetteer, and affixes (4 classes, without LEGAL). Overall macro F1 = 86.84 on the test set. Individual F1 scores:
LOC: 76.01, ORG: 80.89, PER: 98.86, TIME: 91.39
Model trained using NeuroNER with CoRoLa embeddings (all classes). Overall F1 = 84.00 on the test set. Individual F1 scores:
LEGAL: 89.00, LOC: 75.07, ORG: 77.90, PER: 95.56, TIME: 87.32
Model trained using NeuroNER with CoRoLa embeddings (4 classes, without LEGAL). Overall F1 = 84.70 on the test set. Individual F1 scores:
LOC: 71.43, ORG: 81.28, PER: 97.73, TIME: 89.77
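To show how per-class scores can be summarized into a single figure, the sketch below takes the unweighted mean of the per-class F1 values of the first model listed above. This is only one common definition of macro F1, and the helper name macro_f1 is hypothetical. The overall scores reported here were presumably produced by the taggers' own evaluation, so a plain mean of the per-class values need not match them exactly.

```python
def macro_f1(per_class_f1):
    """Return the unweighted mean of per-class F1 scores.

    Assumption: macro F1 is taken as the simple average over classes;
    the reported overall scores may have been computed differently.
    """
    if not per_class_f1:
        raise ValueError("at least one class score is required")
    return sum(per_class_f1.values()) / len(per_class_f1)


# Per-class F1 scores copied from the first model in the list above.
scores = {"LEGAL": 86.98, "LOC": 75.94, "ORG": 80.60, "PER": 98.48, "TIME": 84.09}
print(f"{macro_f1(scores):.2f}")
```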