Named Entity Recognition

Select the desired NER model and enter a Romanian-language document for entity recognition. Models based on the LegalNERo corpus expect legislative text. Models based on the SiMoNERo corpus expect text from the biomedical domain. Using texts from other domains reduces the quality of the results. So that the results can be anchored correctly in the submitted text, no data cleaning is performed. The text must contain only Romanian-language characters, with commas and periods separated from the preceding word.
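Because the demo performs no cleaning, the input format requirement (commas and periods separated from the preceding word) has to be met before the text is submitted. The snippet below is a minimal sketch of one way to do that in Python. The function name `separate_punctuation` is hypothetical and not part of the demo, and the rule is deliberately naive: it also splits abbreviations such as "nr." and decimal numbers, which may or may not match how the training data was tokenized.

```python
import re


def separate_punctuation(text: str) -> str:
    """Insert a space before any comma or period that directly follows a word character.

    Hypothetical helper, not part of the demo. Punctuation that is already
    preceded by whitespace is left unchanged.
    """
    # (?<=\w) matches only when the previous character is a letter, digit or
    # underscore; in Python 3 this includes Romanian letters such as ă, î, ș, ț.
    return re.sub(r"(?<=\w)([.,])", r" \1", text)


example = "Legea nr. 5, publicată în 2020."
print(separate_punctuation(example))
```

Entity offsets returned by the system refer to the text as submitted, so any offsets would need to be mapped back if the original, unspaced text is the one of interest.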



LegalNERo is a manually annotated corpus for Romanian named entity recognition (NER) in the legal domain, available for download here: https://doi.org/10.5281/zenodo.4772094. If you use this corpus, please cite it as: Păiș, Vasile, Mitrofan, Maria, Gasan, Carol Luca, Ianov, Alexandru, Ghiță, Corvin, Coneschi, Vlad Silviu, & Onuț, Andrei. (2021). Romanian Named Entity Recognition in the Legal domain (LegalNERo) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4772094

A brief annotation guidelines document is available here.

The NER system using LegalNERo can be cited as: Păiș, Vasile and Mitrofan, Maria and Gasan, Carol Luca and Coneschi, Vlad and Ianov, Alexandru. Named Entity Recognition in the Romanian Legal Domain. In Proceedings of the Natural Legal Language Processing Workshop 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, pp. 9–18, November 2021.

Model evaluation and Download
Models were built using pre-trained word embeddings and on-the-fly character embeddings fed into a BiLSTM layer. Some models also use additional gazetteer resources. We used 80% of the data for training, 10% for validation during training, and 10% for testing. The splits used can be downloaded here: all classes and 4 classes (without the LEGAL class). For word embeddings, we used the CoRoLa-based embeddings described in Vasile Păiș and Dan Tufiș. “Computing distributed representations of words using the CoRoLa corpus”. In: Proceedings of the Romanian Academy Series A - Mathematics Physics Technical Sciences Information Science 19.2 (2018), pp. 185–191 (available for download here). Additional embeddings trained on the MARCELL legislative corpus are available here.
Experiments with other models not available in the demo can be seen here: Report1 and Report2.
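
The 80/10/10 partition described above can be illustrated with a short Python sketch. This is not the procedure used to produce the published splits, which should be downloaded from the links above when reproducing the reported scores. The function name `split_documents`, the document-level granularity, and the fixed seed are all assumptions made for the example.

```python
import random


def split_documents(doc_ids, train_pct=80, dev_pct=10, seed=0):
    """Shuffle identifiers and cut them into train/validation/test lists.

    Hypothetical helper: the percentages follow the 80/10/10 description,
    but the unit being split (documents) and the seed are assumptions.
    """
    ids = sorted(doc_ids)               # sort first so the result depends only on the seed
    random.Random(seed).shuffle(ids)    # local RNG, leaves the global random state alone
    n_train = len(ids) * train_pct // 100
    n_dev = len(ids) * dev_pct // 100
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]        # whatever remains goes to the test set
    return train, dev, test


train, dev, test = split_documents([f"doc{i}" for i in range(100)])
print(len(train), len(dev), len(test))
```

Integer arithmetic is used for the cut points so that rounding of floating-point fractions cannot shift a boundary by one item.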

LegalNERo_LEGAL_PER_LOC_ORG_TIME_Gazetteer Download
Trained using PharmaCoNER-Tagger (NeuroNER variant) with MARCELL embeddings and Gazetteer. Overall macro F1=85.34 on the test set. Individual F1 scores:

            LEGAL: 86.98
              LOC: 75.94
              ORG: 80.60
              PER: 98.48
             TIME: 84.09

LegalNERo_PER_LOC_ORG_TIME_Gazetteer Download
Trained using PharmaCoNER-Tagger (NeuroNER variant) with a combination of CoRoLa and MARCELL embeddings, Gazetteer and Affixes. Overall macro F1=86.84 on the test set. Individual F1 scores:
              LOC: 76.01
              ORG: 80.89
              PER: 98.86
             TIME: 91.39

LegalNERo_LEGAL_PER_LOC_ORG_TIME Download
Trained using NeuroNER with CoRoLa embeddings. Overall F1=84.00 on the test set. Individual F1 scores:
            LEGAL: 89.00
              LOC: 75.07
              ORG: 77.90
              PER: 95.56
             TIME: 87.32

LegalNERo_PER_LOC_ORG_TIME Download
Trained using NeuroNER with CoRoLa embeddings. Overall F1=84.70 on the test set. Individual F1 scores:
              LOC: 71.43
              ORG: 81.28
              PER: 97.73
             TIME: 89.77

Scientific papers