Romanian Language Resources Repository
Gîfu, Daniela
"Alexandru Ioan Cuza" University of Iași
License:  CC BY-NC 4.0
Size:  151 annotated files
Please include one or more of the following references in your research work:
The "Dataset"-type reference is:

RODICA (ROmanian DIachronic Corpus with Annotations) (Gifu, 2016) is a collection of publications written in the middle of the 19th century in two countries, Romania and the Republic of Moldova. The corpus includes articles from four historical provinces (Moldova, Transylvania, Wallachia and Bessarabia), printed in the period 1817-2015 and written in Latin script. RODICA represents a first iteration towards building a gold corpus for each region, centered on diachronic meta-annotation. The corpus was built in two steps. First, because the corpus was available in PDF format, boilerplate removal was applied to obtain raw text in TXT format (UTF-8 encoding), using the Java PDF library Apache PDFBox; several corrections were then made to the raw texts. Second, processing continued with segmentation, tokenization, lemmatization, part-of-speech tagging, and NotInDict markup using the UAIC POS-Tagger (Gifu & Simionescu, 2018; Gifu, 2015).
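
To make the processing steps more concrete, the following is a minimal Python sketch of three of them: sentence segmentation, tokenization, and NotInDict markup. It is an illustration only and not the UAIC POS-Tagger: the function names, the regular expressions, and the three-word toy lexicon are all assumptions introduced here, and lemmatization and part-of-speech tagging are left out because they need a trained model.

```python
from __future__ import annotations

import re
import unicodedata

# Hypothetical toy lexicon (assumption): a stand-in for whatever dictionary
# a real tagger would consult. It contains only modern spellings.
TOY_LEXICON = {"eu", "sunt", "aici"}


def normalize(raw: str) -> str:
    """NFC-normalize the text and collapse whitespace runs left by PDF extraction."""
    text = unicodedata.normalize("NFC", raw)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Naive segmentation: split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence: str) -> list[str]:
    """Return word tokens (hyphenated forms such as 'într-o' stay whole)
    and single punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)


def mark_not_in_dict(tokens: list[str], lexicon: set[str]) -> list[tuple[str, bool]]:
    """Pair each token with a flag that is True when a word token
    (not punctuation) is missing from the lexicon."""
    return [(t, t[0].isalnum() and t.lower() not in lexicon) for t in tokens]


if __name__ == "__main__":
    raw = "Eu  sînt\naici. Eu sunt aici."
    for sentence in split_sentences(normalize(raw)):
        print(mark_not_in_dict(tokenize(sentence), TOY_LEXICON))
    # First line printed:
    # [('Eu', False), ('sînt', True), ('aici', False), ('.', False)]
```

In this toy run the variant spelling "sînt" is flagged because only "sunt" is in the lexicon, which is the kind of signal a NotInDict marker can provide when spellings change over time. A real lexicon and segmentation rules would be far larger and more careful than the ones assumed here.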