Romanian Language Resources Repository
- RODICA.tar.bz2 (3.16 MB)
- Gifu, Daniela (2016). Lexical Semantics in Text Processing. Contrastive Diachronic Studies on Romanian Language. PhD Thesis, "Alexandru Ioan Cuza" University of Iași, Romania, https://profs.info.uaic.ro/~daniela.gifu/PhD%20Daniela%20Gifu%202016/PhD%20thesis%20Daniela%20Gifu%20final.pdf
- Gîfu, Daniela and Simionescu, Radu (2018). Tracing Language Variation for Romanian. In 17th International Conference (CICLing), Revised Selected Papers, Part II, LNCS, Vol. 9624, 2018, Springer-Verlag Berlin Heidelberg, pages 599-610, https://link.springer.com/chapter/10.1007/978-3-319-75487-1_45
- Gifu, Daniela (2015). Contrastive Diachronic Study on Romanian Language. In Proceedings FOI-2015, pages 296-310, https://www.researchgate.net/publication/282017878_Contrastive_diachronic_study_on_Romanian_language
- Gîfu, Daniela (2016). ROmanian DIachronic Corpus with Annotations (RODICA). Dataset, RELATE Repository, https://relate.racai.ro/repository/rodica
RODICA (ROmanian DIachronic Corpus with Annotations) (Gifu, 2016) is a collection of publications written in the middle of the 19th century in two countries, Romania and the Republic of Moldova. The corpus includes articles from four historical provinces (Moldova, Transylvania, Wallachia and Bessarabia), printed in the period 1817-2015 and written in Latin script. RODICA represents a first iteration towards building a Gold corpus for each region, centered on diachronic meta-annotation.

The corpus was built in two steps. First, because the source texts were in PDF, boilerplate removal was applied to obtain raw text in TXT format (UTF-8 encoding), using the Apache PDFBox Java PDF library; several corrections were then made to the raw texts. Second, the texts were processed with the UAIC POS-Tagger: segmentation, tokenization, lemmatization, part-of-speech tagging, and NotInDict markup (Gifu & Simionescu, 2018; Gifu, 2015).
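
To make the second step more concrete, here is a minimal Python sketch of segmentation, tokenization, and a NotInDict-style markup. It is only an illustration and not the UAIC POS-Tagger: the regular expressions, the function names, and the tiny `LEXICON` are all assumptions made for this example, and lemmatization and part-of-speech tagging are left out.

```python
import re

# Hypothetical toy lexicon, used only for illustration; the dictionary
# behind the UAIC POS-Tagger is not described on this page.
LEXICON = {"limba", "română", "este", "o", "romanică"}

# Assumed rule: a sentence ends at whitespace that follows '.', '!' or '?'.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Assumed rule: a token is a run of word characters or one punctuation mark.
TOKEN = re.compile(r"\w+|[^\w\s]")


def segment(text):
    """Split raw text into sentences."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)


def mark_not_in_dict(tokens, lexicon=LEXICON):
    """Pair each token with True when it is a word missing from the lexicon."""
    marked = []
    for tok in tokens:
        is_word = tok[0].isalnum()  # punctuation is never flagged
        marked.append((tok, is_word and tok.lower() not in lexicon))
    return marked


def process(text):
    """Run segmentation, tokenization and NotInDict marking on raw text."""
    return [mark_not_in_dict(tokenize(s)) for s in segment(text)]


if __name__ == "__main__":
    for sentence in process("Limba română este o limbă romanică. Ea vine!"):
        print(sentence)
```

With this toy lexicon, `limbă` is flagged because only `limba` is listed, which loosely mirrors how a word form absent from a reference dictionary would receive a NotInDict mark.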