The TEPROLIN Web Service

Radu Ion (radu@racai.ro)

Introduction

The TEPROLIN Web Service (WS) was developed and is maintained within the ReTeRom project. Its backend is the TEPROLIN text preprocessing platform, which incorporates several NLP applications and exposes them through a unified access interface, a Python 3 object.

TEPROLIN currently offers 15 text preprocessing operations for Romanian, 13 of which are described in (Ion, 2018). These are:

text-normalization

diacritics-restoration

word-hyphenation

word-stress-identification

word-phonetic-transcription

numeral-rewriting

abbreviation-rewriting

sentence-splitting

tokenization

pos-tagging

lemmatization

named-entity-recognition

biomedical-named-entity-recognition

chunking

dependency-parsing

Configuration options

GET requests return configuration information. Assuming that the WS is running on http://127.0.0.1:5000,

curl http://127.0.0.1:5000/operations

will return a JSON object with the list of 15 operations mentioned above:

A GET request for one of TEPROLIN's operations, e.g.

curl http://127.0.0.1:5000/apps/pos-tagging

will return the JSON object with the list of the NLP apps that can perform it:

The first NLP app is the default app to execute the operation. In the example above, pos-tagging is executed with nlp-cube-adobe.
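The two configuration queries above can also be issued from Python. The following is a minimal sketch that uses only the standard library; the helper names operations_url, apps_url and get_json are illustrative and not part of TEPROLIN, and the base address is the one assumed in the curl examples.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Address assumed throughout this document; change it to match your deployment.
BASE_URL = "http://127.0.0.1:5000"


def operations_url(base_url=BASE_URL):
    """Return the URL that lists the operations TEPROLIN can perform."""
    return f"{base_url.rstrip('/')}/operations"


def apps_url(operation, base_url=BASE_URL):
    """Return the URL that lists the NLP apps able to perform `operation`."""
    return f"{base_url.rstrip('/')}/apps/{quote(operation, safe='')}"


def get_json(url, timeout=10):
    """Send a GET request to `url` and decode the JSON body of the response."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a running WS, get_json(apps_url("pos-tagging")) would return the parsed JSON object for the pos-tagging operation. The sketch does not assume anything about the layout of that object.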

Here is the complete list of NLP apps that TEPROLIN currently incorporates, for each operation:

text-normalization

tnorm-icia: an in-house developed Python 3 class that replaces the old Romanian diacritics (ş and ţ, written with a cedilla) with their new variants (ș and ț, written with a comma below), collapses multiple spaces and normalizes the dash characters.

diacritics-restoration

diac-restore-icia: an in-house developed diacritic restoration algorithm based on word n-grams and Viterbi decoding. Developed by Tiberiu Boroș in Java, it has been ported to Python 3 and included in TEPROLIN.

word-hyphenation

mlpla-icia: developed in Java by Tiberiu Boroș et al. (2018).

word-stress-identification

mlpla-icia

word-phonetic-transcription

mlpla-icia

numeral-rewriting

mlpla-icia: developed in Java by Radu Ion et al. (2020) and integrated into the mlpla-icia application.

abbreviation-rewriting

mlpla-icia

sentence-splitting

ttl-icia: provided by the TTL Perl module (Ion, 2007).
nlp-cube-adobe: provided by the NLP-Cube Python 3 module (Boroș et al., 2018).
udpipe-ufal: provided by the UDPipe 1 Python 3 module (Straka et al., 2016).

tokenization

ttl-icia
nlp-cube-adobe
udpipe-ufal

pos-tagging

ttl-icia
nlp-cube-adobe
udpipe-ufal

lemmatization

ttl-icia
nlp-cube-adobe
udpipe-ufal

named-entity-recognition

ner-icia: provided by the web service developed by Vasile Păiș, available in this NER interface.

biomedical-named-entity-recognition

bioner-icia: provided by a previous version of the NLP-Cube Python 3 module (Boroș et al., 2018).

chunking

ttl-icia

dependency-parsing

nlp-cube-adobe
udpipe-ufal

Annotating text

To annotate text, send POST requests to the /process URL. TEPROLIN is a REST WS, so nothing is saved between requests: if you want to use a different NLP app for a given operation, you must send the configuration option along with the text to be processed, in every request. For a full list of which operations can be executed with which NLP apps, see the previous section.

The POST request uses the application/x-www-form-urlencoded MIME type. The body of the request must contain only the following key=value pairs, concatenated with the & character:

text=text to be annotated here...

<operation>=<NLP app> (e.g. pos-tagging=ttl-icia)

exec=<operation>,<operation>,...

If exec is present, then the requested operations are performed in the proper order (the client need not bother with the order). TEPROLIN will infer the order of function calls and the modules to run such that the requested annotations are returned to the client. If exec is not present, then the full processing chain is executed (all 15 operations).

If any configuration option is present, then the specified operation(s) will be performed with the requested NLP app(s) (e.g. pos-tagging is performed with the ttl-icia NLP app).

Finally, text is the only required key; it contains the text to be processed.
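The request format described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the official client: the helper names build_body and annotate are hypothetical, and the URL is the one assumed in the curl examples. The body is built with the standard urllib.parse.urlencode function, which produces the key=value pairs joined by & and percent-encodes the values.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Address assumed throughout this document; change it to match your deployment.
PROCESS_URL = "http://127.0.0.1:5000/process"


def build_body(text, operations=None, apps=None):
    """Encode the key=value pairs of a /process request.

    text:       the text to annotate (the only required key)
    operations: optional list of operation names, sent as the exec key
    apps:       optional {operation: NLP app} configuration options
    """
    pairs = [("text", text)]
    for operation, app in (apps or {}).items():
        pairs.append((operation, app))
    if operations:
        pairs.append(("exec", ",".join(operations)))
    return urlencode(pairs).encode("utf-8")


def annotate(text, operations=None, apps=None, url=PROCESS_URL, timeout=60):
    """POST the text to TEPROLIN and return the decoded JSON response."""
    request = Request(url, data=build_body(text, operations, apps), method="POST")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For instance, annotate("Some text.", operations=["pos-tagging"], apps={"pos-tagging": "ttl-icia"}) would request POS tagging with the ttl-icia NLP app, provided the WS is running at the assumed address.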

The returned JSON object

TEPROLIN WS will respond with a JSON object containing two keys:

teprolin-conf: contains the active configuration that produced the result, in the form of <operation>: <NLP app> pairs;
teprolin-result: contains the text annotation or, if an error occurred, the error message. The annotation has the following keys:
- text: is the text that has been normalized, including here the automatic insertion of diacritics;
- sentences: is the list of sentence strings that have been detected in text;
- tokenized: contains the list of JSON objects for each token in each sentence. A JSON token has the following attributes:
  - _id: the index of the token in the sentence, 1-based numbering;
  - _wordform: the occurrence of the word in the sentence;
  - _ctg: the corpus (reduced) POS tag of the word;
  - _msd: the MSD (full) POS tag of the word;
  - _lemma: the lemma of the _wordform;
  - _head: the head of this token in the dependency analysis tree;
  - _deprel: the name of the dependency relation between this token and its head;
  - _expand: if the token is a number or an abbreviation, its expanded literal form is given here;
  - _chunk: the chunk(s) in which the token is included;
  - _ner: the named entity annotation of the token (one of LOCation, PERson or ORGanization);
  - _bner: the biomedical named entity inside–outside–beginning annotation of the token (one of B-DISO|I-DISOrder, B-ANAT|I-ANATomy, B-PROC|I-PROCedure or B-CHEM|I-CHEMical);
  - _phon: the phonetic transcription of the _wordform; phonemes are separated by '.';
  - _syll: the syllables of the _wordform; syllables are separated by '.' and the stressed syllable is marked with an apostrophe (').
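The keys listed above can be read from the response with a few lines of Python. The sketch below rests on two assumptions that should be checked against a real response: that tokenized is a list of sentences, each one a list of token objects, and that tokens outside any biomedical entity carry a _bner value that starts with neither B- nor I- (for example O). The helper names and the SAMPLE object are hypothetical; SAMPLE is hand-written for illustration and is not actual TEPROLIN output.

```python
def iter_tokens(response):
    """Yield every token object of a response, sentence by sentence."""
    for sentence in response["teprolin-result"]["tokenized"]:
        for token in sentence:
            yield token


def biomedical_entities(sentence_tokens):
    """Group the _bner tags of one sentence into (type, text) pairs."""
    entities, current_type, current_words = [], None, []
    for token in sentence_tokens:
        tag = token.get("_bner") or ""
        prefix, _, entity_type = tag.partition("-")
        # An I- tag of the same type continues the entity that is open.
        if prefix == "I" and entity_type == current_type:
            current_words.append(token["_wordform"])
            continue
        # Any other tag closes the open entity, if there is one.
        if current_words:
            entities.append((current_type, " ".join(current_words)))
        if prefix in ("B", "I") and entity_type:
            current_type, current_words = entity_type, [token["_wordform"]]
        else:
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities


# Hand-written sample with the assumed layout (NOT actual TEPROLIN output).
SAMPLE = {"teprolin-result": {"tokenized": [[
    {"_id": 1, "_wordform": "Diabetul", "_bner": "B-DISO"},
    {"_id": 2, "_wordform": "zaharat", "_bner": "I-DISO"},
    {"_id": 3, "_wordform": "se", "_bner": "O"},
]]}}
```

Calling biomedical_entities on the first sentence of SAMPLE groups the two DISO tokens into a single entity, while iter_tokens(SAMPLE) visits the three token objects in order.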

For example, the output for the command


curl http://127.0.0.1:5000/process -d "text=Diabetul zaharat se remarca prin valori crescute ale concentratiei glucozei in sange." -d "exec=biomedical-named-entity-recognition"

is the following:

Getting statistics about platform usage

The TEPROLIN platform can offer statistics about the following types of events:

annotated tokens: depending on the specified time interval, the number of annotated tokens is returned, for each time period;
received GET or POST requests: the number of requests received by the platform.

To get frequency information about these events, send GET requests to URLs that start with the /stats prefix. To obtain the full URL, append, in this order, a statistics type (tokens or requests), a time period (year, month or day) and the size of the history to retrieve (an integer).

For example, to get a breakdown of the number of tokens processed in the past 5 days (including the present day), send this query:


curl http://127.0.0.1:5000/stats/tokens/day/5

In order to get the number of requests for the current month, send this query:


curl http://127.0.0.1:5000/stats/requests/month/1
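The URL pattern used in the two queries above can be assembled and validated in Python. This is a minimal sketch: the function name stats_url is hypothetical, the base address is the one assumed in the curl examples, and the requirement that the history size be a positive integer is an assumption, since the text only says it is an integer.

```python
STATS_TYPES = ("tokens", "requests")
PERIODS = ("year", "month", "day")


def stats_url(stats_type, period, history, base_url="http://127.0.0.1:5000"):
    """Build a /stats URL from a statistics type, a time period and a history size."""
    if stats_type not in STATS_TYPES:
        raise ValueError(f"statistics type must be one of {STATS_TYPES}")
    if period not in PERIODS:
        raise ValueError(f"time period must be one of {PERIODS}")
    # Assumption: the history size is a positive integer.
    if isinstance(history, bool) or not isinstance(history, int) or history < 1:
        raise ValueError("history size must be a positive integer")
    return f"{base_url.rstrip('/')}/stats/{stats_type}/{period}/{history}"
```

The resulting string can be passed to any HTTP client that sends GET requests, curl included.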

TEPROLIN will respond with a JSON object that contains the list of counts for the specified statistics type. For the first request, the response looks like this:

References

Tiberiu Boroș, Ștefan Daniel Dumitrescu and Ruxandra Burtica. (2018). NLP-Cube: End-to-End Raw Text Processing With Neural Networks. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics. pp. 171--179. October 2018

Tiberiu Boroș, Ștefan Daniel Dumitrescu and Vasile Păiș. (2018). Tools and resources for Romanian text-to-speech and speech-to-text applications. arXiv:1802.05583v1 [cs.CL]

Radu Ion. (2018). TEPROLIN: An Extensible, Online Text Preprocessing Platform for Romanian. In Proceedings of the International Conference on Linguistic Resources and Tools for Processing Romanian Language (ConsILR 2018), November 22-23, 2018, Iași, România.

Radu Ion, Badea V. G., Cioroiu G., Barbu Mititelu V., Irimia E., Mitrofan M. and Tufiș D. (2020). A Dialog Manager for Micro-Worlds. Studies in Informatics and Control, 29(4) 401--410, December 2020. ISSN: 1220-1766

Milan Straka, Jan Hajič and Jana Straková. (2016). UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association, Portorož, Slovenia.