Morphological analyzer of the Tatar language

The Tatar language morphological analyzer was developed with the Helsinki Finite-State Technology (HFST) toolkit and is based on the two-level model of morphology (Koskenniemi, 1983).

The model recognizes 12 types of roots (parts of speech, POS), 81 derivational and inflectional affixes, and 11 additional tags (punctuation marks, etc.).
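As a rough illustration of what such an analysis looks like, the sketch below maps a surface word form to a root, a POS tag, and a chain of affix tags. It is a toy suffix-stripping routine in plain Python, not the HFST transducer itself; the tag names (N, PL, LOC), the two-word lexicon, and the function name analyze are assumptions made for this example and are not taken from the analyzer's actual tag set or code.

```python
# Toy lexicon: root -> POS tag (tag names are hypothetical).
ROOTS = {"китап": "N", "бала": "N"}

# Toy affix table: surface allomorph -> affix tag (hypothetical tags).
SUFFIXES = {
    "лар": "PL", "ләр": "PL",
    "да": "LOC", "дә": "LOC", "та": "LOC", "тә": "LOC",
}


def analyze(word):
    """Return every analysis of `word` as a 'root+POS+TAG...' string."""
    results = []

    def strip(rest, tags):
        # If what is left is a known root, record one complete analysis.
        if rest in ROOTS:
            results.append("+".join([rest, ROOTS[rest]] + tags))
        # Try to peel one more suffix off the right edge.
        for suffix, tag in SUFFIXES.items():
            if rest.endswith(suffix) and len(rest) > len(suffix):
                strip(rest[: -len(suffix)], [tag] + tags)

    strip(word, [])
    return results


print(analyze("китапларда"))  # "in books": root + plural + locative
print(analyze("балалар"))     # "children": root + plural
```

Unlike a two-level model, this sketch does not check vowel harmony or other phonological constraints between root and affixes; it only shows the shape of the output.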

The module can also resolve morphological ambiguity using a hybrid approach that combines contextual rules with a statistical (probabilistic) model.
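One way such a hybrid scheme can be organized is sketched below: contextual rules first discard readings that the context rules out, and a probability table then chooses among whatever remains. The single rule, the tag names, the probabilities, and all function names are invented for illustration; the module's actual rules and statistical model are not public and may work differently.

```python
def apply_rules(candidates, prev_tag):
    """Contextual step: drop readings that a (hypothetical) rule forbids."""
    # Toy rule: directly after an adjective, discard verb readings.
    if prev_tag == "ADJ":
        kept = [c for c in candidates if "V" not in c.split("+")]
        # Never let a rule delete every reading.
        return kept or candidates
    return candidates


def pick_most_probable(candidates, probs):
    """Statistical step: choose the reading with the highest probability."""
    return max(candidates, key=lambda c: probs.get(c, 0.0))


def disambiguate(candidates, prev_tag, probs):
    """Apply rules first; fall back to probabilities if ambiguity remains."""
    remaining = apply_rules(candidates, prev_tag)
    if len(remaining) == 1:
        return remaining[0]
    return pick_most_probable(remaining, probs)


# "ат" can be read as a noun ("horse") or as an imperative verb ("throw").
candidates = ["ат+N", "ат+V+IMP"]
# Made-up probabilities, for illustration only.
probs = {"ат+N": 0.4, "ат+V+IMP": 0.6}

print(disambiguate(candidates, "ADJ", probs))  # the rule removes the verb reading
print(disambiguate(candidates, None, probs))   # no rule fires; probabilities decide
```

Running the rules before the statistical step means the probability table is consulted only for cases the rules leave ambiguous.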

Processing speed is about 10,000 tokens per second.

The module's source code is open (except for the rules and dictionaries) and is available at bitbucket.org/yaugear/py_tat_morphan.

The morphological analyzer was used to annotate the Tatar National Corpus “Tugan Tel” and the University Information System RUSSIA (UIS RUSSIA). It is also used in teaching the “Philology: applied philology” (45.03.01) program at Kazan Federal University.

Related works:

Gataullin, R. Morphological Analysis System of the Tatar Language / Gataullin Ramil, Gilmullin Rinat // Computational Collective Intelligence. ICCCI 2017. Lecture Notes in Computer Science. Springer, Cham. – Nicosia, Cyprus, 2017. – Vol. 10449. – P. 519-528.
Khakimov, B. E. Context-Based Rules for Grammatical Disambiguation in the Tatar Language / Gataullin R. R., Khakimov B. E., Suleymanov D. Sh., Gilmullin R. A. // Computational Collective Intelligence. ICCCI 2017. Lecture Notes in Computer Science. Springer, Cham. – Nicosia, Cyprus, 2017. – Vol. 10449. – P. 529-537.

Tatarstan Academy of Sciences

Institute of Applied Semiotics
