Title |
Arabic Part-Of-Speech Tagging using the Sentence Structure |
Authors |
Yahya Ould Mohamed El Hadj, Imad Abdulrahman Al-Sughayeir and Abdullah Mahdi Al-Ansari |
Abstract |
This work presents a system for Arabic part-of-speech tagging that combines statistical and linguistic approaches, with processing performed at two levels. At the first level, the text is normalized and tokenized into words, which are then morphologically analyzed. The morphological analysis serves as an input module that reduces the size of the required tag lexicon by segmenting Arabic words into their prefixes, stems, and suffixes. This is very important because Arabic is a derivational language. For this purpose, an appropriate tagging system is proposed that represents the main Arabic parts of speech hierarchically, allowing easy expansion whenever needed. At the second level, a statistical model based on the internal structure of the Arabic sentence is used to recognize the morphological characteristics of the words. Using the internal linguistic structure of the Arabic sentence allows us to identify logical sequences of words, and consequently their corresponding tags. Since the probability of a given word (or its tag) occurring depends on the words preceding it in a given context, the HMM is the most suitable statistical model for keeping track of this history. A linguistic study is conducted to determine the Arabic sentence structure by identifying the main forms of both nominal and verbal sentences. An HMM is then used to represent this structure. Each state of the HMM represents a possible tag in the lexicon, and the transitions between states (tags) are governed by the syntax of the sentence. Transition probabilities are calculated using a smoothed trigram model, and special processing is applied to unknown words to determine their lexical probabilities. A corpus of old texts extracted from books of the third century Hijri was created. Part of it was manually tagged and used to train and test the tagger.
Performance evaluation has shown an accuracy of 96%. |
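The abstract mentions transition probabilities computed with a smoothed trigram but does not say which smoothing method is used. As a rough illustration only, the sketch below assumes linear interpolation of unigram, bigram, and trigram tag estimates with fixed weights. The function names, the `lambdas` weights, the `<s>` padding symbol, and the tag names (`VERB`, `NOUN`, `ADJ`, `PREP`) are all hypothetical and are not taken from the paper or its hierarchical tagset.

```python
from collections import Counter


def count_ngrams(tag_sequences):
    """Collect tag n-gram counts and the counts of the contexts they follow."""
    counts = {
        "uni": Counter(), "bi": Counter(), "tri": Counter(),
        "ctx1": Counter(), "ctx2": Counter(), "total": 0,
    }
    for tags in tag_sequences:
        # Pad with two start symbols so the first tag has a trigram context.
        padded = ["<s>", "<s>"] + list(tags)
        for i in range(2, len(padded)):
            t2, t1, t = padded[i - 2], padded[i - 1], padded[i]
            counts["uni"][t] += 1
            counts["bi"][(t1, t)] += 1
            counts["tri"][(t2, t1, t)] += 1
            counts["ctx1"][t1] += 1
            counts["ctx2"][(t2, t1)] += 1
            counts["total"] += 1
    return counts


def transition_prob(counts, t2, t1, t, lambdas=(0.1, 0.3, 0.6)):
    """Interpolated estimate of P(t | t2, t1).

    The weights in `lambdas` are illustrative assumptions; in practice
    they would be tuned on held-out data.
    """
    l1, l2, l3 = lambdas

    def ratio(num, den):
        # Unseen contexts contribute zero instead of dividing by zero.
        return num / den if den else 0.0

    p_uni = ratio(counts["uni"][t], counts["total"])
    p_bi = ratio(counts["bi"][(t1, t)], counts["ctx1"][t1])
    p_tri = ratio(counts["tri"][(t2, t1, t)], counts["ctx2"][(t2, t1)])
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


# Toy tag sequences standing in for a manually tagged corpus.
toy_corpus = [
    ["VERB", "NOUN", "NOUN"],
    ["NOUN", "ADJ"],
    ["VERB", "NOUN", "PREP", "NOUN"],
]
toy_counts = count_ngrams(toy_corpus)
p_verb_first = transition_prob(toy_counts, "<s>", "<s>", "VERB")
p_adj_first = transition_prob(toy_counts, "<s>", "<s>", "ADJ")
```

In this toy data no sentence starts with `ADJ`, yet `p_adj_first` stays above zero because the unigram term backs up the unseen trigram. That is the effect smoothing is meant to have on transitions never observed in the training portion of a corpus.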
Topics |
Taggers and Parsers |
Full paper |
Arabic Part-Of-Speech Tagging using the Sentence Structure |
Bibtex |
@InProceedings{ELHADJ09.5,
author = {Yahya Ould Mohamed El Hadj and Imad Abdulrahman Al-Sughayeir and Abdullah Mahdi Al-Ansari},
title = {Arabic Part-Of-Speech Tagging using the Sentence Structure},
booktitle = {Proceedings of the Second International Conference on Arabic Language Resources and Tools},
year = {2009},
month = {April},
date = {22-23},
address = {Cairo, Egypt},
editor = {Khalid Choukri and Bente Maegaard},
publisher = {The MEDAR Consortium},
isbn = {2-9517408-5-9},
language = {english}
} |