Summary of the paper

Title A Hybrid System for Automatic Arabic Diacritization
Authors Mohsen Rashwan, Mohammad Al-Badrashiny, Mohamed Attia and Sherif Abdou
Abstract This paper introduces a two-layer stochastic system for automatically diacritizing raw Arabic text, which is known to be a difficult problem. The first layer decides on the most likely diacritics by choosing the sequence of full-form Arabic word diacritizations with maximum marginal probability via long A* lattice search and m-gram probability estimation. When full-form words are out-of-vocabulary, the system falls back to the second layer. This second layer factorizes each Arabic word into its possible morphological constituents (prefix, root, pattern and suffix), then uses m-gram probability estimation and A* lattice search to select among the possible factorizations and obtain the most likely diacritization sequence. While the second layer has the advantage of excellent coverage of the Arabic language, the first layer offers better disambiguation for the same size of training corpus, especially for inferring syntactic (case-based) diacritics. The presented hybrid system combines the advantages of both layers. After a background on Arabic morphology and PoS tagging, the paper details the workings of both layers and the architecture of the hybrid system.
Topics National and international activities and projects on Arabic, Industrial use of LRs
Full paper A Hybrid System for Automatic Arabic Diacritization
Bibtex @InProceedings{RASHWAN09.36,
  author = {Mohsen Rashwan and Mohammad Al-Badrashiny and Mohamed Attia and Sherif Abdou},
  title = {A Hybrid System for Automatic Arabic Diacritization},
  booktitle = {Proceedings of the Second International Conference on Arabic Language Resources and Tools},
  year = {2009},
  month = {April},
  date = {22-23},
  address = {Cairo, Egypt},
  editor = {Khalid Choukri and Bente Maegaard},
  publisher = {The MEDAR Consortium},
  isbn = {2-9517408-5-9},
  language = {english}
  }
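
The two-layer fallback described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy lexicon and bigram table (in Latin transliteration), the names `FULL_FORMS`, `PREFIXES`, `BIGRAM`, `UNSEEN`, `candidates`, `cost` and `diacritize`, the reduction of the second layer to a prefix-plus-known-stem split (the paper factorizes into prefix, root, pattern and suffix), the use of bigrams for the m-gram model, and the zero heuristic, which makes the A* search behave like uniform-cost search.

```python
import heapq
import math

# Hypothetical toy data in Latin transliteration; not from the paper.
FULL_FORMS = {
    "ktb": ["kataba", "kutub"],
    "Alwld": ["Alwaladu", "Alwalada"],
}
PREFIXES = {"w": "wa", "f": "fa"}  # undiacritized prefix -> diacritized prefix

# Toy bigram probabilities P(current | previous); "<s>" marks sentence start.
BIGRAM = {
    ("<s>", "kataba"): 0.6,
    ("<s>", "wakataba"): 0.5,
    ("kataba", "Alwaladu"): 0.7,
    ("wakataba", "Alwaladu"): 0.7,
    ("kutub", "Alwalada"): 0.2,
}
UNSEEN = 0.01  # crude floor for unseen bigrams, standing in for real smoothing


def candidates(word):
    """Layer 1: full-form lookup. Layer 2: prefix + known stem, for OOV words."""
    if word in FULL_FORMS:
        return list(FULL_FORMS[word])
    found = []
    for raw_prefix, diac_prefix in PREFIXES.items():
        stem = word[len(raw_prefix):]
        if word.startswith(raw_prefix) and stem in FULL_FORMS:
            found.extend(diac_prefix + form for form in FULL_FORMS[stem])
    return found or [word]  # leave the word undiacritized if both layers fail


def cost(prev, cur):
    """Negative log-probability of a bigram, so lower cost means more likely."""
    return -math.log(BIGRAM.get((prev, cur), UNSEEN))


def diacritize(words):
    """Best-first search over the candidate lattice (A* with a zero heuristic)."""
    lattice = [candidates(w) for w in words]
    # Queue entries: (accumulated cost, position in sentence, path so far).
    queue = [(0.0, 0, ("<s>",))]
    settled = set()
    while queue:
        total, pos, path = heapq.heappop(queue)
        if pos == len(lattice):
            return list(path[1:])  # first complete path popped is the cheapest
        state = (pos, path[-1])
        if state in settled:
            continue  # a cheaper path to this (position, word) was expanded already
        settled.add(state)
        for cand in lattice[pos]:
            step = total + cost(path[-1], cand)
            heapq.heappush(queue, (step, pos + 1, path + (cand,)))
    return []
```

With this toy data, `diacritize(["ktb", "Alwld"])` is resolved entirely by the full-form layer, while `diacritize(["wktb", "Alwld"])` exercises the fallback, because `wktb` is missing from `FULL_FORMS` and is split into the prefix `w` plus the known stem `ktb`. A real system would replace the zero heuristic with an admissible estimate of the remaining path cost and the `UNSEEN` floor with proper m-gram smoothing.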
