Title |
Bootstrapping Tagged Islamic Corpora |
Authors |
Mahmoud Shokrollahi-Far, Behrooz Minaei, Issa Barzegar, Hadi Hossein-Zade, Mojde Ghasdi and Seyed-salman Hoseini |
Abstract |
Among tagged language resources for Arabic, Modern Standard Arabic is densely covered, whereas tagged corpora for Classical Arabic remain very sparse. Moreover, such corpora are normally developed with software that has serious shortcomings. This paper elaborates on the approach used to tag the Islamic corpora at Noorsoft, Qom, Iran, using the Mobin Expert System of Mahmoud Shokrollahi-Far at University College of Nabi-Akram, Tabriz, Iran. The system relies on the traditional grammar of Arabic, which recognizes just three parts of speech; after tokenizing the phrasal words, it bootstraps the grammatical tags in the corpora from the vocalism of the vowelized texts. This allows the system to incorporate a tagset that is as morpho-syntactically diverse as possible. The corpora prepared for tagging span a variety of Islamic genres and consist of 1G of phrasal words; the tagged output is in XML format. |
Topics |
National and international activities and projects on Arabic, Guidelines, standards, specifications, models and best practices for Arabic LRs, Taggers and Parsers |
Full paper |
Bootstrapping Tagged Islamic Corpora |
Bibtex |
@InProceedings{SHOKROLLAHIFAR09.35,
author = {Mahmoud Shokrollahi-Far and Behrooz Minaei and Issa Barzegar and Hadi Hossein-Zade and Mojde Ghasdi and Seyed-salman Hoseini},
title = {Bootstrapping Tagged Islamic Corpora},
booktitle = {Proceedings of the Second International Conference on Arabic Language Resources and Tools},
year = {2009},
month = {April},
date = {22-23},
address = {Cairo, Egypt},
editor = {Khalid Choukri and Bente Maegaard},
publisher = {The MEDAR Consortium},
isbn = {2-9517408-5-9},
language = {english}
} |