Title |
Processing Large Arabic Text Corpora: Preliminary Analysis and Results |
Authors |
Fahad Alotaiby, Ibrahim Alkharashi and Salah Foda |
Abstract |
Important research areas such as Automatic Speech Recognition (ASR), Optical Character Recognition (OCR) and Information Retrieval (IR) depend heavily on a good statistical representation of the language being processed. A more precise representation leads to more accurate systems. Arabic, however, is a considerably richer and more complex language than English. This raises the need to study, on a large scale, the key statistics of the Arabic language and the statistical differences between Arabic and English. For the purpose of this study, two large and comprehensive corpora are used: "Arabic Gigaword Third Edition" and "English Gigaword Third Edition". In this paper, we use these two corpora to perform a preliminary analysis and present the results for Arabic alongside those for English. The aim of the paper is to present statistics on token and paragraph length distributions, punctuation marks and unigrams for Arabic and English. Preliminary processing considerations and issues are discussed throughout the paper. |
Topics |
Extraction and acquisition of knowledge (e.g. terms, lexical information, language modelling) from LRs, Guidelines, standards, specifications, models and best practices for Arabic LRs |
Full paper |
Processing Large Arabic Text Corpora: Preliminary Analysis and Results |
Bibtex |
@InProceedings{ALOTAIBY09.52,
author = {Fahad Alotaiby and Ibrahim Alkharashi and Salah Foda},
title = {Processing Large Arabic Text Corpora: Preliminary Analysis and Results},
booktitle = {Proceedings of the Second International Conference on Arabic Language Resources and Tools},
year = {2009},
month = {April},
date = {22-23},
address = {Cairo, Egypt},
editor = {Khalid Choukri and Bente Maegaard},
publisher = {The MEDAR Consortium},
isbn = {2-9517408-5-9},
language = {english}
} |