Proceedings of the Second International Conference on Arabic Language Resources and Tools

Summary of the paper

Title	Second Generation Tools (AMIRA 2.0): Fast and Robust Tokenization, POS tagging, and Base Phrase Chunking
Authors	Mona Diab
Abstract	In this paper, we address the problem of processing Modern Standard Arabic. We present AMIRA, the second generation of a suite of Arabic processing tools and the successor to ASVMTools. The AMIRA toolkit includes a clitic tokenizer (TOK), a part-of-speech tagger (POS), and a base phrase chunker (BPC), i.e. a shallow syntactic parser. AMIRA is based on supervised learning, with no explicit dependence on modeling or knowledge of deep morphology. It uses a unified framework that casts each of the component problems as a classification task. The underlying technology employs Support Vector Machines in a sequence modeling framework using the YAMCHA toolkit. The system is very fast and robust, and it allows for a number of user-configurable settings depending on the desired disambiguation granularity. Owing to its speed and high performance, the AMIRA toolkit has been widely used in NLP applications (MT, IE, IR, NER, etc.).
Topics	Taggers and Parsers
Bibtex	@InProceedings{DIAB09.56,
  author = {Mona Diab},
  title = {Second Generation Tools (AMIRA 2.0): Fast and Robust Tokenization, POS tagging, and Base Phrase Chunking},
  booktitle = {Proceedings of the Second International Conference on Arabic Language Resources and Tools},
  year = {2009},
  month = {April},
  date = {22-23},
  address = {Cairo, Egypt},
  editor = {Khalid Choukri and Bente Maegaard},
  publisher = {The MEDAR Consortium},
  isbn = {2-9517408-5-9},
  language = {english}
}
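
The abstract describes casting each component problem as a classification task. As a rough illustration of that idea for clitic tokenization only, the sketch below turns a segmented word into per-character labels and builds a context window of features for one character. It is not AMIRA's implementation: the B/I label set, the "+" separator, the window size, the function names, and the sample word are all assumptions made for this example, and no classifier is trained here.

```python
from typing import Dict, List, Tuple


def encode(segmented: str, sep: str = "+") -> List[Tuple[str, str]]:
    """Turn a clitic-segmented word into (character, label) pairs.

    Assumed label scheme: 'B' marks the first character of a segment,
    'I' marks every later character of the same segment.
    """
    pairs = []
    for segment in segmented.split(sep):
        for i, ch in enumerate(segment):
            pairs.append((ch, "B" if i == 0 else "I"))
    return pairs


def decode(chars: str, labels: List[str], sep: str = "+") -> str:
    """Rebuild the segmented word from characters and predicted labels."""
    out = []
    for i, (ch, label) in enumerate(zip(chars, labels)):
        if label == "B" and i > 0:
            out.append(sep)  # a new segment starts here
        out.append(ch)
    return "".join(out)


def window_features(chars: str, i: int, size: int = 2) -> Dict[str, str]:
    """Features for character i: the characters in a +/- size window.

    Positions outside the word are padded with '_'.
    """
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        feats[f"c[{offset}]"] = chars[j] if 0 <= j < len(chars) else "_"
    return feats


# Hypothetical transliterated word with two proclitics and one enclitic.
pairs = encode("w+b+Hsnt+hm")
chars = "".join(ch for ch, _ in pairs)
labels = [label for _, label in pairs]
print(decode(chars, labels))
print(window_features(chars, 0, size=1))
```

In a setup like the one the abstract outlines, feature dictionaries of this kind would be passed to a sequence classifier, which predicts one label per character; decoding the predicted labels then yields the tokenized word.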