Author: Fashwan, Amany Mohammed Saad Hassan./ Title: Automatic Diacritization of Modern Standard Arabic texts :


Title

Automatic Diacritization of Modern Standard Arabic texts :

Author

Fashwan, Amany Mohammed Saad Hassan.

Preparation committee

Researcher / Amany Mohammed Saad Hassan Fashwan

Supervisor / Sameh Saad Abu El-Magd Al-Ansary (سامح سعد أبو المجد الأنصارى)

Subject

Linguistics. Phonetics and phonology.

Publication date

2016.

Number of pages

148 p. :

Language

English

Degree

Master's

Specialization

Acoustics and Ultrasonics

Approval date

8/11/2016

Place of approval

Alexandria University - Faculty of Arts - Phonetics and Linguistics

Index

Only 14 of 212 pages are available for public view.
Abstract

The main objective of this thesis is to propose a system for the automatic diacritization of Modern Standard Arabic text: the system processes the input text and produces the corresponding fully diacritized text, in which words are diacritized both morphologically and syntactically. To this end, the researcher reviews the importance of diacritization in Modern Standard Arabic, the morphological and syntactic approaches and analyzers used to diacritize Arabic texts, and previous attempts at building diacritization systems. The proposed system relies on two processing levels, morphological and syntactic, together with a set of morpho-phonological rules. The morphological level adopts a hybrid approach that combines rule-based and machine-learning techniques; the syntactic level adopts an approach that simulates shallow parsing. The researcher uses an annotated Arabic corpus of 550,000 words, the International Corpus of Arabic (ICA), for extracting Arabic linguistic rules and for training, validating, and testing the system. The corpus draws on more than one source and covers different genres; it needs to be as balanced and diverse as possible for the system's results to be reliable. Rules relating to morphology, definiteness, case ending, and morpho-phonology have been extracted, formalized in a generalized format, and implemented in the system. Several machine learning techniques have been applied in order to select the best one. Out-of-vocabulary (OOV) words are also handled. The Buckwalter Arabic Morphological Analyzer is selected as the model for morphological analysis. A preprocessing and editing stage is required before the input text is analyzed.
The system's output and limitations are reviewed, and Word Error Rate (WER) and Diacritization Error Rate (DER) are used to evaluate it. The system achieves a WER of 4.56% for morphological diacritization and 9.71% for syntactic diacritization. These results are compared with those of the best-known systems in the literature. Finally, the conclusion and future work are presented.
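
The abstract does not spell out how WER and DER are computed, and definitions vary across the diacritization literature. The following Python sketch shows one common reading, offered as an illustration only and not as the thesis's own evaluation code: WER is taken as the share of words with at least one wrongly diacritized letter, and DER as the share of letters whose diacritic marks differ from the reference. The function names `split_marks` and `error_rates` are hypothetical, and the sketch assumes the reference and the system output contain the same words and base letters and differ only in their diacritics.

```python
# Arabic diacritic marks U+064B..U+0652 (tanween, short vowels, shadda, sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_marks(word):
    """Return a list of (letter, marks) pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            # Attach the mark to the most recent base letter.
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)
        elif ch not in DIACRITICS:
            pairs.append((ch, ""))
    return pairs


def error_rates(reference, hypothesis):
    """Return (WER, DER) under the assumed definitions described above."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("texts must be non-empty and have the same word count")
    word_errors = letter_errors = letters = 0
    for ref, hyp in zip(ref_words, hyp_words):
        ref_pairs, hyp_pairs = split_marks(ref), split_marks(hyp)
        if [l for l, _ in ref_pairs] != [l for l, _ in hyp_pairs]:
            raise ValueError("words differ in their base letters")
        # Compare marks as sorted lists so shadda+vowel order does not matter.
        wrong = sum(sorted(r) != sorted(h)
                    for (_, r), (_, h) in zip(ref_pairs, hyp_pairs))
        letter_errors += wrong
        letters += len(ref_pairs)
        word_errors += wrong > 0
    return word_errors / len(ref_words), letter_errors / letters


# Two words, eight letters; only the case ending of the second word differs
# (damma in the reference, fatha in the hypothesis).
ref = "\u0643\u064e\u062a\u064e\u0628\u064e \u0627\u0644\u0648\u064e\u0644\u064e\u062f\u064f"
hyp = "\u0643\u064e\u062a\u064e\u0628\u064e \u0627\u0644\u0648\u064e\u0644\u064e\u062f\u064e"
print(error_rates(ref, hyp))  # (0.5, 0.125)
```

In this example a single wrong case ending makes one of two words wrong (WER 50%) but only one of eight letters (DER 12.5%), which illustrates why the two metrics are usually reported together.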