Title
Automatic multi-document summarization involving Arabic language /
Author
Elghannam, Fatma Rashad.
Committee
Researcher / فاطمة رشاد الغنام
Supervisor / طارق الششتاوي
Supervisor / محمد شعراوي
Supervisor / منى حافظ محمود
Examiner / منى فاطمة محمد مرسي
Subject
Automatic abstracting. Electronic information resources Abstracting and indexing. Information storage and retrieval systems.
Publication date
2014.
Number of pages
162 p.
Language
English
Degree
Doctorate
Specialization
Electrical and Electronic Engineering
Date of approval
1/1/2014
Place of approval
Benha University - Faculty of Engineering at Shoubra - Electrical Engineering
Index
Only 14 pages are available for public view

Abstract

The aim of this thesis is to develop new techniques for automatic document summarization, with a particular focus on Arabic multi-document summarization. The presented work uses keyphrases as attributes to evaluate the importance of sentences and documents. To improve keyphrase extraction and the summarization process, our techniques use natural language analysis based on representing words in their lemma forms instead of their original forms. We implemented four modules to carry out the summarization process. First, an Arabic lemmatizer generates the lemma form and extracts part-of-speech (POS) tags and the relevant morpho-syntactic features that support keyphrase extraction. Second, the lemma-based Arabic keyphrase extractor (LBAKE) identifies the important keyphrases based on statistical and linguistic features. Third, the single-document summarizer extracts a summary from a single Arabic document; we propose four different summarization heuristics and show through experiments that different keyphrase-based scoring schemes can direct the proposed sentence extractor towards one or more of the summarization goals. Fourth, the multi-document summarizer summarizes multiple documents, using a centroid cluster scoring scheme to recognize the importance of a particular topic. We introduce two techniques for extracting summary sentences from multiple documents. The first, Sen-Rich, prefers to extract the sentences of maximum richness, while the second, Doc-Rich, seeks the most important document as a centroid document from which to extract the summary sentences. In both techniques, keyphrases are used to assess sentences and documents. We also propose an addition to the ROUGE test based on representing words in their lemma forms instead of their original forms.
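The idea of scoring summaries over lemmas rather than original words can be illustrated with a minimal sketch. The abstract does not specify how the ROUGE addition is computed, so the unigram recall below, the function name rouge1_recall, and the toy lemma table (English, for readability, standing in for the Arabic lemmatizer) are all assumptions made for illustration only.

```python
from collections import Counter


def rouge1_recall(candidate_tokens, reference_tokens, lemmatize=lambda w: w):
    """Clipped unigram recall (ROUGE-1 style), computed after mapping
    every token through `lemmatize`. With the default identity function
    this compares original word forms."""
    cand = Counter(lemmatize(w) for w in candidate_tokens)
    ref = Counter(lemmatize(w) for w in reference_tokens)
    if not ref:
        return 0.0
    # Each reference unit can be matched at most as often as it occurs
    # in the candidate summary.
    overlap = sum(min(count, cand[unit]) for unit, count in ref.items())
    return overlap / sum(ref.values())


# Hypothetical lemma table: a stand-in for a real lemmatizer.
TOY_LEMMAS = {"summarized": "summarize", "summarizes": "summarize", "books": "book"}


def toy_lemmatize(word):
    word = word.lower()
    return TOY_LEMMAS.get(word, word)


reference = ["the", "system", "summarized", "books"]
candidate = ["the", "system", "summarizes", "book"]

surface_recall = rouge1_recall(candidate, reference)
lemma_recall = rouge1_recall(candidate, reference, toy_lemmatize)
print(surface_recall, lemma_recall)
```

In this toy case the two summaries differ only in inflection, so matching on original words credits half of the reference units while matching on lemmas credits all of them. That is the kind of effect a lemma-based comparison is meant to capture in a morphologically rich language.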
We conducted experiments to compare the accuracy of our system against other systems and to test the new summarization techniques on different document types. The results show that the Sen-Rich technique tends to be useful when the documents deal with a single event with a limited number of topics and a highly condensed summary is required. The algorithm succeeded in capturing sentences that carry the most important topics of the cluster. However, for the task of summarizing multiple documents covering multiple topics, the Doc-Rich technique tends to be more appropriate, giving better coverage and a more cohesive, readable summary.
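The contrast between the two selection strategies can be sketched as follows. This is a simplified illustration, not the scoring used in the thesis: it assumes each sentence is a string of lemmas, that keyphrases carry precomputed weights, and that "richness" is the sum of the weights of the keyphrases a sentence contains. The function names and the sample data are hypothetical.

```python
def sentence_score(sentence, weights):
    """Sum the weights of the keyphrases that occur in the sentence
    (whole-token match via space padding)."""
    padded = f" {sentence} "
    return sum(w for kp, w in weights.items() if f" {kp} " in padded)


def sen_rich(docs, weights, n):
    """Pick the n richest sentences across all documents, then restore
    document and sentence order."""
    scored = [(sentence_score(s, weights), d, i, s)
              for d, doc in enumerate(docs) for i, s in enumerate(doc)]
    top = sorted(scored, key=lambda t: -t[0])[:n]
    return [s for _, _, _, s in sorted(top, key=lambda t: (t[1], t[2]))]


def doc_rich(docs, weights, n):
    """Pick the document with the highest total keyphrase score as the
    centroid document, then take its n richest sentences in order."""
    centroid = max(docs, key=lambda doc: sum(sentence_score(s, weights) for s in doc))
    ranked = sorted(enumerate(centroid),
                    key=lambda p: -sentence_score(p[1], weights))[:n]
    return [s for _, s in sorted(ranked)]


# Hypothetical keyphrase weights and lemmatized documents.
weights = {"flood": 3.0, "rescue team": 2.0, "damage": 1.0}
docs = [
    ["flood hit the city", "rescue team arrive quickly", "weather be cold"],
    ["flood cause damage and rescue team respond", "official visit the area"],
]

sen_rich_summary = sen_rich(docs, weights, 2)
doc_rich_summary = doc_rich(docs, weights, 2)
print(sen_rich_summary)
print(doc_rich_summary)
```

In this sketch the Sen-Rich style selection draws the highest-scoring sentences from different documents, while the Doc-Rich style selection keeps all summary sentences within one document, which is one plausible reason a single-source extract may read more cohesively.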