Title
Web Documents Classification Using Text, Anchor, Title and Metadata Information/
Publisher
Mohamed Fathi;
Author
Fathi,Mohamed.
Preparation committee
Researcher / Mohamed Fathi
Supervisor / Magdy Hussein Nagi
magdy.nagi@ieee.org
Supervisor / Noha Adly
noha.adly@gmail.com
Examiner / Ali Ali Fahmy
Examiner / Nagwa Mostafa El-Makky
nagwamakky@gmail.com
Subject
Web sites. Design.
Publication date
2004 .
Number of pages
88 P.:
Language
English
Degree
Master's
Specialization
Engineering (miscellaneous)
Date of approval
1/9/2004
Place of approval
Alexandria University - Faculty of Engineering - Computers and Systems
Index
Only 14 pages are available for public view


Abstract

The explosive growth in the number of documents available on the World Wide Web makes it increasingly difficult to locate the required information in a short time. Furthermore, most of the available web documents are unstructured hypertext documents, which makes processing or indexing them a difficult task.
Text classifiers are used to predict the class or topic that an unseen document might belong to by consulting a knowledge base kept by the classifier. Using text classifiers in web domains can benefit search engines, web crawlers, junk mail detectors and news extractors.
Traditional text classification techniques might be inappropriate to apply in the web context. These techniques do not take into consideration the rich information sources that usually exist in hypertext domains, such as the anchor, title, metadata and neighboring documents information. Many web classification systems have attempted to use these sources independently to add more information to the classifier. However, the integration of the text, anchor, title and metadata in the same web classification system has been, to our knowledge, limited.
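As an illustration of the four information sources named above, the following is a minimal sketch that separates the words of one HTML page into text, anchor, title and metadata groups using Python's standard `html.parser` module. The class name `SourceExtractor`, the helper `tokens` and the choice of the `keywords` and `description` meta tags are assumptions made for this example, not details taken from the thesis. In particular, the anchor words here are simply those inside the page's own `<a>` elements; the thesis may define anchor information differently (for example, from links that point to the page).

```python
import re
from html.parser import HTMLParser


def tokens(s):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", (s or "").lower())


class SourceExtractor(HTMLParser):
    """Hypothetical helper: groups a page's words by information source."""

    def __init__(self):
        super().__init__()
        self.sources = {"text": [], "anchor": [], "title": [], "meta": []}
        self._open = []  # stack of currently open <a> / <title> tags

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "title"):
            self._open.append(tag)
        elif tag == "meta":
            attrs = dict(attrs)
            # Assumption: only keyword and description metadata are used.
            if (attrs.get("name") or "").lower() in ("keywords", "description"):
                self.sources["meta"] += tokens(attrs.get("content"))

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Route words to the source of the innermost open <a> or <title>.
        current = self._open[-1] if self._open else None
        key = {"a": "anchor", "title": "title"}.get(current, "text")
        self.sources[key] += tokens(data)


def extract_sources(html):
    """Return a dict mapping each source name to its list of words."""
    parser = SourceExtractor()
    parser.feed(html)
    return parser.sources
```

Keeping the four word lists separate lets a later step tag each feature with its source (for instance `title:jazz` versus `text:jazz`), so that the same word can carry different weight depending on where it appears.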
In addition, many existing web classifiers are based on statistical methods that treat words as independent units, ignoring the possible dependencies that might exist between them. It is believed that using these dependencies to classify unseen documents could be more effective than using isolated or independent words. Also, exploiting such dependencies might reduce the huge text feature space to a smaller space of term groups.
In this research, the Association Rules Classifier (ARC) is proposed as a novel classification framework that captures different hypertext information sources, namely the text, anchor, title and metadata information. The ARC uses this information to build a comprehensive knowledge base that includes features derived from the four sources in the form of strong association rules. The proposed approach takes into account the dependencies that exist between the class features instead of treating features as independent units, as is the case with many other classifiers.
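To make the idea of classifying with strong association rules concrete, here is a minimal sketch under stated assumptions; it is not the ARC algorithm itself, whose details are not given in this abstract. Each document is assumed to be a set of source-tagged features such as `title:jazz`, a rule has the form "feature group implies class", and a rule is kept only when its support and confidence pass thresholds. The function names, the thresholds, the brute-force enumeration of groups of at most two features (rather than an Apriori-style search) and the confidence-sum scoring are all choices made for this example.

```python
from collections import Counter
from itertools import combinations


def mine_rules(docs, min_support=0.3, min_confidence=0.7, max_size=2):
    """docs: list of (feature_set, label) pairs.

    Returns {feature_group: (label, confidence)} for rules that pass
    both thresholds. Brute-force enumeration, for illustration only.
    """
    n = len(docs)
    group_count = Counter()  # documents containing the feature group
    joint_count = Counter()  # documents containing the group with a label
    for features, label in docs:
        items = sorted(features)
        for size in range(1, max_size + 1):
            for group in combinations(items, size):
                group_count[group] += 1
                joint_count[group, label] += 1

    rules = {}
    for (group, label), joint in joint_count.items():
        support = joint / n
        confidence = joint / group_count[group]
        if support >= min_support and confidence >= min_confidence:
            best = rules.get(group)
            if best is None or confidence > best[1]:
                rules[group] = (label, confidence)
    return rules


def classify(rules, features, default=None):
    """Score each class by the summed confidence of its matching rules."""
    scores = Counter()
    for group, (label, confidence) in rules.items():
        if features.issuperset(group):
            scores[label] += confidence
    return scores.most_common(1)[0][0] if scores else default


# Toy training set with hypothetical source-tagged features.
training = [
    ({"title:jazz", "text:guitar"}, "music"),
    ({"title:jazz", "text:guitar", "meta:music"}, "music"),
    ({"title:python", "text:code"}, "programming"),
    ({"title:python", "text:code", "anchor:docs"}, "programming"),
]
rules = mine_rules(training)
predicted = classify(rules, {"title:jazz", "text:guitar", "anchor:new"})
```

Rules over pairs such as `("text:guitar", "title:jazz")` capture a dependency between features from different sources, which a classifier that treats words as independent units would not represent directly.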
The performance of the ARC is compared with three other well-known full-text classifiers: Bernoulli Bayes, Multinomial Bayes and K-Nearest Neighbors. The ARC showed an improvement in classification accuracy reaching 65% for datasets with large vocabularies. For datasets with small vocabularies, the ARC performed similarly to the best of the three classifiers. When compared to other techniques that exploit anchor, title and metadata information for classification, the ARC enhanced the classification accuracy by about 22%.