Author: Saleh, Amira Samy Abd El-Hay Badran./ Title: Error-correction in genomic data /

Title

Error-correction in genomic data /

Author

Saleh, Amira Samy Abd El-Hay Badran.

Preparation committee

Researcher / Amira Samy Abd El-Hay Badran Saleh

Supervisor / مجدي زكريا رشاد

Supervisor / ساره السيد المتولي

Examiner / حمد هاشم عبدالعزيز أحمد

Examiner / شاهنده صلاح الدين سرحان

Subject

Computer Science.

Publication date

2024.

Number of pages

127 p.

Language

English

Degree

Master's

Specialization

Computer Science

Date of approval

01/01/2024

Place of approval

Mansoura University - Faculty of Computers and Information - Department of Computer Science

Abstract

The rapid advancement of Next-Generation Sequencing (NGS) machines in speed and affordability has produced massive amounts of biological data at the expense of data quality, as errors have become more prevalent. This creates a need for error detection and filtration approaches, and shifts the task of data quality assurance from the hardware to the software pre-processing stages. In this thesis, MAC-ErrorReads is introduced as a novel Machine learning-Assisted Classifier designed for filtering Erroneous NGS Reads. MAC-ErrorReads transforms the filtration of erroneous NGS reads into a binary classification task, in which each read is labeled either ‘1’ (erroneous) or ‘0’ (correct). Five supervised machine learning algorithms, namely Naive Bayes (NB), Support Vector Machine (SVM), Random Forest (RF), Logistic Regression (LR), and eXtreme Gradient Boosting (XGBoost), were trained and tested on both simulated and real datasets from E. coli, GAGE S. aureus, H. chr14, Arabidopsis thaliana, and Metriaclima zebra. The algorithms were trained on features extracted by computing Term Frequency-Inverse Document Frequency (TF-IDF) values for the k-mers identified in the sequencing data, eliminating the need for costly pre-processing stages.

Notably, Naive Bayes exhibited strong performance. For the E. coli dataset, it achieved an accuracy of 0.91, precision of 1, recall of 0.91, and an F1-score of 0.95. For the GAGE S. aureus dataset, it reached an accuracy of 1. For the H. chr14 dataset, the NB model achieved an accuracy of 0.98, precision of 0.98, recall of 0.98, F1-score of 0.98, MCC of 0.96, and an ROC of 0.98. For the Arabidopsis thaliana dataset, it achieved an accuracy of 0.99, precision of 0.99, recall of 0.98, F1-score of 0.99, MCC of 0.98, and an ROC of 0.99. For the Metriaclima zebra dataset, it achieved an accuracy of 0.96, precision of 0.97, recall of 0.96, F1-score of 0.96, MCC of 0.93, and an ROC of 0.96.

The reads classified as correct by the MAC-ErrorReads NB model for the various datasets were compared against reads produced by benchmark error correction tools: Lighter, BFC, RECKONER, Fiona, Karect, Pollux, and CARE. The reads were aligned to their corresponding reference genomes, and the evaluation is reported in terms of alignment statistics. For S. aureus, the MAC-ErrorReads NB model classified reads accurately and surpassed most error correction tools, with a 38.69% alignment rate. For H. chr14, Lighter, Karect, CARE, Pollux, and MAC-ErrorReads showed alignment rates above 99%, BFC and RECKONER exceeded 98%, and Fiona reached 95.78%. For Arabidopsis thaliana chr1, Pollux, Karect, RECKONER, and MAC-ErrorReads demonstrated good alignment rates of 92.62%, 91.80%, 91.78%, and 90.87%, respectively. For Metriaclima zebra, Pollux achieved a high alignment rate of 91.23%, despite having the lowest number of mapped reads. MAC-ErrorReads, Karect, and RECKONER demonstrated good alignment rates of 83.76%, 83.71%, and 83.67%, respectively, while also mapping a reasonable number of reads to the reference genome.
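
The classification idea described in the abstract (k-mers extracted from reads, weighted by TF-IDF, then fed to a Naive Bayes classifier) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the k-mer size, the smoothed IDF formula, the choice of a multinomial Naive Bayes variant, and the toy reads with their labels are all assumptions made here for illustration.

```python
import math
from collections import Counter, defaultdict

def kmers(read, k):
    """Return all overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def fit_idf(reads, k):
    """Learn a smoothed IDF weight per k-mer (assumed form: ln((1+n)/(1+df)) + 1)."""
    n = len(reads)
    df = Counter(km for r in reads for km in set(kmers(r, k)))
    return {km: math.log((1 + n) / (1 + d)) + 1 for km, d in df.items()}

def tfidf(read, k, idf):
    """Sparse TF-IDF vector of one read; k-mers unseen in training are dropped."""
    counts = Counter(km for km in kmers(read, k) if km in idf)
    total = sum(counts.values())
    return {km: (c / total) * idf[km] for km, c in counts.items()}

def train_nb(vectors, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with Laplace smoothing (alpha)."""
    vocab = {km for v in vectors for km in v}
    model = {}
    for cls in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == cls]
        weight = defaultdict(float)
        for v in rows:
            for km, w in v.items():
                weight[km] += w
        total = sum(weight.values()) + alpha * len(vocab)
        log_prior = math.log(len(rows) / len(labels))
        log_lik = {km: math.log((weight[km] + alpha) / total) for km in vocab}
        model[cls] = (log_prior, log_lik)
    return model

def predict_nb(model, vector):
    """Return the label with the highest log-posterior score."""
    def score(cls):
        log_prior, log_lik = model[cls]
        return log_prior + sum(w * log_lik[km] for km, w in vector.items() if km in log_lik)
    return max(model, key=score)

# Hypothetical toy data: label 0 = correct read, label 1 = erroneous read.
K = 3
train_reads = ["ACGTACGT", "CGTACGTA", "ACGTTTTT", "TTTTACGT"]
labels = [0, 0, 1, 1]

idf = fit_idf(train_reads, K)
model = train_nb([tfidf(r, K, idf) for r in train_reads], labels)

for read in ["GTACGTAC", "GTTTTTAC"]:
    print(read, predict_nb(model, tfidf(read, K, idf)))
```

On this toy data the first query read shares its k-mers mainly with the reads labeled correct, and the second mainly with the reads labeled erroneous, so the sketch assigns them to classes 0 and 1 respectively. Real read sets would need a larger k, far more training reads, and a held-out test split to produce metrics such as those reported in the abstract.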