Abstract

In recent years, rapid developments in genetics have generated a huge amount of biological data. Microarray gene expression data is an important instance: it is high-dimensional, with a large number of genes measured on a small number of samples. Applying machine learning techniques to knowledge discovery in such data has therefore become a rich area of research. The mining phase is usually divided into two steps: gene selection (feature reduction) and classification.

Gene selection is the process of finding the genes most strongly related to a particular class. It reduces not only the dimensionality but also the risk that irrelevant genes degrade classification. Many machine learning approaches are used for feature reduction; this study focuses on the t-test and class separability.

Classification, in turn, is an important data-mining problem with a wide range of applications. It concerns learning a model that assigns data to predetermined categories. It is applied to discriminate between diseases, to predict outcomes from gene expression patterns, and perhaps even to identify the best treatment for a given genetic signature. Many machine learning approaches are used for classification; this study focuses on the support vector machine and k-nearest neighbor.

The Support Vector Machine (SVM) plays a very important role in data-mining classification problems. The structure of an SVM depends on its kernel function; the most commonly used kernels are linear and polynomial. A binary SVM is not sufficient when the data set contains more than two classes. To solve a multi-class problem, it is converted into a number of binary classification problems. There are two usual approaches: the "one against all" scheme and the "one against one" scheme.
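The two multi-class schemes named above can be illustrated with a minimal sketch. It only shows how a labelled data set is split into binary problems, not how the SVMs themselves are trained. The function names and the toy labels are hypothetical and do not come from the study.

```python
from itertools import combinations

def one_against_all(labels):
    """One binary relabelling per class: 1 for that class, 0 for all others."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}

def one_against_one(labels):
    """One binary problem per pair of classes.

    Each entry maps (a, b) to the indices of the samples belonging to
    either class and their binary targets (1 for class a, 0 for class b).
    """
    classes = sorted(set(labels))
    problems = {}
    for a, b in combinations(classes, 2):
        idx = [i for i, y in enumerate(labels) if y in (a, b)]
        problems[(a, b)] = (idx, [1 if labels[i] == a else 0 for i in idx])
    return problems

# Hypothetical labels for six samples drawn from four classes.
labels = ["A", "B", "C", "D", "A", "C"]
ova = one_against_all(labels)   # k binary problems
ovo = one_against_one(labels)   # k * (k - 1) / 2 binary problems
print(len(ova), len(ovo))
```

With k classes, "one against all" trains k classifiers on all samples, while "one against one" trains k(k-1)/2 classifiers, each on the samples of two classes only.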
K-Nearest Neighbor (KNN), for its part, shows outstanding performance in many cases of microarray gene expression classification. Using KNN requires three key elements: (1) a set of training data, (2) a label for each training entry identifying its class, and (3) the value of K, the number of nearest neighbors to consult.

This study proposes a new hybrid reduction approach to improve cancer classification accuracy. It uses two gene selection techniques to confirm the most informative genes and to discard irrelevant genes that harm classification accuracy. Specifically, the study applies two machine learning (ML) gene ranking techniques, the t-test and Class Separability (CS), and two ML classifiers, K-nearest neighbor (KNN) and support vector machine (SVM), to explore and analyze the mining of microarray gene expression profiles. Based on these analyses, a hybrid ML reduction approach is proposed to enhance classification accuracy. The ML approaches were tested and validated on four public microarray data sets: Lymphoma, Leukemia, Small Round Blue Cell Tumors (SRBCT) and Lung Cancer.

The experimental results show that the hybrid system achieves better classification accuracy than the SVM and KNN techniques alone. In addition, selecting genes from the whole data set is better than selecting them from the training data only. However, excluding the test samples from the classifier-building process makes the performance comparison more accurate and provides a validation of the system.
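As an illustration of how gene ranking and KNN fit together, the following is a minimal sketch rather than the study's method. It ranks genes on the training samples only and then classifies one held-out sample. Several details are assumptions, since the abstract does not specify them: the Welch form of the t-statistic, the Euclidean distance, the value of K, and all names and toy expression values.

```python
import math
from collections import Counter

def t_score(values, labels, positive):
    """Absolute Welch t-statistic of one gene: `positive` class vs. the rest."""
    a = [v for v, y in zip(values, labels) if y == positive]
    b = [v for v, y in zip(values, labels) if y != positive]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    denom = math.sqrt(var_a / len(a) + var_b / len(b))
    return abs(mean_a - mean_b) / denom if denom > 0 else 0.0

def top_genes(X, labels, positive, n_genes):
    """Indices of the n_genes columns (genes) with the largest t-score."""
    n_features = len(X[0])
    scores = [t_score([row[j] for row in X], labels, positive)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:n_genes]

def knn_predict(X_train, y_train, x, k, genes):
    """Majority vote among the k nearest training samples on the selected genes."""
    dists = sorted(
        (math.dist([row[j] for j in genes], [x[j] for j in genes]), y)
        for row, y in zip(X_train, y_train)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data (invented): rows are samples, columns are genes.
X_train = [
    [5.0, 1.0, 2.0], [5.2, 3.0, 2.1], [4.8, 2.0, 1.9],   # class "A"
    [1.0, 2.5, 2.0], [1.2, 1.5, 2.2], [0.8, 2.0, 1.8],   # class "B"
]
y_train = ["A", "A", "A", "B", "B", "B"]
x_test = [4.9, 2.2, 2.0]

genes = top_genes(X_train, y_train, positive="A", n_genes=1)  # training data only
prediction = knn_predict(X_train, y_train, x_test, k=3, genes=genes)
print(genes, prediction)
```

Because the ranking uses the training samples alone, the held-out sample plays no part in building the classifier, which matches the validation setting described in the final sentence of the abstract.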