Only 14 pages are available for public view.
Abstract

Data pre-processing is one of the most crucial stages in data analysis. While developing novel machine learning techniques has been the main focus of research in data science, data pre-processing has received less attention. Discretizing continuous attributes is an essential pre-processing step in data mining. Multiple discretization techniques with different characteristics have been proposed. However, a clear pathway to guide the choice of discretization technique for different types of datasets is lacking. In this thesis, a taxonomy of discretization techniques is proposed based on the existence of class information and the relationship between attributes in the analyzed dataset. The importance of discretization as a pre-processing step is also examined to demonstrate how it helps achieve better classification performance than using continuous attributes. Multiple parametric and non-parametric discretization methods, in conjunction with a number of machine learning classifiers, were applied to the problem of predicting Intensive Care Unit (ICU) mortality, and their performance was evaluated. The results demonstrate the significance of discretizing the input attributes in this problem: using discretized data achieved a classification accuracy and F1 score of 89.19% and 0.38, respectively, while using continuous attributes achieved 86.19% and 0.08, respectively. These results demonstrate that discretizing continuous attributes prior to applying machine learning models could result in significant performance enhancement.
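To make the notion of discretization concrete, the following minimal sketch shows two common unsupervised schemes (equal-width and equal-frequency binning), which use no class information. It is an illustration only: the function names and the heart-rate readings are hypothetical, and these are not necessarily the methods evaluated in the thesis.

```python
from bisect import bisect_right

def equal_width_edges(values, n_bins):
    """Interior cut points splitting [min, max] into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def equal_frequency_edges(values, n_bins):
    """Interior cut points chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]

def discretize(values, edges):
    """Map each continuous value to the index of the interval it falls in."""
    return [bisect_right(edges, v) for v in values]

# Hypothetical heart-rate readings (illustrative only, not data from the thesis).
heart_rate = [52, 60, 61, 75, 78, 80, 95, 110, 140, 172]

width_bins = discretize(heart_rate, equal_width_edges(heart_rate, 3))
freq_bins = discretize(heart_rate, equal_frequency_edges(heart_rate, 3))

print(width_bins)  # bin index per reading under equal-width binning
print(freq_bins)   # bin index per reading under equal-frequency binning
```

On skewed data such as this, equal-width binning places most readings in the lowest bin, whereas equal-frequency binning spreads them more evenly. The choice of scheme can therefore change what a downstream classifier sees.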