A new classifier based on combination of genetic programming and support vector machine in solving imbalanced classification problem

Bibliographic Details
Main Author: Mohd Pozi, Muhammad Syafiq
Format: Thesis
Language: English
Published: 2016
Online Access: http://psasir.upm.edu.my/id/eprint/69313/1/FSKTM%202016%204%20IR.pdf
Description
Summary: In supervised learning, a data set is class imbalanced when its examples are not distributed uniformly among the classes. Many classifiers fail to identify patterns that belong to the minority class because they are built to minimize the overall error rate. A biased classification model is therefore to be expected, since high accuracy can be achieved by the majority class alone. Methods for dealing with the imbalanced classification problem work at either the data level or the algorithmic level. Data-level methods address the problem by making both classes equal in number. However, changing the distribution of the classes violates the original class distribution of the data. Algorithmic-level methods instead introduce a new optimization task to improve the minority class classification rate without changing the data characteristics. Nevertheless, the optimization task requires specific care to prevent the classification model from overfitting. Therefore, this thesis proposes a new classifier based on genetic programming (GP) and support vector machine (SVM) to solve the imbalanced classification problem without changing the data properties. The idea is to use GP to optimize the SVM decision function such that the minority class classification rate is increased without sacrificing the accuracy rate for both classes. In addition, the classifier is optimized to have a good generalization property. The key components of the new classifier are a new kernel method, a new learning metric and a new optimization algorithm for optimizing the SVM decision function. The proposed classifier is called Support Vector Genetic Programming Machine (SVGPM).
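The bias toward the majority class described in the summary above can be illustrated with a small worked example. The sketch below is not taken from the thesis: the data (95 majority examples, 5 minority examples), the always-majority classifier and the helper names `accuracy` and `class_rate` are assumptions chosen only to show how a high accuracy can coexist with a minority class classification rate of zero.

```python
def accuracy(y_true, y_pred):
    """Fraction of all examples that are labelled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_rate(y_true, y_pred, label):
    """Fraction of the examples of one class that are labelled correctly."""
    members = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in members) / len(members)


# Hypothetical data: 95 majority examples (0) and 5 minority examples (1).
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100

overall = accuracy(y_true, y_pred)
majority_rate = class_rate(y_true, y_pred, 0)
minority_rate = class_rate(y_true, y_pred, 1)

# Prints: accuracy=0.95 majority=1.00 minority=0.00
print(f"accuracy={overall:.2f} majority={majority_rate:.2f} minority={minority_rate:.2f}")
```

A classifier tuned only for the error rate has little incentive to move away from this kind of solution, which is why per-class classification rates are reported alongside accuracy on imbalanced data.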
To evaluate the performance of SVGPM against current methods for imbalanced classification tasks, three experiments are conducted: on selected standard class imbalanced benchmark data sets, on an intrusion detection system (IDS) data set and on a remote sensing data set. In the first experiment, the SVGPM performance is compared against SVM and cost-sensitive SVM due to the superiority of SVM in dealing with the imbalanced classification problem. The second experiment evaluates the SVGPM performance on detecting anomalous rare attacks in a network intrusion data set, where it is compared against current methods for developing a prediction model for IDS. In the third experiment, SVGPM is evaluated on a wilt disease data set from a remote sensing study, to identify wilt diseased trees in high-resolution images. The SVGPM performance is compared against previously proposed methods for mapping the regions covered by wilt diseased trees in Japan. The experiments show that SVGPM gives a very good classification rate for the minority class without sacrificing the accuracy rate for both classes. This is because, in the training stage, the optimization task introduced in SVGPM ensures that each minority class example is generalized into one learning concept and that the classification rates for the majority and minority classes are similar.
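The record does not say how the cost-sensitive SVM baseline is formulated. As a rough sketch of the general idea, a cost-sensitive SVM is commonly described as scaling the hinge loss of each example by a cost attached to its class, so that errors on the minority class are penalized more heavily. The function name `weighted_hinge_loss`, the decision values and the costs below are hypothetical, and the regularization term of the full SVM objective is left out.

```python
def weighted_hinge_loss(scores, labels, cost_pos, cost_neg):
    """Sum of hinge losses max(0, 1 - y * f(x)), each scaled by its class cost.

    Labels follow the SVM convention: +1 (here the minority class) and -1.
    """
    total = 0.0
    for score, y in zip(scores, labels):
        cost = cost_pos if y == 1 else cost_neg
        total += cost * max(0.0, 1.0 - y * score)
    return total


# Hypothetical decision values f(x) for two minority (+1) and
# three majority (-1) examples; one example of each class is misclassified.
scores = [-0.5, 2.0, -2.0, -1.5, 0.5]
labels = [1, 1, -1, -1, -1]

# Equal costs: both misclassified examples contribute 1.5 each.
plain = weighted_hinge_loss(scores, labels, cost_pos=1.0, cost_neg=1.0)
# Minority errors cost five times more: 5 * 1.5 + 1 * 1.5.
costly = weighted_hinge_loss(scores, labels, cost_pos=5.0, cost_neg=1.0)

print(plain, costly)  # 3.0 9.0
```

With equal costs the two mistakes are treated alike. Raising the minority cost makes the minority mistake dominate the loss, which pushes an optimizer toward decision functions that classify the minority class correctly.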