Logistic regression methods for classification of imbalanced data sets

Classification of imbalanced data sets is an important research topic in the data mining community, since the data sets in many real-world problems have imbalanced class distributions. This thesis aims to develop simple and effective imbalanced classification algorithms by previously improv...

Bibliographic Details
Main Author: Santi Puteri Rahayu
Format: Thesis
Language: English
Published: 2012
Subjects: QA Mathematics
Online Access: http://umpir.ump.edu.my/id/eprint/3649/1/Logistic%20regression%20methods%20for%20classification%20of%20imbalanced%20data%20sets.pdf
id my-ump-ir.3649
record_format uketd_dc
spelling my-ump-ir.3649 2023-03-23T06:15:11Z Logistic regression methods for classification of imbalanced data sets 2012-09 Santi Puteri Rahayu, - QA Mathematics 2012-09 Thesis http://umpir.ump.edu.my/id/eprint/3649/ http://umpir.ump.edu.my/id/eprint/3649/1/Logistic%20regression%20methods%20for%20classification%20of%20imbalanced%20data%20sets.pdf pdf en public masters Universiti Malaysia Pahang Faculty of Computer System & Software Engineering R.J Asnim, Ohamadz Zain
institution Universiti Malaysia Pahang Al-Sultan Abdullah
collection UMPSA Institutional Repository
language English
advisor R.J Asnim, Ohamadz Zain
topic QA Mathematics
description Classification of imbalanced data sets is an important research topic in the data mining community, since the data sets in many real-world problems have imbalanced class distributions. This thesis aims to develop simple and effective imbalanced classification algorithms by first improving the performance of two general classifiers, Kernel Logistic Regression Newton-Raphson (KLR-NR) and Regularized Logistic Regression Newton-Raphson (RLR-NR), both of which are Logistic Regression (LR)-based methods. Both LR-based methods have a strong statistical foundation and are well-known classifiers. They require only the simple solution of an unconstrained optimization problem, yet perform as well as the Support Vector Machine (SVM), which is regarded as the state-of-the-art classifier in kernel methods and the data mining community. However, imbalanced LR-based methods have not been developed as extensively as imbalanced SVM-based methods. Hence, effective imbalanced LR-based methods are needed so that they can be widely used in data mining applications. Numerical results show that applying the Truncated Newton method to KLR-NR and RLR-NR, which yields Newton Truncated Regularized KLR (NTR-KLR) and NTR RLR (NTR-LR) respectively, is effective in handling the numerical problems posed by the huge matrix in the linear system of the Newton-Raphson update rule, namely the training time and the singularity problem. These results can be seen as a further explanation of the success of the Truncated Newton method in the TR-KLR and TR Iteratively Re-weighted Least Squares (TR-IRLS) algorithms respectively, because these algorithms use an equivalent iterative method. Moreover, using only the simple solution of an unconstrained optimization problem, the numerical results demonstrate that the proposed NTR-KLR and NTR-LR each achieve classification performance comparable to RBFSVM (SVM with a Radial Basis Function kernel).
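The truncated Newton idea mentioned above can be illustrated with a minimal sketch. This is not the thesis's NTR-LR implementation; it assumes a plain L2-regularized logistic regression, and the names `fit_truncated_newton`, `gradient`, `lam`, and `cg_iters` are hypothetical. Each Newton-Raphson step solves the linear system H d = -g only approximately, with a capped number of conjugate-gradient iterations, and the Hessian is used only through matrix-vector products so the full matrix is never formed or inverted.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gradient(X, y, w, lam):
    # Gradient of the L2-regularized negative log-likelihood:
    # X^T (p - y) + lam * w
    p = [sigmoid(dot(row, w)) for row in X]
    return [sum((p[i] - y[i]) * X[i][j] for i in range(len(X))) + lam * w[j]
            for j in range(len(w))]


def fit_truncated_newton(X, y, lam=1.0, newton_iters=20, cg_iters=10, tol=1e-8):
    # Assumption: lam > 0, so the Hessian X^T S X + lam*I is positive definite.
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(newton_iters):
        g = gradient(X, y, w, lam)
        if math.sqrt(dot(g, g)) < tol:
            break
        p = [sigmoid(dot(row, w)) for row in X]
        s = [pi * (1.0 - pi) for pi in p]  # diagonal of S

        def hess_vec(v):
            # Hessian-vector product without building the Hessian.
            Xv = [dot(row, v) for row in X]
            return [sum(s[i] * Xv[i] * X[i][j] for i in range(len(X))) + lam * v[j]
                    for j in range(n_features)]

        # Conjugate gradient on H d = -g, truncated after cg_iters steps.
        d = [0.0] * n_features
        r = [-gj for gj in g]
        q = r[:]
        rr = dot(r, r)
        for _ in range(cg_iters):
            Hq = hess_vec(q)
            alpha = rr / dot(q, Hq)
            d = [dj + alpha * qj for dj, qj in zip(d, q)]
            r = [rj - alpha * hj for rj, hj in zip(r, Hq)]
            rr_new = dot(r, r)
            if math.sqrt(rr_new) < tol:
                break
            q = [rj + (rr_new / rr) * qj for rj, qj in zip(r, q)]
            rr = rr_new
        w = [wj + dj for wj, dj in zip(w, d)]
    return w
```

Calling `fit_truncated_newton(X, y)` with rows of `X` as feature lists (including a constant 1.0 for the intercept, if wanted) and `y` as 0/1 labels returns the weight vector. The sketch takes full Newton steps with no line search, which is a simplification.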
The imbalance problem of both proposed general classification algorithms, namely their limited accuracy in classifying the minority class, motivated this research to improve their classification performance on imbalanced data sets. In general, the numerical results show that applying adapted Modified AdaBoost methods to NTR-KLR and NTR-LR, which yields AdaBoost NTR Weighted KLR (AB-WKLR) and AB NTR Weighted RLR (AB-WLR) respectively, significantly improves the accuracy and stability of the general classifiers NTR-KLR and NTR-LR. Improvements in both the error measured by g-means and the standard deviation of g-means under 5-fold SCV reached more than 60. Furthermore, the numerical results demonstrate that the proposed AB-WKLR and AB-WLR each perform comparably to AdaBoostSVM in classifying imbalanced data sets, using only the simple solution of an unconstrained weighted optimization problem. Thus, both proposed imbalanced LR-based methods are simple and effective for the classification of imbalanced data sets and show promising results.
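The abstract reports results in terms of g-means without defining it. A minimal sketch follows, assuming the usual definition for binary problems: the geometric mean of the accuracy on the positive (minority) class and the accuracy on the negative class. The function name `g_mean` and the `positive` parameter are hypothetical and not taken from the thesis.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    # Count the four confusion-matrix cells for a binary problem.
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    # Per-class accuracies; a class with no examples contributes 0.
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)
```

On a set with 8 negatives and 2 positives, a classifier that predicts every example as negative reaches 80% plain accuracy but a g-mean of 0, which is why the measure is preferred over accuracy when the minority class matters.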
format Thesis
qualification_level Master's degree
author Santi Puteri Rahayu
title Logistic regression methods for classification of imbalanced data sets
granting_institution Universiti Malaysia Pahang
granting_department Faculty of Computer System & Software Engineering
publishDate 2012
url http://umpir.ump.edu.my/id/eprint/3649/1/Logistic%20regression%20methods%20for%20classification%20of%20imbalanced%20data%20sets.pdf