Principal component and multiple correspondence analysis for handling mixed variables in the smoothed location model

Bibliographic Details
Main Author: Ngu, Penny Ai Huong
Format: Thesis
Language: eng
Published: 2016
Online Access: https://etd.uum.edu.my/6034/1/s817094_01.pdf
https://etd.uum.edu.my/6034/2/s817094_02.pdf
Description
Summary: Classifying objects into groups when the measured variables are a mixture of continuous and binary variables has attracted the attention of statisticians. Among discriminant methods for classification, the Smoothed Location Model (SLM) handles data that contain both continuous and binary variables simultaneously. However, this model becomes infeasible when the data contain a large number of binary variables. Many binary variables create numerous multinomial cells, which in turn leads to a large number of empty cells. Past studies have shown that many empty cells degrade the performance of the constructed smoothed location model. To overcome the problem of many empty cells caused by a large number of measured variables (mainly binary), this study proposes four new SLMs by combining the existing SLM with Principal Component Analysis (PCA) and four types of Multiple Correspondence Analysis (MCA). PCA is used to handle the large number of continuous variables, whereas MCA is used to deal with the large number of binary variables. The performance of the four proposed models, SLM+PCA+Indicator MCA, SLM+PCA+Burt MCA, SLM+PCA+Joint Correspondence Analysis (JCA), and SLM+PCA+Adjusted MCA, is compared based on the misclassification rate. Results of a simulation study show that the SLM+PCA+JCA model performs best in all tested conditions, since it extracted the smallest number of binary components and ran with the shortest computational time. Investigations on the real full breast cancer data set also showed that this model produces the lowest misclassification rate. The next lowest misclassification rate is obtained by SLM+PCA+Adjusted MCA, followed by the SLM+PCA+Burt MCA and SLM+PCA+Indicator MCA models. Although the SLM+PCA+Indicator MCA model gives the poorest performance, it is still better than a few existing classification methods. Overall, the developed smoothed location models can be considered alternative methods for classification tasks involving a large number of mixed variables, mainly binary ones.
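
The empty-cell problem described in the summary can be illustrated with a small sketch. With b binary variables there are 2^b multinomial cells, so for a fixed sample size most cells are necessarily empty once b grows. The function name count_empty_cells, the sample size of 100, and the use of independent, uniformly random binary variables are assumptions made for illustration only; they are not the simulation design of the thesis.

```python
import random


def count_empty_cells(n_objects, n_binary, seed=0):
    """Return (total_cells, empty_cells) for randomly generated binary data.

    Hypothetical helper: each object gets n_binary independent 0/1 values,
    which is an illustrative assumption, not the thesis's data model.
    """
    rng = random.Random(seed)
    total_cells = 2 ** n_binary  # one multinomial cell per binary pattern
    occupied = set()
    for _ in range(n_objects):
        # An object falls into the cell identified by its binary pattern.
        pattern = tuple(rng.randint(0, 1) for _ in range(n_binary))
        occupied.add(pattern)
    return total_cells, total_cells - len(occupied)


if __name__ == "__main__":
    for b in (2, 5, 10, 15):
        total, empty = count_empty_cells(n_objects=100, n_binary=b)
        print(f"b={b:2d}  cells={total:6d}  empty={empty:6d}  ({empty / total:.1%})")
```

Since 100 objects can occupy at most 100 cells, at least 924 of the 1024 cells are empty at b = 10 regardless of how the data are distributed. This is the motivation for reducing the binary variables to a few components before fitting the location model.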