Embedded feature selection methods with high dimensionality for elastic net and logistic regression models

Feature selection and classification in high-dimensional data is a challenging problem in scientific research such as biology, medicine, and finance. In such data, highly correlated features and missing data often exist. Therefore, selecting informative features and adequate handling of missing valu...

Bibliographic Details
Main Author: Alharthi, Aiedh Mrisi
Format: Thesis
Language: English
Published: 2022
Subjects: QA Mathematics
Online Access: http://eprints.utm.my/id/eprint/102313/1/AiedhMrisiAlharthiPFS2022.pdf.pdf
id my-utm-ep.102313
record_format uketd_dc
spelling my-utm-ep.102313 2023-08-17T01:08:11Z Embedded feature selection methods with high dimensionality for elastic net and logistic regression models 2022 Alharthi, Aiedh Mrisi QA Mathematics 2022 Thesis http://eprints.utm.my/id/eprint/102313/ http://eprints.utm.my/id/eprint/102313/1/AiedhMrisiAlharthiPFS2022.pdf.pdf application/pdf en public http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:149202 phd doctoral Universiti Teknologi Malaysia Faculty of Science
institution Universiti Teknologi Malaysia
collection UTM Institutional Repository
language English
topic QA Mathematics
spellingShingle QA Mathematics
Alharthi, Aiedh Mrisi
Embedded feature selection methods with high dimensionality for elastic net and logistic regression models
description Feature selection and classification in high-dimensional data is a challenging problem in scientific fields such as biology, medicine, and finance. Such data often contain highly correlated features and missing values. Therefore, selecting informative features and handling missing values adequately are essential to finding an optimal model in terms of interpretability and prediction accuracy. In recent years, embedded feature selection methods, including penalized regression, have attracted many statisticians because these methods often yield model estimates with higher prediction accuracy. Nevertheless, when dealing with high-dimensional data, most penalized methods lack feature selection consistency, do not encourage grouping effects, and do not handle missing values. Hence, this study aims to improve feature selection and the handling of missing values by proposing several improvements to penalized high-dimensional approaches. An alternative initial weight was introduced in the adaptive least absolute shrinkage and selection operator (LASSO) to improve feature selection performance. Then, an initial ratio and adjusted variance weights inside the ℓ1-norm penalty of the adaptive elastic net were proposed to encourage the grouping effect. Furthermore, imputation penalized logistic regression with the adaptive LASSO approach was proposed to enhance the handling of missing values in high-dimensional data. Simulation studies with varying numbers of predictor variables, sample sizes, correlation coefficients, and proportions of missing values were performed to evaluate the effectiveness of the proposed methods. The proposed adaptive LASSO methods were compared with LASSO and other versions of adaptive LASSO, while the proposed adaptive elastic net methods were compared with the existing elastic net and adaptive elastic net methods. The proposed methods were also applied to a chemometrics dataset and eight gene expression microarray datasets in which the number of genes (features) exceeds the sample size. The results indicated that the proposed methods outperform their competitors in selecting the most relevant features and in achieving higher classification accuracy, sensitivity, and specificity. They also reduce dimensionality and select the most helpful features for cancer classification, resulting in optimal models that perform feature selection and patient classification concurrently. In addition, the proposed adaptive elastic net method is shown to be superior to the other methods in encouraging the grouping effect. In conclusion, this study shows that the proposed methods are appropriate for gene expression data classification and for other high-dimensional classification analyses.
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Alharthi, Aiedh Mrisi
author_facet Alharthi, Aiedh Mrisi
author_sort Alharthi, Aiedh Mrisi
title Embedded feature selection methods with high dimensionality for elastic net and logistic regression models
title_sort embedded feature selection methods with high dimensionality for elastic net and logistic regression models
granting_institution Universiti Teknologi Malaysia
granting_department Faculty of Science
publishDate 2022
url http://eprints.utm.my/id/eprint/102313/1/AiedhMrisiAlharthiPFS2022.pdf.pdf
_version_ 1776100893617291264
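
Illustrative note: the description above refers to the adaptive LASSO, in which each coefficient receives its own weight inside the ℓ1-norm penalty, and the weights come from an initial estimate. The sketch below shows only the generic two-stage idea and is not the thesis's method. It assumes a linear (not logistic) model, a ridge fit as the initial estimate, and weights of the form 1 / (|initial coefficient| + eps)^gamma; the alternative initial weights proposed in the thesis are not given in this record. All function names, parameters, and the simulated data are hypothetical.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, lam, weights, l2=0.0, n_iter=200):
    """Coordinate descent (hypothetical helper) for the objective
    (1/2n)*||y - Xb||^2 + lam * sum_j weights[j]*|b_j| + (l2/2)*||b||^2."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb; equals y because b starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of column j with the residual that excludes feature j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * weights[j]) / (col_sq[j] + l2)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def adaptive_lasso(X, y, lam, gamma=1.0, ridge=0.1, eps=1e-6):
    """Two-stage adaptive LASSO with ridge-based initial weights (an assumption)."""
    p = len(X[0])
    # Stage 1: ridge-only fit (no l1 penalty) gives the initial coefficients.
    init = weighted_lasso(X, y, lam=0.0, weights=[0.0] * p, l2=ridge)
    # Stage 2: small initial coefficients receive large penalties.
    weights = [1.0 / (abs(b) + eps) ** gamma for b in init]
    return weighted_lasso(X, y, lam, weights)


# Simulated data (illustrative only): 2 informative features, 6 noise features.
random.seed(0)
n, p = 80, 8
true_beta = [3.0, -2.0] + [0.0] * (p - 2)
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(x * b for x, b in zip(row, true_beta)) + random.gauss(0, 0.3) for row in X]

beta_hat = adaptive_lasso(X, y, lam=0.1)
print([round(b, 3) for b in beta_hat])
```

In this sketch, passing a positive l2 value in the second stage as well would add a ridge term next to the weighted ℓ1 penalty, which is the general shape of an adaptive elastic net penalty; the initial ratio and adjusted variance weights mentioned in the description are not reproduced here.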