Binary logistic regression modelling with appropriate sample size in determining graduate employability factors for public universities in Malaysia

Bibliographic Details
Main Author: Tengku Mohamed, Tengku Salbiah
Format: Thesis
Language: English
Published: 2020
Online Access: http://eprints.utm.my/id/eprint/101869/1/TengkuSalbiahTengkuMohamedMFS2020.pdf
Description
Summary: The performance of variable selection is essential to building an effective logistic regression model. Generally, p-values are used to identify significant variables or factors in the model. However, when dealing with real tracer study data for a country, the data set is typically large, which deflates the p-values and degrades variable selection performance. Therefore, it is crucial to have an appropriate sample size and sampling ratio for this purpose. In this study, an appropriate sample size is proposed based on simulated correlation tests and significant variables in order to improve the accuracy of variable selection. In addition, the sampling ratio in the response variable performs best when it reflects the population ratio. Based on the proposed samples, a logistic regression model for graduate employability factors is subsequently proposed. The study finds that age, Cumulative Grade Point Average (CGPA), discipline of study, gender, state, and type of university are the factors that significantly affect graduate employability among public universities in Malaysia. The results show that the proposed model improves variable selection, model fitting, and classification accuracy compared with the full model. Thus, by using a smaller sample size, the proposed model maintains its statistical power in a real-data scenario by accurately selecting the significant factors.
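The two ideas in the summary, p-values shrinking as the sample grows and sampling so that the response ratio matches the population ratio, can be illustrated with a minimal sketch. This is not the thesis's procedure: it swaps in a simple pooled two-proportion z-test in place of logistic regression, and the employment rates (0.80 vs 0.81), the 80/20 class split, and the function names `two_proportion_p_value` and `stratified_indices` are hypothetical values and helpers chosen only for illustration.

```python
import math
import random


def two_proportion_p_value(p1, p2, n_per_group):
    """Two-sided p-value of a pooled two-proportion z-test, equal group sizes."""
    pooled = (p1 + p2) / 2
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = (p1 - p2) / se
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))


def stratified_indices(labels, sample_size, seed=0):
    """Draw row indices so that each response class keeps its population share."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    chosen = []
    for idx in by_class.values():
        # Per-class quota; rounding may shift the total by a row or two in general
        k = round(sample_size * len(idx) / len(labels))
        chosen.extend(rng.sample(idx, k))
    return chosen


# Hypothetical effect: a one-point gap in employment rate between two groups.
# The gap is fixed; only the sample size changes.
p_values = {n: two_proportion_p_value(0.80, 0.81, n) for n in (500, 5_000, 500_000)}
for n, p in p_values.items():
    print(f"n per group = {n:>7}: p-value = {p:.3g}")

# Hypothetical population: 80% employed (1), 20% not employed (0).
labels = [1] * 8_000 + [0] * 2_000
sample = stratified_indices(labels, 1_000)
share = sum(labels[i] for i in sample) / len(sample)
print(f"employed share in sample = {share:.2f}")
```

The same negligible difference moves from clearly non-significant to overwhelmingly significant purely because the sample grows, which is the behaviour the summary describes as deflated p-values. The stratified draw keeps the employed share of the sample equal to that of the hypothetical population.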