Binary logistic regression modelling with appropriate sample size in determining graduate employability factors for public universities in Malaysia

Bibliographic Details
Main Author: Tengku Mohamed, Tengku Salbiah
Format: Thesis
Language: English
Published: 2020
Online Access: http://eprints.utm.my/id/eprint/101869/1/TengkuSalbiahTengkuMohamedMFS2020.pdf
Description
Summary: The performance of variable selection is essential to building an effective logistic regression model. Generally, p-values are used to identify the significant variables or factors in the model. However, real tracer study data for a whole country are typically very large, which deflates the p-values and degrades variable selection performance. It is therefore crucial to use an appropriate sample size and sampling ratio. In this study, an appropriate sample size is proposed based on simulated correlation tests and significant variables in order to improve the accuracy of variable selection. In addition, the sampling ratio in the response variable performs best when it reflects the population ratio. Based on the proposed samples, a logistic regression model for graduate employability factors is then proposed. Age, Cumulative Grade Point Average (CGPA), discipline of study, gender, state, and type of university are found to significantly affect graduate employability among public universities in Malaysia. The results show that the proposed model improves variable selection, model fitting, and classification accuracy compared with the full model. Thus, by using a smaller sample size, the proposed model maintains its statistical power in a real-data scenario by accurately selecting the significant factors.
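The two ideas in the summary, p-values shrinking as the sample grows and subsampling in a way that keeps the population ratio of the response, can be illustrated with a small sketch. This is not the thesis's procedure: it swaps in a pooled two-proportion z-test as a simple stand-in for a significance test on a single binary predictor, and the employment rates, group sizes, and function names (`two_proportion_p_value`, `stratified_subsample`) are all hypothetical.

```python
import math
import random


def two_proportion_p_value(rate_a, rate_b, n_per_group):
    """Two-sided p-value of a pooled two-proportion z-test (equal group sizes)."""
    pooled = (rate_a + rate_b) / 2
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = (rate_a - rate_b) / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def stratified_subsample(labels, sample_size, seed=0):
    """Return indices of a subsample whose 0/1 ratio matches that of `labels`."""
    rng = random.Random(seed)
    ones = [i for i, y in enumerate(labels) if y == 1]
    zeros = [i for i, y in enumerate(labels) if y == 0]
    n_ones = round(sample_size * len(ones) / len(labels))
    return rng.sample(ones, n_ones) + rng.sample(zeros, sample_size - n_ones)


# Hypothetical employment rates for two groups: a one-point difference.
rate_a, rate_b = 0.80, 0.81

# The same small difference moves from "not significant" to "highly
# significant" purely because the sample size increases.
for n in (500, 5_000, 50_000, 500_000):
    print(n, two_proportion_p_value(rate_a, rate_b, n))

# Hypothetical population: 80% employed (1), 20% unemployed (0).
population = [1] * 8_000 + [0] * 2_000
sample_idx = stratified_subsample(population, sample_size=500)
employed_share = sum(population[i] for i in sample_idx) / len(sample_idx)
print(employed_share)
```

With a fixed, practically negligible difference between the groups, the p-value falls steadily as the group size rises, which is why significance tests on very large tracer datasets tend to flag almost every variable. The stratified subsample keeps the employed share of the sample equal to that of the hypothetical population, in the spirit of the sampling-ratio finding reported in the summary.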