Missing data imputation framework for early childhood longitudinal data: a study case on NCDRC data
This research aims to develop an imputation framework for the National Childhood Development Research Centre (NCDRC)'s missing data. Missing data and other associated issues, such as outliers, time points, noise, and continuity, were the main challenges in this research. The nature of the NCDRC dataset...
Main Author: | |
---|---|
Format: | thesis |
Language: | eng |
Published: | 2019 |
Subjects: | |
Online Access: | https://ir.upsi.edu.my/detailsg.php?det=6762 |
Summary: | This research aims to develop an imputation framework for the National Childhood Development Research Centre (NCDRC)'s missing data. Missing data and other associated issues, such as outliers, time points, noise, and continuity, were the main challenges in this research. The nature of the NCDRC dataset was not consistent with those reported in the literature, with the latter being more randomly scattered and copious and having no patterns, making it difficult to find and select relevant experimental data. The VIseKriterijumska Optimizacija Kompromisno Resenje (VIKOR) method was utilized to select the best continuous portion of Body Mass Index (BMI) data over 182 different portions, which accounted for 911 participants (i.e. children with complete records) over seven (7) continuous time points. Three different machine learning algorithms to impute the missing data were tested and evaluated, namely K-nearest Neighbour (KNN), Naïve Bayes (NB), and Decision Tree (DT). Three evaluation performance indicators, namely t-test, Coefficient of Determination, and Root Mean Square Error, were used in the experiment using three configurations based on 5%, 10%, and 15% missing data. The results of the experiment showed that KNN's performance scores were significantly higher than those of the other algorithms. Out of all scores, KNN achieved 95.23% of the scores, followed by NB with 94.04% and DT with 83.33%, clearly indicating that KNN outperformed DT and NB in the imputation of missing data. In conclusion, the main finding suggests that the KNN algorithm is the most effective algorithm for imputing missing data. The implication of this study is that practitioners, especially NCDRC's personnel, can use the proposed missing data imputation framework to help impute missing data of similar datasets. |
---|
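The evaluation design described in the summary (hide a known fraction of complete records, impute them, then score the imputed values against the hidden truth) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the record does not state the value of k, the distance measure, or the software used, and the data below are synthetic BMI-like numbers, not NCDRC data. The function names `knn_impute`, `rmse`, `r_squared`, and `run_demo` are hypothetical, and the t-test indicator is omitted.

```python
import math
import random


def knn_impute(rows, k=3):
    """Fill None cells with the mean of the k nearest rows' values in that column.

    Distance is the root-mean-square difference over columns observed in both
    rows (an assumption; the thesis record does not name a distance measure).
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Fallback pool: every observed value in this column.
            column = [r[j] for r in rows if r[j] is not None]
            candidates = []
            for other_idx, other in enumerate(rows):
                if other_idx == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                    candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]] or column
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def rmse(actual, predicted):
    """Root Mean Square Error between true and imputed values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of Determination of imputed values against true values."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def run_demo(missing_rate, seed=0, n_children=60, n_points=7):
    """Mask a fraction of synthetic complete records, impute, and score."""
    rng = random.Random(seed)
    complete = []
    for _ in range(n_children):
        base = rng.uniform(14.0, 18.0)  # synthetic BMI-like baseline
        complete.append([base + 0.1 * t + rng.gauss(0.0, 0.2) for t in range(n_points)])
    cells = [(i, j) for i in range(n_children) for j in range(n_points)]
    hidden = rng.sample(cells, round(missing_rate * len(cells)))
    masked = [list(r) for r in complete]
    for i, j in hidden:
        masked[i][j] = None
    imputed = knn_impute(masked, k=3)
    actual = [complete[i][j] for i, j in hidden]
    predicted = [imputed[i][j] for i, j in hidden]
    return rmse(actual, predicted), r_squared(actual, predicted)


if __name__ == "__main__":
    for rate in (0.05, 0.10, 0.15):  # the three configurations named in the summary
        err, r2 = run_demo(rate)
        print(f"missing={rate:.0%}  RMSE={err:.3f}  R2={r2:.3f}")
```

Lower RMSE and higher Coefficient of Determination indicate that imputed values sit closer to the hidden true values. The same masking and scoring loop could be reused to compare other imputers, such as the Naïve Bayes and Decision Tree approaches named in the summary.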