An improved K-nearest neighbor with grasshopper optimization algorithm for missing data imputation

Bibliographic Details
Main Author: Nadzurah Zainal Abidin (Author)
Format: Thesis
Language: English
Published: Kuala Lumpur : Kulliyyah of Information and Communication Technology, International Islamic University Malaysia, 2020
Online Access: http://studentrepo.iium.edu.my/handle/123456789/9838
Description
Summary: As data cleaning processes have advanced, missing data has become widely recognized as one of the most common issues encountered in many research areas. Real-world datasets in domains such as medicine, business, transportation, and education are prone to being incomplete, especially when respondents do not respond because of stress, fatigue, or inadequate knowledge, when some of the questions are sensitive, or when the answer options presented are insufficient. One mechanism for handling missing data is imputation, the substitution of missing values with plausible values that achieve reasonable accuracy against the actual values. A large number of imputation algorithms have been proposed to estimate missing values. Unfortunately, most of the imputation methods employed provide less reliable estimates. Therefore, to deal with missing data accurately, this thesis proposes an optimization of one of the state-of-the-art imputation algorithms, K-nearest neighbors (KNN). KNN has been widely adopted for missing data imputation because of its robustness and simplicity, and it is a promising method that can outperform other machine learning methods. However, in many cases KNN suffers from high computational cost, large storage requirements, sensitivity to noise, high time complexity, and difficulty in choosing the right centroid position and the distance function. As a result, conventional KNN imputation can still produce undesirable results. Accordingly, this thesis develops an optimized KNN imputation method using the Grasshopper Optimization Algorithm (GOA) to produce better imputation results. GOA is a recent population-based metaheuristic that has shown improved results and efficiency in tackling missing data problems.
GOA, which is inspired by the natural behavior of grasshoppers, is incorporated into the algorithm structure to maximize the imputation performance of KNN. The proposed algorithm is applied to nine different datasets and compared with other optimization algorithms, namely Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Dragonfly Optimization (DA), Firefly Algorithm (FFA), Ant Lion Optimization (ALO), and Moth Flame Optimization (MFO), in terms of statistical correlation, error, and running time. The results show that KNNGOA has the most promising performance, outperforming the other optimization algorithms in imputation accuracy and computing time on large datasets with higher missing rates (20 percent and above). A statistical test was also conducted, and it supports the conclusion of the experiment.
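Illustration: the summary describes imputation as replacing missing values with plausible ones drawn from similar records. The following is a minimal sketch of plain KNN imputation only, not the thesis's KNNGOA method. The function name `knn_impute`, the choice of k, the averaged Euclidean distance over shared features, and the toy data are all assumptions made for this example. The summary does not state which KNN settings GOA tunes, so no optimizer is shown.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest donor rows.

    Hypothetical helper for illustration. Distance is Euclidean over
    the features both rows have observed, averaged over the number of
    shared features (an assumption of this sketch).
    """
    imputed = [list(r) for r in rows]  # copy so the input is not modified
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that observed feature j.
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue  # no common features, distance undefined
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            if not candidates:
                continue  # nothing to borrow from; leave the gap
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Toy data (made up): the second row is missing its third feature.
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [5.0, 6.0, 9.0],
    [1.2, 1.9, 3.2],
]
result = knn_impute(data, k=2)
```

In this toy example the two rows closest to the incomplete row are the first and fourth, so the gap is filled with the average of their third features. A metaheuristic such as GOA could in principle search over settings like k or the distance function, but that coupling is a guess here rather than something the record confirms.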
Item Description: Abstracts in English and Arabic.
"A thesis submitted in fulfilment of the requirement for the degree of Master in Computer Science." --On title page.
Physical Description: xv, 110 leaves : illustrations ; 30 cm.
Bibliography: Includes bibliographical references (leaves 101-108).