An improved K-nearest neighbor with grasshopper optimization algorithm for missing data imputation

Alongside advances in data cleaning, missing data has become recognized as one of the most common issues encountered in many research areas. Real-world datasets, such as those from medicine, business, transportation and education, are prone to being incomplete, especially whe...

Bibliographic Details
Main Author: Nadzurah Zainal Abidin (Author)
Format: Thesis
Language: English
Published: Kuala Lumpur : Kulliyyah of Information and Communication Technology, International Islamic University Malaysia, 2020
Subjects: Computer algorithms; Heuristic algorithms; Metaheuristics; Missing observations (Statistics)
Online Access: http://studentrepo.iium.edu.my/handle/123456789/9838
LEADER 049140000a22004450004500
008 200922s2020 my a f m 000 0 eng d
040 |a UIAM  |b eng  |e rda 
041 |a eng 
043 |a a-my--- 
050 0 0 |a QA76.9.A43 
100 0 |a Nadzurah Zainal Abidin,  |e author 
245 1 3 |a An improved K-nearest neighbor with grasshopper optimization algorithm for missing data imputation /  |c by Nadzurah Zainal Abidin 
264 1 |a Kuala Lumpur :  |b Kulliyyah of Information and Communication Technology, International Islamic University Malaysia,  |c 2020 
300 |a xv, 110 leaves :  |b illustrations ;  |c 30cm. 
336 |2 rdacontent  |a text 
337 |2 rdamedia  |a unmediated 
337 |2 rdamedia  |a computer 
338 |2 rdacarrier  |a volume 
338 |2 rdacarrier  |a computer disc 
338 |2 rdacarrier  |a online resource 
347 |2 rdaft  |a text file  |b PDF 
500 |a Abstracts in English and Arabic. 
500 |a "A thesis submitted in fulfilment of the requirement for the degree of Master in Computer Science." --On title page. 
502 |a Thesis (MCS)--International Islamic University Malaysia, 2020. 
504 |a Includes bibliographical references (leaves 101-108). 
520 |a Alongside advances in data cleaning, missing data has become recognized as one of the most common issues encountered in many research areas. Real-world datasets, such as those from medicine, business, transportation and education, are prone to being incomplete, especially when respondents do not answer because of stress, fatigue or inadequate knowledge, when some of the questions are sensitive, or when the answer options presented are lacking. One way of handling missing data is imputation, which substitutes missing values with plausible values that are reasonably accurate against the actual values. A large number of imputation algorithms have been proposed to estimate missing values. Unfortunately, most of the imputation methods employed provide less reliable estimates. Therefore, to deal with missing data accurately, an optimization of one of the state-of-the-art imputation algorithms, K-nearest neighbors (KNN), is proposed to impute the missing values. KNN has been widely adopted as an imputation algorithm for missing data because of its robustness and simplicity, and it is also a promising method that can outperform other machine learning methods. However, in many cases KNN suffers from high computational cost, greater storage requirements, sensitivity to noise and high time complexity, and it is difficult to choose the right centroid position and distance function. Conventional KNN imputation can therefore still produce undesirable results. Accordingly, this thesis proposes an optimized KNN imputation method that uses the Grasshopper Optimization Algorithm (GOA) to produce better imputation results. GOA is a recent population-based metaheuristic that has shown improved results and efficiency in tackling issues with missing data. 
GOA, which is inspired by the natural behavior of grasshoppers, is incorporated into the algorithm structure to maximize the imputation performance of KNN. The proposed algorithm is applied to nine different datasets and compared with other optimization algorithms: Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Dragonfly Optimization (DA), Firefly Algorithm (FFA), Ant Lion Optimization (ALO), and Moth Flame Optimization (MFO), in terms of statistical correlation, error accuracy, and running time. The results show that KNNGOA has the most promising performance, outperforming the other optimization algorithms in imputation accuracy and computing time for large datasets with higher missing rates (20 percent and above). A statistical test is also conducted, and it supports the conclusion of the experiment. 
596 |a 1 
650 0 |a Computer algorithms 
650 0 |a Heuristic algorithms 
650 0 |a Metaheuristics 
650 0 |a Missing observations (Statistics) 
655 7 |a Theses, IIUM local 
690 |a Dissertations, Academic  |x Department of Computer Science  |z IIUM 
700 0 |a Amelia Ritahani Ismail,  |e degree supervisor 
710 2 |a International Islamic University Malaysia.  |b Department of Computer Science 
856 4 |u http://studentrepo.iium.edu.my/handle/123456789/9838 
900 |a sz-ash-sar 
999 |c 439310  |d 470824 
952 |0 0  |6 T QA 000076.9 A43 N126I 2020  |7 0  |8 THESES  |9 761751  |a IIUM  |b IIUM  |c MULTIMEDIA  |g 0.00  |o t QA 76.9 A43 N126I 2020  |p 11100418043  |r 2021-04-21  |t 1  |v 0.00  |y THESIS 
952 |0 0  |6 TS CDF QA 76.9 A43 N126I 2020  |7 0  |8 THESES  |9 859264  |a IIUM  |b IIUM  |c MULTIMEDIA  |g 0.00  |o ts cdf QA 76.9 A43 N126I 2020  |p 11100418044  |r 2021-04-21  |t 1  |v 0.00  |y THESISDIG