Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques

Data Analytics (DA) is a technology used to make correct decisions through proper analysis and prediction. Data cleaning is the most important and essential process in DA. Inappropriate data may lead to poor analysis and thus yield unacceptable conclusions. It is found from the literature review tha...

Bibliographic Details
Main Author: Mohd Zebaral Hoque, Jesmeen
Format: Thesis
Published: 2019
id my-mmu-ep.7737
record_format uketd_dc
spelling my-mmu-ep.7737 2023-03-06T06:55:37Z Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques 2019-01 Mohd Zebaral Hoque, Jesmeen Q300-390 Cybernetics 2019-01 Thesis http://shdl.mmu.edu.my/7737/ http://erep.mmu.edu.my/ masters Multimedia University Faculty of Engineering & Technology EREP ID: 3393
institution Multimedia University
collection MMU Institutional Repository
topic Q300-390 Cybernetics
description Data Analytics (DA) is a technology used to make correct decisions through proper analysis and prediction. Data cleaning is the most important process in DA: inappropriate data may lead to poor analysis and thus to unacceptable conclusions. The literature review shows that incompleteness (missing values), duplication, inconsistency and inaccuracy are the four common dimensions of dirty data. This research considers the first three. Hence, the objective was to design and develop an approach for the cleaning phase in DA that addresses these three dimensions. Python's 'Pandas' and 'NumPy' libraries are used to resolve the issues caused by duplicate and inconsistent data. A new architecture to predict missing data in a dataset was developed, which includes data sampling and feature selection. Because the volume of data may be huge, data sampling uses a divide-and-conquer strategy to find the minimum consistent subset. The sampled data with the selected features is then used to train four prediction models until their classification accuracy becomes stable. These four models were selected from eight well-known classification algorithms used by other researchers: Logistic Regression, Linear Regression (LR), Linear SVM, AdaBoost Classifier, K-Nearest Neighbour, SGDClassifier, Gradient Boosting and Random Forest (RF). Their performances were compared on accuracy, ROC percentages and time taken for training and testing.
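
The cleaning steps named in the description can be illustrated with a minimal sketch. It is not the thesis's implementation: the column names, the toy data and both helper functions are hypothetical, Random Forest is picked arbitrarily from the eight algorithms listed (the record does not say which four were selected), and the data sampling and feature selection stages are left out. It assumes pandas, NumPy and scikit-learn are installed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def clean_duplicates_and_inconsistency(df, text_cols):
    """Normalise inconsistent text values, then drop exact duplicate rows."""
    out = df.copy()
    for col in text_cols:
        # Trim whitespace and unify letter case so "Young " and "young" match.
        out[col] = out[col].str.strip().str.lower()
    return out.drop_duplicates().reset_index(drop=True)


def fill_missing_labels(df, target, features):
    """Predict missing values of `target` from complete numeric `features`."""
    out = df.copy()
    known = out[target].notna()
    if known.all() or not known.any():
        return out  # nothing to predict, or nothing to learn from
    # Hypothetical model choice: one of the eight algorithms in the abstract.
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(out.loc[known, features], out.loc[known, target])
    out.loc[~known, target] = model.predict(out.loc[~known, features])
    return out


# Toy data (invented for illustration): one duplicate pair hidden by
# inconsistent spelling, and one missing label.
raw = pd.DataFrame({
    "age": [25, 25, 40, 41, 60, 62, 39],
    "income": [30, 30, 80, 82, 50, 52, 79],
    "segment": ["Young ", "young", "mid", "MID", "senior", "Senior", np.nan],
})

clean = clean_duplicates_and_inconsistency(raw, ["segment"])
filled = fill_missing_labels(clean, "segment", ["age", "income"])
print(filled)
```

Normalising text before calling `drop_duplicates` matters here: the first two rows only become identical once whitespace and case are unified. The imputation step trains only on rows whose label is present, so the predicted value is always one of the labels already seen in the data.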
format Thesis
qualification_level Master's degree
author Mohd Zebaral Hoque, Jesmeen
title Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques
granting_institution Multimedia University
granting_department Faculty of Engineering & Technology
publishDate 2019
_version_ 1776101415127613440