Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques

Data Analytics (DA) is a technology used to make correct decisions through proper analysis and prediction. Data cleaning is the most important and essential process in DA. Inappropriate data may lead to poor analysis and thus yield unacceptable conclusions. It is found from the literature review tha...

Bibliographic Details
Main Author: Mohd Zebaral Hoque, Jesmeen
Format: Thesis
Published: 2019
Subjects: Q300-390 Cybernetics
id my-mmu-ep.7737
record_format uketd_dc
spelling my-mmu-ep.7737 2023-03-06T06:55:37Z Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques 2019-01 Mohd Zebaral Hoque, Jesmeen Q300-390 Cybernetics 2019-01 Thesis http://shdl.mmu.edu.my/7737/ http://erep.mmu.edu.my/ masters Multimedia University Faculty of Engineering & Technology EREP ID: 3393
institution Multimedia University
collection MMU Institutional Repository
topic Q300-390 Cybernetics
description Data Analytics (DA) is a technology used to make correct decisions through proper analysis and prediction. Data cleaning is the most important process in DA: inappropriate data may lead to poor analysis and thus to unacceptable conclusions. The literature review finds that incompleteness (missing values), duplication, inconsistency and inaccuracy are the four common dimensions of dirty data. This research considers the first three. The objective was therefore to design and develop an approach for the cleaning phase in DA that addresses these three dimensions. Python's pandas and NumPy libraries are used to resolve duplicate and inconsistent records. A new architecture for predicting missing data in a dataset was developed, which includes data sampling and feature selection. Because the volume of data may be very large, data sampling uses a divide-and-conquer strategy to find the minimum consistent subset. The sampled data, with the selected features, is then used to train four prediction models until their classification accuracy becomes stable. These four models were selected from eight well-known classification algorithms used by other researchers: Logistic Regression, Linear Regression (LR), Linear SVM, AdaBoost Classifier, K-Nearest Neighbour, SGDClassifier, Gradient Boosting and Random Forest (RF). Their performance was compared on accuracy, ROC percentage, and the time taken for training and testing.
format Thesis
qualification_level Master's degree
author Mohd Zebaral Hoque, Jesmeen
title Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques
granting_institution Multimedia University
granting_department Faculty of Engineering & Technology
publishDate 2019
_version_ 1776101415127613440
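
Illustrative sketch (not taken from the thesis): the description names three cleaning steps, namely resolving inconsistency and duplicates with pandas and NumPy, then predicting missing values with a trained classifier. The Python example below shows one way those steps might fit together on a toy table. The column names, the sample rows, the function name clean_records and the choice of Random Forest with these settings are all assumptions made for illustration. Random Forest is one of the eight algorithms listed, but the record does not say which four were selected, and the data sampling and feature selection stages of the thesis architecture are not shown.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy records; column names and values are illustrative only.
raw = pd.DataFrame({
    "city": ["Cyberjaya", " cyberjaya", "MELAKA", "Melaka ",
             "Melaka ", "Cyberjaya", "Melaka", "Cyberjaya"],
    "reading": [1.0, 1.2, 9.0, 9.5, 9.5, 0.8, 8.7, 1.1],
    "level": ["low", "Low", "high", "HIGH", "HIGH", "N/A", "", "low"],
})


def clean_records(frame: pd.DataFrame) -> pd.DataFrame:
    """Resolve inconsistency and duplicates, then predict missing labels."""
    df = frame.copy()

    # 1. Inconsistency: normalise spacing and letter case, and map the
    #    different "missing" markers to a single NaN value.
    df["city"] = df["city"].str.strip().str.title()
    level = df["level"].str.strip().str.lower()
    df["level"] = level.mask(level.isin(["n/a", ""]), np.nan)

    # 2. Duplicates: drop rows that are identical after normalisation.
    df = df.drop_duplicates().reset_index(drop=True)

    # 3. Missing values: fit a classifier on rows with a known label and
    #    use it to predict the label of the rows where it is missing.
    missing = df["level"].isna()
    known = df[~missing]
    if missing.any() and not known.empty:
        model = RandomForestClassifier(n_estimators=50, random_state=0)
        model.fit(known[["reading"]], known["level"].tolist())
        df.loc[missing, "level"] = model.predict(df.loc[missing, ["reading"]])
    return df


cleaned = clean_records(raw)
print(cleaned)
```

Normalising before deduplicating matters here: two rows that differ only in spacing or letter case are recognised as duplicates only once their values have been brought to a common form.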