Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques

Data Analytics (DA) is a technology used to make correct decisions through proper analysis and prediction. Data cleaning is the most important and essential process in DA. Inappropriate data may lead to poor analysis and thus yield unacceptable conclusions. It is found from the literature review tha...

Bibliographic Details
Main Author:	Mohd Zebaral Hoque, Jesmeen
Format:	Thesis
Published:	2019
Subjects:	Q300-390 Cybernetics

id	my-mmu-ep.7737
record_format	uketd_dc
spelling	my-mmu-ep.7737 2023-03-06T06:55:37Z Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques 2019-01 Mohd Zebaral Hoque, Jesmeen Q300-390 Cybernetics Data Analytics (DA) is a technology used to make correct decisions through proper analysis and prediction. Data cleaning is the most important and essential process in DA. Inappropriate data may lead to poor analysis and thus yield unacceptable conclusions. The literature review shows that incompleteness (missing values), duplication, inconsistency and inaccuracy are the four common dimensions of dirty data. Of these four, the first three are considered in this research. Hence, the objective was to design and develop an approach for the cleaning phase in DA that addresses these three dimensions. Python's Pandas and NumPy libraries are used to resolve duplicate and inconsistent data. A new architecture for predicting missing data in a dataset was developed, which includes data sampling and feature selection. Because the volume of data may be huge, data sampling uses a divide-and-conquer strategy to find the minimum consistent subset. The sampled data with the selected features is then used to train four prediction models until their classification accuracy becomes stable. These four prediction models were selected from eight well-known classification algorithms used by other researchers: Logistic Regression, Linear Regression (LR), Linear SVM, AdaBoost Classifier, K-Nearest Neighbour, SGDClassifier, Gradient Boosting and Random Forest (RF). Their performance was compared in terms of accuracy, ROC percentages and the time taken for training and testing. 2019-01 Thesis http://shdl.mmu.edu.my/7737/ http://erep.mmu.edu.my/ masters Multimedia University Faculty of Engineering & Technology EREP ID: 3393
institution	Multimedia University
collection	MMU Institutional Repository
topic	Q300-390 Cybernetics
spellingShingle	Q300-390 Cybernetics Mohd Zebaral Hoque, Jesmeen Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques
description	Data Analytics (DA) is a technology used to make correct decisions through proper analysis and prediction. Data cleaning is the most important and essential process in DA. Inappropriate data may lead to poor analysis and thus yield unacceptable conclusions. The literature review shows that incompleteness (missing values), duplication, inconsistency and inaccuracy are the four common dimensions of dirty data. Of these four, the first three are considered in this research. Hence, the objective was to design and develop an approach for the cleaning phase in DA that addresses these three dimensions. Python's Pandas and NumPy libraries are used to resolve duplicate and inconsistent data. A new architecture for predicting missing data in a dataset was developed, which includes data sampling and feature selection. Because the volume of data may be huge, data sampling uses a divide-and-conquer strategy to find the minimum consistent subset. The sampled data with the selected features is then used to train four prediction models until their classification accuracy becomes stable. These four prediction models were selected from eight well-known classification algorithms used by other researchers: Logistic Regression, Linear Regression (LR), Linear SVM, AdaBoost Classifier, K-Nearest Neighbour, SGDClassifier, Gradient Boosting and Random Forest (RF). Their performance was compared in terms of accuracy, ROC percentages and the time taken for training and testing.
format	Thesis
qualification_level	Master's degree
author	Mohd Zebaral Hoque, Jesmeen
author_facet	Mohd Zebaral Hoque, Jesmeen
author_sort	Mohd Zebaral Hoque, Jesmeen
title	Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques
title_short	Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques
title_full	Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques
title_fullStr	Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques
title_full_unstemmed	Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques
title_sort	automatic dirty data cleaning approach for data analytics using machine learning techniques
granting_institution	Multimedia University
granting_department	Faculty of Engineering & Technology
publishDate	2019
_version_	1776101415127613440
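
The cleaning steps named in the description (resolving inconsistent and duplicate records with Pandas and NumPy, then predicting missing values with a trained classifier) can be illustrated with a minimal sketch. This is not the thesis's implementation: the toy data, the column names, and the helper functions clean and knn_impute are all hypothetical, and a small hand-written K-Nearest Neighbour vote stands in for the four selected prediction models, with no data sampling or feature selection.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names and values are illustrative only.
raw = pd.DataFrame({
    "height_cm": [150.0, 152.0, 180.0, 182.0, 151.0, 181.0, 150.0],
    "weight_kg": [50.0, 52.0, 80.0, 83.0, 51.0, 82.0, 50.0],
    "size": [" small", "Small", "LARGE", "large ", None, None, "small"],
})


def clean(df):
    """Normalise inconsistent labels, then drop exact duplicate rows."""
    out = df.copy()
    # Inconsistency: unify stray whitespace and letter case in the label.
    out["size"] = out["size"].str.strip().str.lower()
    # Duplicates: rows that are identical after normalisation.
    return out.drop_duplicates().reset_index(drop=True)


def knn_impute(df, target, features, k=3):
    """Fill missing `target` values by majority vote of the k nearest rows.

    Assumption: a plain K-Nearest Neighbour vote on standardised features,
    used here only as a stand-in for a trained prediction model.
    """
    out = df.copy()
    known = out[out[target].notna()]
    missing = out[out[target].isna()]

    # Standardise features so no single column dominates the distance.
    X = known[features].to_numpy(dtype=float)
    mean, std = X.mean(axis=0), X.std(axis=0)
    std[std == 0] = 1.0
    X_scaled = (X - mean) / std
    labels = known[target].to_numpy()

    for idx, row in missing.iterrows():
        query = (row[features].to_numpy(dtype=float) - mean) / std
        dist = np.linalg.norm(X_scaled - query, axis=1)
        nearest = labels[np.argsort(dist)[:k]]
        # Majority label among the k nearest complete rows.
        out.loc[idx, target] = pd.Series(nearest).mode().iloc[0]
    return out


cleaned = clean(raw)
imputed = knn_impute(cleaned, "size", ["height_cm", "weight_kg"])
print(imputed)
```

In this sketch the label column is normalised before duplicates are dropped, so that rows differing only in case or whitespace are treated as the same record. Whether the thesis applies the steps in that order is not stated in the abstract.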
