Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques
Data Analytics (DA) is a technology used to make correct decisions through proper analysis and prediction. Data cleaning is the most important and essential process in DA. Inappropriate data may lead to poor analysis and thus yield unacceptable conclusions. It is found from the literature review tha...
Main Author: | Mohd Zebaral Hoque, Jesmeen |
---|---|
Format: | Thesis |
Published: | 2019 |
Subjects: | Q300-390 Cybernetics |
id |
my-mmu-ep.7737 |
---|---|
record_format |
uketd_dc |
spelling |
my-mmu-ep.7737 2023-03-06T06:55:37Z Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques 2019-01 Mohd Zebaral Hoque, Jesmeen Q300-390 Cybernetics Data Analytics (DA) is a technology used to make correct decisions through proper analysis and prediction. Data cleaning is the most important and essential process in DA. Inappropriate data may lead to poor analysis and thus yield unacceptable conclusions. The literature review shows that incompleteness (missing values), duplication, inconsistency and inaccuracy are the four common dimensions of dirty data. Of these four, the first three are considered in this research. Hence, the objective was to design and develop an approach for the cleaning phase in DA that addresses these three dimensions. Python’s ‘Pandas’ and ‘NumPy’ libraries are used to resolve duplicate and inconsistent data. A new architecture to predict missing data in a dataset was developed, which includes data sampling and feature selection. Because the volume of data may be huge, data sampling uses a divide-and-conquer strategy to find the minimum consistent subset. The sampled data with the selected features is then used to train four prediction models until their classification accuracy becomes stable. These four models were selected from eight well-known classification algorithms used by other researchers: Logistic Regression, Linear Regression (LR), Linear SVM, AdaBoost Classifier, K-Nearest Neighbour, SGDClassifier, Gradient Boosting and Random Forest (RF). Their performance was compared on accuracy, ROC percentages and time taken for training and testing. 2019-01 Thesis http://shdl.mmu.edu.my/7737/ http://erep.mmu.edu.my/ masters Multimedia University Faculty of Engineering & Technology EREP ID: 3393 |
institution |
Multimedia University |
collection |
MMU Institutional Repository |
topic |
Q300-390 Cybernetics |
spellingShingle |
Q300-390 Cybernetics Mohd Zebaral Hoque, Jesmeen Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques |
description |
Data Analytics (DA) is a technology used to make correct decisions through proper analysis and prediction. Data cleaning is the most important and essential process in DA. Inappropriate data may lead to poor analysis and thus yield unacceptable conclusions. The literature review shows that incompleteness (missing values), duplication, inconsistency and inaccuracy are the four common dimensions of dirty data. Of these four, the first three are considered in this research. Hence, the objective was to design and develop an approach for the cleaning phase in DA that addresses these three dimensions. Python’s ‘Pandas’ and ‘NumPy’ libraries are used to resolve duplicate and inconsistent data. A new architecture to predict missing data in a dataset was developed, which includes data sampling and feature selection. Because the volume of data may be huge, data sampling uses a divide-and-conquer strategy to find the minimum consistent subset. The sampled data with the selected features is then used to train four prediction models until their classification accuracy becomes stable. These four models were selected from eight well-known classification algorithms used by other researchers: Logistic Regression, Linear Regression (LR), Linear SVM, AdaBoost Classifier, K-Nearest Neighbour, SGDClassifier, Gradient Boosting and Random Forest (RF). Their performance was compared on accuracy, ROC percentages and time taken for training and testing. |
format |
Thesis |
qualification_level |
Master's degree |
author |
Mohd Zebaral Hoque, Jesmeen |
title |
Automatic Dirty Data Cleaning Approach For Data Analytics Using Machine Learning Techniques |
title_sort |
automatic dirty data cleaning approach for data analytics using machine learning techniques |
granting_institution |
Multimedia University |
granting_department |
Faculty of Engineering & Technology |
publishDate |
2019 |
_version_ |
1776101415127613440 |
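
The cleaning steps named in the description (resolving duplicate and inconsistent entries with Pandas, then predicting missing values with a classifier) might look roughly like the sketch below. This is an illustration only, not the thesis's implementation: the function names, column names and toy data are hypothetical, scikit-learn is assumed to be installed, and Random Forest stands in for whichever of the compared models was finally chosen. The data sampling and feature selection stages are omitted.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def clean_duplicates_and_inconsistency(df, text_cols):
    """Normalise the given text columns, then drop exact duplicate rows."""
    out = df.copy()
    for col in text_cols:
        # Trim whitespace and lower-case so 'KL ' and 'kl' compare equal.
        out[col] = out[col].str.strip().str.lower()
    return out.drop_duplicates().reset_index(drop=True)


def impute_with_classifier(df, target, features):
    """Fill missing values of `target` using a classifier trained on complete rows.

    Assumes the feature columns are numeric and contain no missing values.
    """
    out = df.copy()
    known = out[target].notna()
    if known.all() or not known.any():
        return out  # nothing to predict, or nothing to learn from
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(out.loc[known, features], out.loc[known, target])
    out.loc[~known, target] = model.predict(out.loc[~known, features])
    return out


# Hypothetical toy data: inconsistent spellings, duplicates, one missing label.
raw = pd.DataFrame({
    "city": ["KL ", "kl", "Melaka", "melaka", "KL", "Melaka", "kl"],
    "x": [1.0, 1.0, 9.0, 9.0, 1.2, 8.8, 1.1],
    "label": ["a", "a", "b", "b", "a", "b", None],
})

cleaned = clean_duplicates_and_inconsistency(raw, text_cols=["city"])
imputed = impute_with_classifier(cleaned, target="label", features=["x"])
print(imputed)
```

Normalising before deduplicating matters here: rows that differ only in case or trailing whitespace become identical and are then removed by `drop_duplicates`.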