Domain data-driven veracity engine

The exponentially increasing volume of data brings with it dirty data such as duplicates and noise. Dirty data arises from human error and from the formats of data generated by machine logging. Its presence degrades machine learning models' performance in subsequent stages of analysis and hinders the insights and value that can be derived from the data. Hence, methods to detect and remove dirty data are required to unlock those insights. In this research, a framework is proposed that applies several natural language processing methods to create a novel data veracity engine. The research's first contribution is the semi-automated conversion of unstructured datasets into structured datasets. Second, the system detects and removes both exact and near duplicates from the dataset. The natural language processing techniques proposed in the data veracity engine combine Latent Semantic Analysis and Simhash. Latent Semantic Analysis is used to discover relations between rows of the dataset via cosine similarity and to group similar rows into clusters so that different types of text processing can be applied on a per-cluster basis. Simhash is a locality-sensitive hashing algorithm that uses cosine distance to find near-duplicate rows so they can be removed from the dataset with low overhead. For the first part of the research, a semi-structured dataset of actual telco equipment log data was used. For the second part, telco equipment datasets of varying sizes, augmented with artificially generated noise, were used. The results showed that both types of dirty data, semi-structured data and near duplicates, were removed effectively from the real-world dataset by the proposed data veracity engine. It is hoped that this research can contribute to the development of more robust cleaning algorithms.
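The near-duplicate detection step described in the abstract can be sketched as follows. This is an illustrative sketch, not the thesis's actual implementation: the token-level MD5 hashing, the 64-bit fingerprint width, and the 3-bit Hamming-distance threshold are all assumptions chosen for the example. Simhash fingerprints are built so that the Hamming distance between two fingerprints approximates the angular (cosine) distance between the underlying texts, which is what makes the low-overhead comparison mentioned in the abstract possible.

```python
import hashlib

def simhash(text, bits=64):
    # Token-level Simhash: each token's hash votes +1/-1 on every bit
    # position; the sign of the accumulated vote sets the fingerprint bit.
    votes = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    # Number of differing bits; small distance ~ similar texts.
    return bin(a ^ b).count("1")

def near_duplicates(rows, threshold=3):
    # Return indices of rows whose fingerprint lies within `threshold`
    # bits of an earlier row's fingerprint (candidates for removal).
    kept, dupes = [], []
    for i, row in enumerate(rows):
        fp = simhash(row)
        if any(hamming_distance(fp, k) <= threshold for k in kept):
            dupes.append(i)
        else:
            kept.append(fp)
    return dupes
```

In a full pipeline along the lines the abstract describes, rows would first be grouped by Latent Semantic Analysis (SVD over a term matrix plus cosine similarity) so that fingerprinting and thresholding can be tuned per cluster rather than globally.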

Bibliographic Details
Main Author: Tan, De Zhern
Format: Thesis
Published: 2020
Subjects:
id my-mmu-ep.12984
record_format uketd_dc
spelling my-mmu-ep.129842024-09-26T03:32:47Z Domain data-driven veracity engine 2020-06 Tan, De Zhern QA75-76.95 Calculating machines The exponentially increasing volume of data brings with it dirty data such as duplicates and noise. Dirty data arises from human error and from the formats of data generated by machine logging. Its presence degrades machine learning models' performance in subsequent stages of analysis and hinders the insights and value that can be derived from the data. Hence, methods to detect and remove dirty data are required to unlock those insights. In this research, a framework is proposed that applies several natural language processing methods to create a novel data veracity engine. The research's first contribution is the semi-automated conversion of unstructured datasets into structured datasets. Second, the system detects and removes both exact and near duplicates from the dataset. The natural language processing techniques proposed in the data veracity engine combine Latent Semantic Analysis and Simhash. Latent Semantic Analysis is used to discover relations between rows of the dataset via cosine similarity and to group similar rows into clusters so that different types of text processing can be applied on a per-cluster basis. Simhash is a locality-sensitive hashing algorithm that uses cosine distance to find near-duplicate rows so they can be removed from the dataset with low overhead. For the first part of the research, a semi-structured dataset of actual telco equipment log data was used. For the second part, telco equipment datasets of varying sizes, augmented with artificially generated noise, were used. The results showed that both types of dirty data, semi-structured data and near duplicates, were removed effectively from the real-world dataset by the proposed data veracity engine. It is hoped that this research can contribute to the development of more robust cleaning algorithms. 2020-06 Thesis https://shdl.mmu.edu.my/12984/ http://erep.mmu.edu.my/ masters Multimedia University Faculty of Computing and Informatics (FCI) EREP ID: 10289
institution Multimedia University
collection MMU Institutional Repository
topic QA75-76.95 Calculating machines
spellingShingle QA75-76.95 Calculating machines
Tan, De Zhern
Domain data-driven veracity engine
description The exponentially increasing volume of data brings with it dirty data such as duplicates and noise. Dirty data arises from human error and from the formats of data generated by machine logging. Its presence degrades machine learning models' performance in subsequent stages of analysis and hinders the insights and value that can be derived from the data. Hence, methods to detect and remove dirty data are required to unlock those insights. In this research, a framework is proposed that applies several natural language processing methods to create a novel data veracity engine. The research's first contribution is the semi-automated conversion of unstructured datasets into structured datasets. Second, the system detects and removes both exact and near duplicates from the dataset. The natural language processing techniques proposed in the data veracity engine combine Latent Semantic Analysis and Simhash. Latent Semantic Analysis is used to discover relations between rows of the dataset via cosine similarity and to group similar rows into clusters so that different types of text processing can be applied on a per-cluster basis. Simhash is a locality-sensitive hashing algorithm that uses cosine distance to find near-duplicate rows so they can be removed from the dataset with low overhead. For the first part of the research, a semi-structured dataset of actual telco equipment log data was used. For the second part, telco equipment datasets of varying sizes, augmented with artificially generated noise, were used. The results showed that both types of dirty data, semi-structured data and near duplicates, were removed effectively from the real-world dataset by the proposed data veracity engine. It is hoped that this research can contribute to the development of more robust cleaning algorithms.
format Thesis
qualification_level Master's degree
author Tan, De Zhern
author_facet Tan, De Zhern
author_sort Tan, De Zhern
title Domain data-driven veracity engine
title_short Domain data-driven veracity engine
title_full Domain data-driven veracity engine
title_fullStr Domain data-driven veracity engine
title_full_unstemmed Domain data-driven veracity engine
title_sort domain data-driven veracity engine
granting_institution Multimedia University
granting_department Faculty of Computing and Informatics (FCI)
publishDate 2020
_version_ 1811768020706525184