Domain data-driven veracity engine

The exponentially increasing volume of data brings with it dirty data such as duplicates and noise. Dirty data arises from human error and from the formats of data generated by machine logging. Its presence degrades machine learning models' performance in subsequent stages of analysis and hinders the insights and value that can be derived from the data. Hence, methods to detect and remove dirty data are required to unlock those insights. In this research, a framework is proposed that applies several natural language processing methods to create a novel data veracity engine. The research's first contribution is the semi-automated conversion of unstructured datasets into structured datasets. Second, the system detects and removes both exact and near duplicates from the dataset. The natural language processing techniques proposed in the data veracity engine combine Latent Semantic Analysis and Simhash. Latent Semantic Analysis is used to discover relations between rows of the dataset via cosine similarity and to group similar rows into clusters so that different types of text processing can be applied on a per-cluster basis. Simhash is a locality-sensitive hashing algorithm that uses cosine distance to find near-duplicate rows so they can be removed from the dataset with low overhead. For the first part of the research, a semi-structured dataset of actual telco equipment log data was used. For the second part, telco equipment datasets of varying sizes, augmented with artificially generated noise, were used. The results showed that both types of dirty data, semi-structured data and near duplicates, were removed effectively from the real-world dataset by the proposed data veracity engine. It is hoped that this research can contribute to the development of more robust cleaning algorithms.
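The near-duplicate detection step described in the abstract can be sketched as follows. This is an illustrative sketch, not the thesis's actual implementation: the token-level MD5 hashing, the 64-bit fingerprint width, and the 3-bit Hamming-distance threshold are all assumptions chosen for the example. Simhash fingerprints are built so that the Hamming distance between two fingerprints approximates the angular (cosine) distance between the underlying texts, which is what makes the low-overhead comparison mentioned in the abstract possible.

```python
import hashlib

def simhash(text, bits=64):
    # Token-level Simhash: each token's hash votes +1/-1 on every bit
    # position; the sign of the accumulated vote sets the fingerprint bit.
    votes = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    # Number of differing bits; small distance ~ similar texts.
    return bin(a ^ b).count("1")

def near_duplicates(rows, threshold=3):
    # Return indices of rows whose fingerprint lies within `threshold`
    # bits of an earlier row's fingerprint (candidates for removal).
    kept, dupes = [], []
    for i, row in enumerate(rows):
        fp = simhash(row)
        if any(hamming_distance(fp, k) <= threshold for k in kept):
            dupes.append(i)
        else:
            kept.append(fp)
    return dupes
```

In a full pipeline along the lines the abstract describes, rows would first be grouped by Latent Semantic Analysis (SVD over a term matrix plus cosine similarity) so that fingerprinting and thresholding can be tuned per cluster rather than globally.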

Bibliographic Details
Main Author: Tan, De Zhern
Format: Thesis
Published: 2020
Subjects:
id my-mmu-ep.12984
record_format uketd_dc
spelling my-mmu-ep.129842024-09-26T03:32:47Z Domain data-driven veracity engine 2020-06 Tan, De Zhern QA75-76.95 Calculating machines The exponentially increasing volume of data brings with it dirty data such as duplicates and noise. Dirty data arises from human error and from the formats of data generated by machine logging. Its presence degrades machine learning models' performance in subsequent stages of analysis and hinders the insights and value that can be derived from the data. Hence, methods to detect and remove dirty data are required to unlock those insights. In this research, a framework is proposed that applies several natural language processing methods to create a novel data veracity engine. The research's first contribution is the semi-automated conversion of unstructured datasets into structured datasets. Second, the system detects and removes both exact and near duplicates from the dataset. The natural language processing techniques proposed in the data veracity engine combine Latent Semantic Analysis and Simhash. Latent Semantic Analysis is used to discover relations between rows of the dataset via cosine similarity and to group similar rows into clusters so that different types of text processing can be applied on a per-cluster basis. Simhash is a locality-sensitive hashing algorithm that uses cosine distance to find near-duplicate rows so they can be removed from the dataset with low overhead. For the first part of the research, a semi-structured dataset of actual telco equipment log data was used. For the second part, telco equipment datasets of varying sizes, augmented with artificially generated noise, were used. The results showed that both types of dirty data, semi-structured data and near duplicates, were removed effectively from the real-world dataset by the proposed data veracity engine. It is hoped that this research can contribute to the development of more robust cleaning algorithms. 2020-06 Thesis https://shdl.mmu.edu.my/12984/ http://erep.mmu.edu.my/ masters Multimedia University Faculty of Computing and Informatics (FCI) EREP ID: 10289
institution Multimedia University
collection MMU Institutional Repository
topic QA75-76.95 Calculating machines
spellingShingle QA75-76.95 Calculating machines
Tan, De Zhern
Domain data-driven veracity engine
description The exponentially increasing volume of data brings with it dirty data such as duplicates and noise. Dirty data arises from human error and from the formats of data generated by machine logging. Its presence degrades machine learning models' performance in subsequent stages of analysis and hinders the insights and value that can be derived from the data. Hence, methods to detect and remove dirty data are required to unlock those insights. In this research, a framework is proposed that applies several natural language processing methods to create a novel data veracity engine. The research's first contribution is the semi-automated conversion of unstructured datasets into structured datasets. Second, the system detects and removes both exact and near duplicates from the dataset. The natural language processing techniques proposed in the data veracity engine combine Latent Semantic Analysis and Simhash. Latent Semantic Analysis is used to discover relations between rows of the dataset via cosine similarity and to group similar rows into clusters so that different types of text processing can be applied on a per-cluster basis. Simhash is a locality-sensitive hashing algorithm that uses cosine distance to find near-duplicate rows so they can be removed from the dataset with low overhead. For the first part of the research, a semi-structured dataset of actual telco equipment log data was used. For the second part, telco equipment datasets of varying sizes, augmented with artificially generated noise, were used. The results showed that both types of dirty data, semi-structured data and near duplicates, were removed effectively from the real-world dataset by the proposed data veracity engine. It is hoped that this research can contribute to the development of more robust cleaning algorithms.
format Thesis
qualification_level Master's degree
author Tan, De Zhern
author_facet Tan, De Zhern
author_sort Tan, De Zhern
title Domain data-driven veracity engine
title_short Domain data-driven veracity engine
title_full Domain data-driven veracity engine
title_fullStr Domain data-driven veracity engine
title_full_unstemmed Domain data-driven veracity engine
title_sort domain data-driven veracity engine
granting_institution Multimedia University
granting_department Faculty of Computing and Informatics (FCI)
publishDate 2020
_version_ 1811768020706525184