Domain data-driven veracity engine

Bibliographic Details
Main Author: Tan, De Zhern
Format: Thesis
Published: 2020
Description
Summary: The exponentially growing volume of data brings with it dirty data such as duplicates and noise. Dirty data arises from human error and from the formats in which machines log data. It degrades the performance of machine learning models in subsequent stages of analysis and limits the insights and value that can be derived from the data, so methods to detect and remove it are required. This research proposes a framework that combines several natural language processing methods into a novel data veracity engine. Its contributions are twofold. First, it converts an unstructured dataset into a structured dataset in a semi-automated manner. Second, it detects and removes both exact and near duplicates from the dataset. The engine combines Latent Semantic Analysis and Simhash. Latent Semantic Analysis is used to discover relations between rows of a dataset using cosine similarity and to group similar rows into clusters, so that a different type of text processing can be applied to each cluster. Simhash is a locality-sensitive hashing algorithm, based on cosine distance, that finds near-duplicate rows so that they can be removed from the dataset with low overhead. The first part of the research used a semi-structured dataset of actual telco equipment log data. The second part used telco equipment datasets of varying sizes with artificially generated noise added. The results showed that the proposed data veracity engine effectively removed both types of dirty data, semi-structured data and near duplicates, from the real-world dataset. It is hoped that this research can contribute to the development of more robust cleaning algorithms.
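
The near-duplicate step described in the summary can be illustrated with a minimal Simhash sketch. This is not the thesis's implementation: the tokenizer, the use of MD5 as the per-token hash, the 64-bit fingerprint width, the Hamming-distance threshold of 3, and the function names (tokenize, simhash, hamming, drop_near_duplicates) are all assumptions chosen for illustration. Rows whose fingerprints differ in only a few bits are treated as near duplicates, and only the first such row is kept.

```python
import hashlib
import re


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text, bits=64):
    # Each token votes +1 or -1 on every bit position of the fingerprint.
    votes = [0] * bits
    for token in tokenize(text):
        # MD5 is an assumed choice; any well-mixed hash would do here.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its positive votes dominate.
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    # Number of bit positions in which two fingerprints differ.
    return bin(a ^ b).count("1")


def drop_near_duplicates(rows, max_distance=3):
    # Keep a row only if its fingerprint is farther than max_distance
    # (an assumed threshold) from every fingerprint already kept.
    kept, fingerprints = [], []
    for row in rows:
        fp = simhash(row)
        if all(hamming(fp, seen) > max_distance for seen in fingerprints):
            kept.append(row)
            fingerprints.append(fp)
    return kept
```

The linear scan over kept fingerprints is used only to keep the sketch short. A low-overhead deployment would more likely index fingerprints by bit blocks so that candidates can be found without comparing every pair.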