Duplicates detection approach within incomplete data sets using dynamic sorting key and hot deck compensation method

Bibliographic Details
Main Author: Abdulrahim, Abdulrazzak Ali Mohamed
Format: Thesis
Language: English
Published: 2022
Online Access: http://eprints.utem.edu.my/id/eprint/27720/1/Duplicates%20detection%20approach%20within%20incomplete%20data%20sets%20using%20dynamic%20sorting%20key%20and%20hot%20deck%20compensation%20method.pdf
http://eprints.utem.edu.my/id/eprint/27720/2/Duplicates%20detection%20approach%20within%20incomplete%20data%20sets%20using%20dynamic%20sorting%20key%20and%20hot%20deck%20compensation%20method.pdf
Description
Summary: Duplicate records are a common problem in data sets, especially in high-volume databases. The accuracy of duplicate detection determines the efficiency of the duplicate removal process, yet detection becomes more challenging when records contain missing values: during the clustering and matching process, missing values can cause similar records to be placed in the wrong group, leaving duplicates undetected. Keeping a database free of duplicates is crucial for most use cases, as duplicates cause false negatives and false positives when queries are matched against the database. These data quality issues have serious consequences: in the medical field, a patient may receive a drug overdose, potentially causing loss of life; in parcel delivery, a parcel may be delivered to the wrong address. While research in duplicate detection is well established and covers both efficiency and effectiveness, this thesis addresses the two together. We propose a novel method that improves the preprocessing stage to overcome the challenge that missing values pose to duplicate detection, and that applies to data sets even when prior labeling is unavailable.

This thesis proposes the Duplicate Detection within the Incomplete Data set (DDID) method to deal with missing values within a data set. DDID is based on a set of procedures for addressing missing data: it adopts a generic approach based on high-rank attributes (high uniqueness, few missing values) and then compensates for the missing values in those attributes using the hot deck compensation method. Dynamic sort keys and matching strings of specified lengths are created from the high-rank attributes. These procedures aim to validate the expected results across successive detection stages and to achieve a high matching rate for duplicate records despite the presence of missing values.

The experiments used four benchmark data sets (Restaurant, CDDB, MusicBrainz (A), and MusicBrainz (B)). Missing values were hypothetically added to the key attributes, at 4% for the Restaurant data set and 1.5% for the CDDB data set, following an arbitrary pattern so that both complete and incomplete data sets could be simulated. The DuDe toolkit was used as a baseline for relative comparison. Duplicate detection measures were used to evaluate the accuracy of DDID, while performance improvement (PI) and statistical analysis were used to evaluate its elapsed time. The results showed that the procedures adopted in DDID significantly improved duplicate detection accuracy compared to DuDe: in the first implementation stage, the improvement reached 18% on the Restaurant data set, 16% on CDDB, and 19% and 4% on MusicBrainz (A) and MusicBrainz (B), respectively. In the second implementation stage, DDID likewise improved accuracy over DuDe, reaching 24%, 18%, 30%, and 3% for the Restaurant, CDDB, MusicBrainz (A), and MusicBrainz (B) data sets, respectively.
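The thesis text itself is not reproduced in this record, so the sketch below only illustrates the kind of preprocessing the summary describes: ranking attributes by uniqueness and missing-value ratio, compensating gaps in the high-rank attributes with a hot deck donor, and building a dynamic sort key from attribute prefixes. The function names, scoring formula, donor-selection rule, and prefix length are all illustrative assumptions, not the author's actual DDID implementation.

```python
import pandas as pd

def rank_attributes(df: pd.DataFrame) -> list[str]:
    """Order attributes so 'high-rank' ones (high uniqueness, few
    missing values) come first. The scoring formula is an assumption:
    uniqueness ratio minus missing-value ratio."""
    scores = {
        col: df[col].nunique(dropna=True) / len(df) - df[col].isna().mean()
        for col in df.columns
    }
    return sorted(scores, key=scores.get, reverse=True)

def hot_deck_compensate(df: pd.DataFrame, attr: str, deck_attr: str) -> pd.DataFrame:
    """Fill missing values of `attr` by borrowing the value from a
    'donor' record that agrees on `deck_attr` (a simple hot deck;
    the thesis's donor-selection rule may differ)."""
    out = df.copy()
    donors = out.dropna(subset=[attr]).drop_duplicates(subset=[deck_attr])
    donor_map = dict(zip(donors[deck_attr], donors[attr]))
    out[attr] = out[attr].fillna(out[deck_attr].map(donor_map))
    return out

def dynamic_sort_key(row: pd.Series, ranked_attrs: list[str], length: int = 4) -> str:
    """Concatenate fixed-length prefixes of the top-ranked attributes
    into a sorting key; the prefix length is a tunable assumption."""
    return "".join(str(row[a])[:length].lower() for a in ranked_attrs)
```

In this sketch the donor is simply the first complete record that agrees on another attribute, and keys are built after compensation so that prefixes are not taken from missing values; the thesis may choose donors and key lengths differently.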
The analysis showed that, even with incomplete data sets, DDID offered better accuracy and faster duplicate detection than DuDe. The adopted procedures also mitigated the window-size limitation of the sorted neighbourhood method, maintaining stable duplicate detection accuracy while improving the performance of the blocking methods tested in this study. The results of this thesis not only expand the body of knowledge in data management, specifically in data quality and the detection of duplicates within incomplete data sets, but can also contribute to industry-scale duplicate detection.
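For context on the window-size limitation mentioned above, here is a minimal, self-contained sketch of the sorted neighbourhood method: records are sorted by a key and each record is compared only with its neighbours inside a fixed-size sliding window, so a key built from unreliable, missing-value-prone attributes can push true duplicates outside the window. The key construction, default window size, and token-Jaccard similarity below are placeholder assumptions, not the exact setup evaluated in the thesis.

```python
import pandas as pd

def sorted_neighbourhood(df: pd.DataFrame, key_attrs: list[str],
                         window: int = 5, threshold: float = 0.8,
                         prefix_len: int = 4) -> list[tuple]:
    """Sorted neighbourhood: build a sort key from attribute prefixes,
    sort the records, then compare each record only with the records
    inside a sliding window over the sorted order."""
    key = df[key_attrs].astype(str).apply(
        lambda r: "".join(v[:prefix_len].lower() for v in r), axis=1)
    order = key.sort_values().index.tolist()
    pairs = []
    for i, rid in enumerate(order):
        for other in order[i + 1 : i + window]:
            # Placeholder similarity: token Jaccard over the key attributes.
            a = set(" ".join(map(str, df.loc[rid, key_attrs])).lower().split())
            b = set(" ".join(map(str, df.loc[other, key_attrs])).lower().split())
            if a | b and len(a & b) / len(a | b) >= threshold:
                pairs.append((rid, other))
    return pairs

# Hypothetical usage on two made-up columns:
# candidate_pairs = sorted_neighbourhood(df, ["name", "city"], window=7)
```

The fixed window keeps the number of comparisons roughly linear in the number of records, which is why stabilising accuracy without enlarging the window, as the summary reports for DDID, matters for large data sets.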