A Threshold-Based Combination of String and Semantic Similarity Measures for Record Linkage

Because integrated data carry richer information, integrating different data sources is a key step in most data warehousing and data mining projects. One of the principal challenges in integrating databases is duplication. In other words, in different databases, one entity may be available in differ...

Bibliographic Details
Main Author: Ektefa, Mohammadreza
Format: Thesis
Published: 2011
Subjects: Semantic computing; Semantic integration (Computer systems); Data warehousing
id my-upm-ir.19638
record_format uketd_dc
spelling my-upm-ir.19638 2014-06-30T07:17:08Z A Threshold-Based Combination of String and Semantic Similarity Measures for Record Linkage 2011-06 Ektefa, Mohammadreza Semantic computing Semantic integration (Computer systems) Data warehousing 2011-06 Thesis http://psasir.upm.edu.my/id/eprint/19638/ masters Universiti Putra Malaysia Faculty of Computer Science and Information Technology
institution Universiti Putra Malaysia
collection PSAS Institutional Repository
topic Semantic computing
Semantic integration (Computer systems)
Data warehousing
description Because integrated data carry richer information, integrating different data sources is a key step in most data warehousing and data mining projects. One of the principal challenges in integrating databases is duplication: the same entity may appear in different formats in different databases, so combining those databases produces duplicate records. Record linkage is a technique for detecting and matching the duplicate records generated during data integration. A variety of record linkage models with different steps have been developed to detect such duplicates, and string similarity measures are widely used in these studies to compare record pairs. However, considering the semantic relatedness between two records, in addition to their string similarity, can also help detect duplicates, and existing record linkage models do not take it into account. To determine how much semantic similarity improves the effectiveness of duplicate detection, this study proposes a similarity measure that combines string and semantic similarity measures. The combination uses a threshold-based method that considers the semantic similarity of each field of the dataset; the threshold determines the influence of semantic similarity in the final combination algorithm. The combined measure is evaluated on two real-world datasets, Restaurant and Cora, and its effectiveness is measured with several standard evaluation metrics. The experimental results indicate that the combined measure outperforms the string and semantic similarity measures used individually, with an F-measure of 99.1% on the Restaurant dataset and 88.3% on the Cora dataset. Therefore, semantic similarity should be taken into account in addition to string similarity in order to detect duplicate records more effectively in record linkage.
format Thesis
qualification_level Master's degree
author Ektefa, Mohammadreza
title A Threshold-Based Combination of String and Semantic Similarity Measures for Record Linkage
granting_institution Universiti Putra Malaysia
granting_department Faculty of Computer Science and Information Technology
publishDate 2011
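
The abstract describes the method only at a high level, so the following is an illustrative sketch rather than the thesis's actual algorithm. It assumes one plausible threshold rule (a field's semantic score is used only when it reaches that field's threshold), uses the `difflib` ratio as a stand-in string measure, and replaces a real semantic resource with a hypothetical toy lexicon (`RELATED_TERMS`). All function names are invented for illustration. The F-measure is the standard harmonic mean of precision and recall.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; difflib ratio is a stand-in measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical toy lexicon: pairs of terms treated as semantically equivalent.
RELATED_TERMS = {
    frozenset({"bbq", "barbecue"}),
    frozenset({"cafe", "coffee shop"}),
}


def semantic_similarity(a: str, b: str) -> float:
    """Toy semantic measure: 1.0 for identical or listed-as-related terms, else 0.0."""
    a, b = a.lower(), b.lower()
    if a == b or frozenset({a, b}) in RELATED_TERMS:
        return 1.0
    return 0.0


def combined_field_similarity(a: str, b: str, threshold: float) -> float:
    """Assumed rule: the semantic score counts only if it reaches the field threshold."""
    string_score = string_similarity(a, b)
    semantic_score = semantic_similarity(a, b)
    if semantic_score >= threshold:
        return max(string_score, semantic_score)
    return string_score


def record_similarity(rec_a: dict, rec_b: dict, thresholds: dict) -> float:
    """Average the per-field combined scores over the fields listed in thresholds."""
    scores = [
        combined_field_similarity(rec_a[field], rec_b[field], t)
        for field, t in thresholds.items()
    ]
    return sum(scores) / len(scores)


def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

In this sketch, a threshold above 1.0 for a field switches the semantic measure off for that field, so the combined score falls back to pure string similarity. Lower thresholds give the semantic score more influence.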