A Threshold-Based Combination of String and Semantic Similarity Measures for Record Linkage

Because integrated data carry richer information, integrating different data sources is a key step in most data warehousing and mining projects. One of the principal challenges in integrating databases is duplication. In other words, in different databases, one entity may be available in differ...

Bibliographic Details
Main Author:	Ektefa, Mohammadreza
Format:	Thesis
Published:	2011
Subjects:	Semantic computing Semantic integration (Computer systems) Data warehousing

id	my-upm-ir.19638
record_format	uketd_dc
spelling	my-upm-ir.19638 2014-06-30T07:17:08Z A Threshold-Based Combination of String and Semantic Similarity Measures for Record Linkage 2011-06 Ektefa, Mohammadreza Semantic computing Semantic integration (Computer systems) Data warehousing Thesis http://psasir.upm.edu.my/id/eprint/19638/ masters Universiti Putra Malaysia Faculty of Computer Science and Information Technology
institution	Universiti Putra Malaysia
collection	PSAS Institutional Repository
topic	Semantic computing Semantic integration (Computer systems) Data warehousing
spellingShingle	Semantic computing Semantic integration (Computer systems) Data warehousing Ektefa, Mohammadreza A Threshold-Based Combination of String and Semantic Similarity Measures for Record Linkage
description	Because integrated data carry richer information, integrating different data sources is a key step in most data warehousing and mining projects. One of the principal challenges in integrating databases is duplication: the same entity may appear in different formats in different databases, so combining those databases produces duplicate records. Record linkage is a technique for detecting and matching the duplicate records generated during data integration. A variety of record linkage models, with differing steps, have been developed to detect such duplicates, and string similarity measures are widely used to compare record pairs. However, considering the semantic relatedness between two records, in addition to their string similarity, can also help detect duplicates, and existing record linkage models do not take it into account. To determine how much semantic similarity improves the effectiveness of duplicate detection, this study proposes a similarity measure that combines string and semantic similarity measures. The combination uses a threshold-based method that considers the semantic similarity of each field of the dataset; the threshold determines the influence of semantic similarity in the final combination algorithm. The combined similarity measure is evaluated on two real-world datasets, Restaurant and Cora, and its effectiveness is measured with several standard evaluation metrics. The experimental results indicate that the combined measure outperforms the string and semantic similarity measures used individually, with an F-measure of 99.1% on the Restaurant dataset and 88.3% on the Cora dataset. Therefore, semantic similarity should be taken into account in addition to string similarity in order to detect duplicate records more effectively in record linkage.
format	Thesis
qualification_level	Master's degree
author	Ektefa, Mohammadreza
author_facet	Ektefa, Mohammadreza
author_sort	Ektefa, Mohammadreza
title	A Threshold-Based Combination of String and Semantic Similarity Measures for Record Linkage
title_short	A Threshold-Based Combination of String and Semantic Similarity Measures for Record Linkage
title_full	A Threshold-Based Combination of String and Semantic Similarity Measures for Record Linkage
title_fullStr	A Threshold-Based Combination of String and Semantic Similarity Measures for Record Linkage
title_full_unstemmed	A Threshold-Based Combination of String and Semantic Similarity Measures for Record Linkage
title_sort	threshold-based combination of string and semantic similarity measures for record linkage
granting_institution	Universiti Putra Malaysia
granting_department	Faculty of Computer Science and Information Technology
publishDate	2011
_version_	1747811430769360896
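
The description field in this record outlines a per-field, threshold-based combination of string and semantic similarity but does not give the rule itself. The sketch below is one possible reading, not the thesis's algorithm. Everything in it is an assumption made for illustration: the toy CONCEPTS table standing in for a real semantic resource, the use of difflib's SequenceMatcher as the string measure, the rule that the semantic score counts only when it reaches the field's threshold, and the simple averaging across fields.

```python
from difflib import SequenceMatcher

# Hypothetical toy concept table standing in for a real semantic resource.
# Tokens that share a concept id are treated as semantically equivalent.
CONCEPTS = {
    "street": "STREET", "st": "STREET",
    "restaurant": "EATERY", "cafe": "EATERY",
}


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed string measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def concepts(text: str) -> set:
    """Map each token to its concept id, or to itself if it has none."""
    tokens = (tok.strip(".,") for tok in text.lower().split())
    return {CONCEPTS.get(tok, tok) for tok in tokens if tok}


def semantic_similarity(a: str, b: str) -> float:
    """Jaccard overlap of concept sets (assumed semantic measure)."""
    ca, cb = concepts(a), concepts(b)
    if not ca or not cb:
        return 0.0
    return len(ca & cb) / len(ca | cb)


def combined_similarity(a: str, b: str, threshold: float) -> float:
    """Assumed rule: the semantic score contributes only when it
    reaches the field's threshold; otherwise the string score stands."""
    s = string_similarity(a, b)
    m = semantic_similarity(a, b)
    return max(s, m) if m >= threshold else s


def record_similarity(r1: dict, r2: dict, thresholds: dict) -> float:
    """Average the combined score over the fields that have a threshold."""
    scores = [combined_similarity(r1[f], r2[f], t) for f, t in thresholds.items()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    a = {"name": "Joe's Cafe", "addr": "12 Main Street"}
    b = {"name": "Joe's Restaurant", "addr": "12 Main St"}
    # Hypothetical per-field thresholds.
    print(record_similarity(a, b, {"name": 0.5, "addr": 0.5}))
```

Under this reading, raising a field's threshold reduces the influence of semantic similarity on that field, and a threshold above 1 disables it entirely, leaving the string score alone.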
