Cross-document coreference resolution model based on neural entity embedding

Natural Language Processing (NLP) enables computers to derive, analyze, and understand the meaning of human language. By modeling the hierarchical structure of language, NLP supports real-world applications such as automatic text summarization, event resolution, relationship extraction, and entity recognition in human-computer interaction. One NLP component, Coreference Resolution (CR), determines whether two noun phrases in natural language refer to the same entity. In this context, an entity can be a real person, organization, place, or other real-world object, and an expression that refers to such an entity is called a mention. When CR is extended to resolve co-referent entities across multiple documents, it becomes the Cross-Document Coreference Resolution (CDCR) task, which requires dedicated techniques for linking the mention chains within individual documents that co-refer to the same entity across different documents. Existing work has two main limitations: CDCR entities referred to by variant mentions are not well identified, and the grouping process that differentiates lexically similar entities is not well addressed. The main objective of this research is to propose a CDCR model that uses neural embeddings of entities and their mentions, built from word representations derived solely from the input documents. The model creates vectors of mentions and entities from these embeddings without relying on any external resources such as knowledge bases. For grouping entities and their mentions, an improved density-based clustering technique combining the DBSCAN and HDBSCAN algorithms was employed. In addition, a prototype named CROCER was designed and developed as a proof of concept to assess the model in an experimental environment. For evaluation, the model was applied to three publicly available datasets from public open-source repositories: the ‘John Smith Corpus’, the ‘WePS-2 Collection’, and ‘Google Wikilinks’. Precision, recall, and F1 score were measured with three established coreference scoring systems: MUC, B3, and CEAF. Based on the findings, the proposed model improved the F1 score on these datasets by almost 15.7%, 1.5%, and 9%, respectively.
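
The abstract describes the pipeline only at a high level: mention vectors are built from the input documents alone (no external knowledge bases), and co-referent mentions are then grouped with density-based clustering (DBSCAN/HDBSCAN). The CROCER prototype itself is not part of this record, so the following is a minimal sketch of that general idea, assuming gensim's Word2Vec for document-only embeddings and the hdbscan package for clustering; the toy corpus, the mention list, and all parameter values are illustrative assumptions, not the author's implementation.

```python
# Minimal sketch: embed mentions from the input documents only, then group them
# with density-based clustering. Library choices and parameters are assumptions.
import numpy as np
from gensim.models import Word2Vec
import hdbscan

# Toy corpus: each document is a list of tokens. A real system would run
# tokenization and mention detection upstream; results on data this small are
# not meaningful -- the point is the shape of the pipeline.
documents = [
    "john smith joined the research lab in boston".split(),
    "smith published a paper on coreference resolution".split(),
    "a different john smith plays guitar in a rock band".split(),
]

# Word embeddings trained from the input documents alone (no external resources).
w2v = Word2Vec(sentences=documents, vector_size=50, window=5,
               min_count=1, epochs=200, seed=1)

def mention_vector(tokens):
    """Represent a mention as the average of its in-vocabulary token vectors."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

# Hypothetical mentions, each tagged with its source document id.
mentions = [
    (0, ["john", "smith"]),
    (1, ["smith"]),
    (2, ["john", "smith"]),
    (2, ["smith"]),
]
X = np.vstack([mention_vector(toks) for _, toks in mentions])

# Density-based grouping of mention vectors; min_cluster_size=2 is illustrative.
labels = hdbscan.HDBSCAN(min_cluster_size=2, metric="euclidean").fit_predict(X)

for (doc_id, toks), label in zip(mentions, labels):
    print(f"doc {doc_id}: {' '.join(toks)} -> entity cluster {label}")
```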

Bibliographic Details
Main Author: Keshtkaran, Aliakbar
Format: Thesis
Language: English
Published: 2021
Subjects: QA Mathematics; QA76 Computer software
Online Access:http://eprints.utm.my/106978/1/AliakbarKeshtkaranPFTIR2021.pdf
id my-utm-ep.106978
record_format uketd_dc
spelling my-utm-ep.106978 2024-08-28T09:23:35Z Cross-document coreference resolution model based on neural entity embedding 2021 Keshtkaran, Aliakbar QA Mathematics QA76 Computer software 2021 Thesis http://eprints.utm.my/106978/ http://eprints.utm.my/106978/1/AliakbarKeshtkaranPFTIR2021.pdf application/pdf en public http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:156348?site_name=Restricted+Repository&query=Cross-document+coreference+resolution+model+based+on+neural+entity+embedding&queryType=vitalDismax phd doctoral Universiti Teknologi Malaysia Razak Faculty of Technology and Informatics Natural Language Processing (NLP). Coreference Resolution (CR). Cross-Document Coreference Resolution (CDCR)
institution Universiti Teknologi Malaysia
collection UTM Institutional Repository
language English
topic QA Mathematics
QA76 Computer software
spellingShingle QA Mathematics
QA76 Computer software
Keshtkaran, Aliakbar
Cross-document coreference resolution model based on neural entity embedding
description Natural Language Processing (NLP) enables computers to derive, analyze, and understand the meaning of human language. By modeling the hierarchical structure of language, NLP supports real-world applications such as automatic text summarization, event resolution, relationship extraction, and entity recognition in human-computer interaction. One NLP component, Coreference Resolution (CR), determines whether two noun phrases in natural language refer to the same entity. In this context, an entity can be a real person, organization, place, or other real-world object, and an expression that refers to such an entity is called a mention. When CR is extended to resolve co-referent entities across multiple documents, it becomes the Cross-Document Coreference Resolution (CDCR) task, which requires dedicated techniques for linking the mention chains within individual documents that co-refer to the same entity across different documents. Existing work has two main limitations: CDCR entities referred to by variant mentions are not well identified, and the grouping process that differentiates lexically similar entities is not well addressed. The main objective of this research is to propose a CDCR model that uses neural embeddings of entities and their mentions, built from word representations derived solely from the input documents. The model creates vectors of mentions and entities from these embeddings without relying on any external resources such as knowledge bases. For grouping entities and their mentions, an improved density-based clustering technique combining the DBSCAN and HDBSCAN algorithms was employed. In addition, a prototype named CROCER was designed and developed as a proof of concept to assess the model in an experimental environment. For evaluation, the model was applied to three publicly available datasets from public open-source repositories: the ‘John Smith Corpus’, the ‘WePS-2 Collection’, and ‘Google Wikilinks’. Precision, recall, and F1 score were measured with three established coreference scoring systems: MUC, B3, and CEAF. Based on the findings, the proposed model improved the F1 score on these datasets by almost 15.7%, 1.5%, and 9%, respectively.
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Keshtkaran, Aliakbar
author_facet Keshtkaran, Aliakbar
author_sort Keshtkaran, Aliakbar
title Cross-document coreference resolution model based on neural entity embedding
title_short Cross-document coreference resolution model based on neural entity embedding
title_full Cross-document coreference resolution model based on neural entity embedding
title_fullStr Cross-document coreference resolution model based on neural entity embedding
title_full_unstemmed Cross-document coreference resolution model based on neural entity embedding
title_sort cross-document coreference resolution model based on neural entity embedding
granting_institution Universiti Teknologi Malaysia
granting_department Razak Faculty of Technology and Informatics
publishDate 2021
url http://eprints.utm.my/106978/1/AliakbarKeshtkaranPFTIR2021.pdf
_version_ 1811772233099509760
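
The abstract reports precision, recall, and F1 under the MUC, B3, and CEAF scoring systems. As a supplement to the record above, here is a short, self-contained sketch of just the B3 (B-cubed) metric on hypothetical gold and system clusterings; it is not the thesis's evaluation code, and MUC and CEAF are not covered.

```python
# Sketch of the B3 (B-cubed) coreference metric named in the abstract.
# Each clustering is a list of disjoint sets of mention ids; the example
# gold/system partitions below are hypothetical.

def b_cubed(gold_clusters, system_clusters):
    """Return (precision, recall, F1) of a system clustering against gold."""
    gold_of = {m: c for c in gold_clusters for m in c}    # mention -> gold cluster
    sys_of = {m: c for c in system_clusters for m in c}   # mention -> system cluster
    mentions = gold_of.keys() & sys_of.keys()

    # Per-mention precision/recall, averaged over all mentions.
    precision = sum(len(gold_of[m] & sys_of[m]) / len(sys_of[m]) for m in mentions) / len(mentions)
    recall = sum(len(gold_of[m] & sys_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical example: gold says {1,2,3} and {4,5} are the entities; the system
# merged mention 4 into the first entity and left mention 5 on its own.
gold = [frozenset({1, 2, 3}), frozenset({4, 5})]
system = [frozenset({1, 2, 3, 4}), frozenset({5})]
p, r, f = b_cubed(gold, system)
print(f"B3 precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```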