Lexical paraphrase extraction with multiple semantic information

Bibliographic Details
Main Author: Ho, Chuk Fong
Format: Thesis
Language: English
Published: 2012
Subjects: Lexicology - Data processing; Semantic integration (Computer systems); SNePS (Computer program language)
Online Access: http://psasir.upm.edu.my/id/eprint/30924/1/FSKTM%202012%202R.pdf
institution Universiti Putra Malaysia
collection PSAS Institutional Repository
language English
topic Lexicology - Data processing
Semantic integration (Computer systems)
SNePS (Computer program language)
description Natural language processing (NLP) refers to the interaction between humans and computers in which computers try to understand and make sense of human language. However, human beings tend to express similar meanings using sentences with different structures or different surface wordings. This phenomenon, called variability, makes NLP a difficult task. Since paraphrases are different words, phrases or sentences that express the same or almost the same meaning, a variety of paraphrase extraction methods have been proposed in the belief that paraphrases can be used to capture this variability. In general, paraphrase extraction methods can be categorized as corpus-based or knowledge-based. A corpus-based method depends on syntactic information (the rules that govern how words are arranged into phrases and sentences), while a knowledge-based method depends on semantic information (information about meaning). However, previous studies have shown that relying on syntactic information alone can result in antonyms and barely related or unrelated words being mistakenly extracted as paraphrases. Semantics, on the other hand, is a complex study of meaning, so extracting paraphrases from shallow semantic information, or from a single type of semantic information only, such as synonyms or semantic relations, is ineffective: it is no different from solving a complex problem with incomplete information.
The main purpose of this thesis is to propose a new model, called Multilayer Semantic-based Validation Paraphrase Extraction (MSVPE), which relies on the use of different types of semantic information. In particular, MSVPE collects paraphrase candidates from lexical resources and then validates the candidates using a word similarity method, a sentence similarity method and a domain matching technique, which correspond to the use of semantic relations, definitions and domains respectively. However, existing sentence similarity and word similarity methods have flaws. In particular, existing sentence similarity methods determine the semantic similarity between sentences based on an incorrect interpretation of each sentence's meaning and on incomplete information, while existing word similarity methods derive the semantic similarity between words from multiple features that are not processed and combined properly. Consequently, the similarity judgments they produce are unreliable. To address these problems, we also propose: 1) a new sentence similarity method (SSMv1) that compares the actual meaning of each sentence, 2) another sentence similarity method (SSMv2) that takes multiple pieces of information into consideration, and 3) a new word similarity method (WSM) that makes use of optimally processed and combined features.
To evaluate MSVPE, SSMv1, SSMv2 and WSM, four experiments were conducted on three data sets. SSMv1, SSMv2 and WSM were tested on two standard data sets, widely used for evaluation purposes, which consist of 30 pairs of definitions and 65 pairs of nouns respectively, ranging from highly synonymous to semantically unrelated. MSVPE, in contrast, was tested on a data set created in this study, consisting of 85 words and 56 sentences. Experimental results showed that, compared with two benchmarks based solely on syntactic information, MSVPE extracts paraphrases more effectively, probably because semantic information is more closely related to meaning than syntactic information. Results further showed that MSVPE with multiple types of semantic information outperforms MSVPE with only a single type: although the effectiveness of the different types of semantic information varies, they are complementary. Experimental results also showed that SSMv1, SSMv2 and WSM significantly outperform all of their benchmarks, indicating that they better simulate the human capability to infer similarity. The reason is that SSMv1 correctly interprets the meaning of each sentence, SSMv2 makes use of complementary information, and WSM applies an optimized transformation of different types of features and an optimized combination of those features, making it the closest approximation of human judgment.
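A minimal illustrative sketch of the multilayer validation idea described above (not the thesis's actual implementation): it assumes WordNet, accessed through NLTK, as the lexical resource, and uses simple stand-ins (Wu-Palmer similarity, token overlap of dictionary definitions, shared WordNet lexicographer files) in place of the thesis's WSM, SSMv1/SSMv2 and domain-matching steps; the thresholds are placeholders.

# Hypothetical sketch of an MSVPE-style, multilayer candidate validation
# (illustration only; not the method implemented in the thesis).
# Assumes WordNet via NLTK as the lexical resource; run
# nltk.download('wordnet') once beforehand.
from nltk.corpus import wordnet as wn

WORD_SIM_THRESHOLD = 0.5   # placeholder cut-off for the word-level check
DEF_SIM_THRESHOLD = 0.3    # placeholder cut-off for the definition-level check

def collect_candidates(word):
    """Gather candidates: lemmas that share a synset with `word`."""
    return {lemma for s in wn.synsets(word)
            for lemma in s.lemma_names() if lemma != word}

def word_similarity(w1, w2):
    """Stand-in for WSM: best Wu-Palmer score over all synset pairs."""
    scores = [(s1.wup_similarity(s2) or 0.0)
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def definition_similarity(d1, d2):
    """Stand-in for SSMv1/SSMv2: token overlap between two definitions."""
    t1, t2 = set(d1.lower().split()), set(d2.lower().split())
    return len(t1 & t2) / max(len(t1 | t2), 1)

def domains_match(w1, w2):
    """Stand-in for domain matching: shared WordNet lexicographer files."""
    return bool({s.lexname() for s in wn.synsets(w1)} &
                {s.lexname() for s in wn.synsets(w2)})

def extract_paraphrases(word):
    """Keep only candidates that pass all three validation layers."""
    accepted = []
    for cand in collect_candidates(word):
        if word_similarity(word, cand) < WORD_SIM_THRESHOLD:
            continue  # fails the semantic-relation (word similarity) layer
        defs_w = [s.definition() for s in wn.synsets(word)]
        defs_c = [s.definition() for s in wn.synsets(cand)]
        best = max((definition_similarity(a, b)
                    for a in defs_w for b in defs_c), default=0.0)
        if best < DEF_SIM_THRESHOLD:
            continue  # fails the definition (sentence similarity) layer
        if not domains_match(word, cand):
            continue  # fails the domain-matching layer
        accepted.append(cand)
    return accepted

print(extract_paraphrases("car"))  # output varies with the placeholder thresholds

Candidates that fail any of the three checks are discarded, mirroring the layered validation the abstract describes; in the thesis the three layers correspond to semantic relations, definitions and domains drawn from lexical resources.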
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Ho, Chuk Fong
title Lexical paraphrase extraction with multiple semantic information
granting_institution Universiti Putra Malaysia
granting_department Faculty of Computer Science and Information Technology
publishDate 2012
url http://psasir.upm.edu.my/id/eprint/30924/1/FSKTM%202012%202R.pdf