Lexical paraphrase extraction with multiple semantic information
Main Author: | Ho, Chuk Fong |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2012 |
Subjects: | Lexicology - Data processing; Semantic integration (Computer systems); SNePS (Computer program language) |
Online Access: | http://psasir.upm.edu.my/id/eprint/30924/1/FSKTM%202012%202R.pdf |
id | my-upm-ir.30924 |
---|---|
record_format | uketd_dc |
institution | Universiti Putra Malaysia |
collection | PSAS Institutional Repository |
language | English |
topic | Lexicology - Data processing; Semantic integration (Computer systems); SNePS (Computer program language) |
description |
Natural language processing (NLP) refers to the interaction between humans and computers in which computers try to understand and make sense of human languages. However, human beings tend to express similar meanings using sentences with different structures or different surface wordings. This phenomenon, called variability, makes NLP a difficult task. Since paraphrases are different words, phrases or sentences that express the same or almost the same meaning, a variety of paraphrase extraction methods have been proposed in the belief that paraphrases can be used to capture this variability. In general, paraphrase extraction methods can be categorized into corpus-based and knowledge-based. A corpus-based method depends on syntactic information (the rules that govern how words are arranged to form phrases and sentences), while a knowledge-based method depends on semantic information (information about meanings). However, previous studies have shown that relying on syntactic information alone can result in mistakenly extracting antonyms and barely related or unrelated words as paraphrases. Semantics, on the other hand, is a complex study of meanings. Therefore, extracting paraphrases based only on shallow or a single instance of semantic information, such as synonyms or semantic relations, would be ineffective, as it is no different from solving a complex problem with incomplete information. The main purpose of this thesis is to propose a new model, called Multilayer Semantic-based Validation Paraphrase Extraction (MSVPE), which relies on the use of different types of semantic information. In particular, MSVPE collects paraphrase candidates from instances of lexical resources. It then validates the candidates using a word similarity method, a sentence similarity method and a domain matching technique, which correspond to the use of semantic relations, definitions and domains respectively. However, the existing sentence similarity methods and word similarity methods have some flaws. In particular, sentence similarity methods determine the semantic similarity between sentences based on an incorrect interpretation of the meaning of each sentence and on incomplete information. Word similarity methods, on the other hand, derive the semantic similarity between words from multiple features that have not been processed and combined properly. Consequently, the similarity judgments they produce are not reliable. To address these problems, we also propose: 1) a new sentence similarity method (SSMv1) that compares the actual meaning of each sentence, 2) another sentence similarity method (SSMv2) that takes into consideration multiple pieces of information, and 3) a new word similarity method (WSM) that makes use of optimally processed and combined features. To evaluate MSVPE, SSMv1, SSMv2 and WSM, four different experiments were conducted on three different data sets. SSMv1, SSMv2 and WSM were tested on two standard data sets, consisting of 30 pairs of definitions and 65 pairs of nouns respectively, that range from highly synonymous to semantically unrelated and that have been widely used for evaluation purposes. In contrast, MSVPE was tested on a data set created in this study, which consists of 85 words and 56 sentences. Experimental results showed that, compared with two benchmarks based solely on syntactic information, MSVPE can extract paraphrases more effectively. This is probably because semantic information is more closely related to meaning than syntactic information is.
Results further showed that MSVPE with multiple instances of semantic information outperforms MSVPE with only a single instance of semantic information. Although the effectiveness of the different types of semantic information varies, they are complementary. Experimental results also showed that SSMv1, SSMv2 and WSM significantly outperform all of their benchmarks, indicating that they can better simulate human inference capability. The reason is that SSMv1 correctly understands the meaning of each sentence, while SSMv2 makes use of complementary pieces of information. WSM, on the other hand, applies an optimized transformation of different types of features and an optimized combination of them, representing the nearest replica of human thinking behavior.
|
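The abstract describes a collect-then-validate pipeline: gather paraphrase candidates from lexical resources, then keep only those that pass a word similarity check (semantic relations), a sentence similarity check (definitions) and a domain matching check. The sketch below is a minimal, hypothetical illustration of that idea only, not the thesis's MSVPE implementation. It assumes NLTK's WordNet as the lexical resource and uses Wu-Palmer similarity, gloss-word overlap and shared lexicographer categories as crude stand-ins for the three validators; the equal weighting and the 0.5 threshold are arbitrary choices made here.

```python
# Minimal, hypothetical sketch of a collect-then-validate paraphrase
# extraction pipeline in the spirit of the abstract above -- NOT the
# thesis's MSVPE implementation.  WordNet (via NLTK) stands in for the
# lexical resources; the three validators, their equal weighting and the
# 0.5 threshold are assumptions made purely for illustration.
# Requires: pip install nltk; then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn


def collect_candidates(word):
    """Step 1: collect paraphrase candidates from a lexical resource
    (here: all lemmas of every WordNet synset of the word)."""
    candidates = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemma_names():
            if lemma.lower() != word.lower():
                candidates.add(lemma.replace("_", " "))
    return candidates


def relation_score(word, candidate):
    """Validator 1 (semantic relations): best Wu-Palmer similarity
    between any synset pair of the two words."""
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(word)
              for s2 in wn.synsets(candidate.replace(" ", "_"))]
    return max(scores, default=0.0)


def definition_score(word, candidate):
    """Validator 2 (definitions): crude 'sentence similarity' as word
    overlap between the two words' glosses."""
    def gloss_words(w):
        return {t for s in wn.synsets(w) for t in s.definition().lower().split()}
    g1, g2 = gloss_words(word), gloss_words(candidate.replace(" ", "_"))
    return len(g1 & g2) / len(g1 | g2) if (g1 | g2) else 0.0


def domain_match(word, candidate):
    """Validator 3 (domains): 1.0 if the words share a WordNet
    lexicographer category (e.g. noun.artifact), else 0.0."""
    d1 = {s.lexname() for s in wn.synsets(word)}
    d2 = {s.lexname() for s in wn.synsets(candidate.replace(" ", "_"))}
    return 1.0 if d1 & d2 else 0.0


def extract_paraphrases(word, threshold=0.5):
    """Keep candidates whose averaged validation score clears the threshold."""
    accepted = {}
    for cand in collect_candidates(word):
        score = (relation_score(word, cand)
                 + definition_score(word, cand)
                 + domain_match(word, cand)) / 3.0
        if score >= threshold:
            accepted[cand] = round(score, 3)
    return accepted


if __name__ == "__main__":
    print(extract_paraphrases("car"))  # e.g. {'auto': ..., 'automobile': ...}
```

In the thesis itself, the crude gloss-overlap and path-similarity stand-ins above are replaced by the proposed SSMv1/SSMv2 sentence similarity methods and the WSM word similarity method.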
format | Thesis |
qualification_name | Doctor of Philosophy (PhD.) |
qualification_level | Doctorate |
author | Ho, Chuk Fong |
author_facet | Ho, Chuk Fong |
author_sort | Ho, Chuk Fong |
title | Lexical paraphrase extraction with multiple semantic information |
title_short | Lexical paraphrase extraction with multiple semantic information |
title_full | Lexical paraphrase extraction with multiple semantic information |
title_fullStr | Lexical paraphrase extraction with multiple semantic information |
title_full_unstemmed | Lexical paraphrase extraction with multiple semantic information |
title_sort | lexical paraphrase extraction with multiple semantic information |
granting_institution | Universiti Putra Malaysia |
granting_department | Faculty of Computer Science and Information Technology |
publishDate | 2012 |
url | http://psasir.upm.edu.my/id/eprint/30924/1/FSKTM%202012%202R.pdf |
_version_ | 1747811606620798976 |