Web-based cross-language semantic plagiarism detection

Bibliographic Details
Main Author: Chow, Kok Kent
Format: Thesis
Published: 2013
Description
Summary: Cross-language and semantic plagiarism have recently been on the rise, and many plagiarism detection tools are unable to detect such cases. In this research, we propose a new framework that combines summarisation with cross-language and semantic plagiarism detection. We take Bahasa Melayu as the input language of the submitted document and English as the language of the documents that may have been plagiarised. In this framework, we shorten the query document using a fuzzy swarm-based summarisation approach. With this approach, sentences are chosen according to their importance, which is determined by five predefined sentence features combined with fuzzy logic. This technique was chosen because of its effectiveness in previous research. The summarised input documents are translated into English using the Google Translate Application Programming Interface (API); the words are then stemmed and stop words are removed. The tokenised documents are sent to the Google AJAX Search API to find similar documents on the World Wide Web. We combine the Stanford Parser and WordNet to determine the level of semantic similarity between the suspected documents and the candidate source documents. The Stanford Parser assigns each term in a sentence its corresponding role, such as noun, verb or adjective. Based on these roles, we represent each sentence in predicate form, and similarity is measured between those predicates using information content values from the WordNet taxonomy. The test dataset consists of two sets of Malay documents, produced according to different plagiarism practices.
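The idea of scoring similarity by information content in a taxonomy can be illustrated with a minimal sketch. This is not the thesis's implementation: the toy taxonomy, the corpus counts and the choice of a Resnik-style measure (the information content of the most informative common ancestor) are all assumptions made for illustration, since the summary does not say which information-content measure is used.

```python
import math

# Hypothetical toy taxonomy (child -> parent); a stand-in for WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}

# Hypothetical corpus counts of each concept's own occurrences.
OWN_COUNT = {"entity": 0, "animal": 2, "vehicle": 1, "dog": 5, "cat": 4, "car": 8}


def ancestors(concept):
    """Return the concept followed by all of its ancestors, nearest first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def cumulative_counts():
    """A concept's total count includes the counts of all its descendants."""
    totals = {c: 0 for c in PARENT}
    for concept, n in OWN_COUNT.items():
        for a in ancestors(concept):
            totals[a] += n
    return totals


TOTALS = cumulative_counts()
ROOT_TOTAL = TOTALS["entity"]


def information_content(concept):
    """IC(c) = -log p(c), with p(c) estimated from cumulative counts."""
    return -math.log(TOTALS[concept] / ROOT_TOTAL)


def resnik_similarity(a, b):
    """Information content of the most informative common ancestor (assumed measure)."""
    common = set(ancestors(a)) & set(ancestors(b))
    return max(information_content(c) for c in common)
```

In this toy setting, `resnik_similarity("dog", "cat")` is positive because both concepts share the fairly specific ancestor `animal`, whereas `resnik_similarity("dog", "car")` is zero because their only common ancestor is the root, which carries no information.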
The results show that our proposed semantic-based similarity measure can achieve higher precision, recall and F-measure than the conventional Longest Common Subsequence (LCS) approach, which scores the similarity between sentences by the longest sequence of words they share in left-to-right order, whether or not those words are adjacent.
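The LCS baseline and the evaluation metrics mentioned above can be sketched as follows. This is a minimal illustration rather than the code used in the thesis: the whitespace tokenisation, the normalisation by the longer sentence and the function names are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Tokens match: extend the subsequence found so far.
                curr.append(prev[j - 1] + 1)
            else:
                # No match: carry over the better of the two neighbours.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(s1, s2):
    """LCS length divided by the longer sentence length (assumed normalisation)."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    if not t1 or not t2:
        return 0.0
    return lcs_length(t1, t2) / max(len(t1), len(t2))


def precision_recall_f1(detected, actual):
    """Set-based precision, recall and F-measure for detected plagiarism cases."""
    detected, actual = set(detected), set(actual)
    tp = len(detected & actual)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, "the cat sat on the mat" and "the cat is on a mat" share the subsequence "the cat on mat", even though those words are not all adjacent. Because LCS matches only identical tokens, a paraphrase that swaps in synonyms lowers the score, which is the kind of case a semantic measure is meant to handle.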