A deep learning framework for the detection of source code plagiarism using Siamese network and embedding models /
Source code plagiarism is an ongoing problem that threatens academic integrity and intellectual rights. Various detection approaches have been proposed to replace manual inspection, which is laborious and time-consuming. These detection approaches can...
Main Author: | Manahi, Mohammed S.M. |
---|---|
Format: | Thesis |
Language: | English |
Published: | Kuala Lumpur : Kulliyyah of Information and Communication Technology, International Islamic University Malaysia, 2021 |
Subjects: | Deep learning (Machine learning); Neural networks (Computer science) |
Online Access: | http://studentrepo.iium.edu.my/handle/123456789/10996 |
LEADER | 044480000a22004090004500 | ||
---|---|---|---|
008 | 220425s2021 my f m 000 0 eng d | ||
040 | |a UIAM |b eng |e rda | ||
041 | |a eng | ||
043 | |a a-my--- | ||
050 | 0 | 0 | |a Q325.73 |
100 | 1 | |a Manahi, Mohammed S.M. |9 7409 |e author | |
245 | 1 | 2 | |a A deep learning framework for the detection of source code plagiarism using Siamese network and embedding models / |c by Mohammed S.M. Manahi |
264 | 1 | |a Kuala Lumpur : |b Kulliyyah of Information and Communication Technology, International Islamic University Malaysia, |c 2021 | |
300 | |a xviii, 142 leaves : |b illustrations ; |c 30 cm. | ||
336 | |2 rdacontent |a text | ||
337 | |2 rdamedia |a unmediated | ||
337 | |2 rdamedia |a computer | ||
338 | |2 rdacarrier |a volume | ||
338 | |2 rdacarrier |a online resource | ||
347 | |2 rdaft |a text file |b PDF | ||
500 | |a Abstracts in English and Arabic. | ||
500 | |a "A thesis submitted in fulfilment of the requirement for the degree of Master of Computing (Computer Science and Information Technology)." --On title page. | ||
502 | |a Thesis (MCST)--International Islamic University Malaysia, 2021. | ||
504 | |a Includes bibliographical references (leaves 132-141). | ||
520 | |a Source code plagiarism is an ongoing problem that threatens academic integrity and intellectual rights. Various detection approaches have been proposed to replace manual inspection, which is laborious and time-consuming. These detection approaches can be categorised into four major domains: software engineering, knowledge discovery, shallow parsing and machine learning. A review of the literature revealed that most detection approaches have been evaluated against the commonly referenced six-level classification of source code transformations known as the Faidhi and Robinson spectrum, except for the approaches in the machine learning domain. Thus, this research sought to fill this gap with a machine learning approach that uses embedding models to detect source code plagiarism and is evaluated against the six-level classification. The objectives of this research are threefold: to extract various embedding sequences as similarity features from source code using embedding models; to train a Siamese network that learns similarity representations from source code embedding sequences; and to develop a deep learning framework that leverages embedding sequences and a Siamese network to identify the most accurate detection approach based on the standard six-level classification of plagiarism activities defined by Faidhi and Robinson. A deep learning framework that utilises a Siamese network and embedding models is proposed to detect deliberate plagiarism in source code. The proposed framework splits source code into character-based, word-based and token-based sequences to obtain embedding sequences through Word2Vec and fastText models. These embedding sequences are then used as inputs to a Siamese BLSTM network for learning similarity representations. 
The experimental results showed that the character-based embedding sequences with the Word2Vec, Skip-Gram and Negative Sampling (W2V-SGNS) approach and the token-based embedding sequences with the fastText, Skip-Gram and Hierarchical Softmax (FT-SGHS) approach outperformed the other approaches. The framework was also able to detect plagiarism up to level five (i.e., semantic equivalents) of the standard classification. However, future experiments will require a larger dataset and fine-tuning of the Siamese network to reduce overfitting and to improve the generalisation of the trained models to plagiarism attacks. | ||
650 | 0 | |a Deep learning (Machine learning) | |
650 | 0 | |a Neural networks (Computer science) |9 4136 | |
655 | |a Theses, IIUM local | ||
690 | |a Dissertations, Academic |x Department of Computer Science |z IIUM |9 7412 | ||
700 | 0 | |a Suriani Sulaiman |e degree supervisor |9 7410 | |
700 | 0 | |a Normi Sham Awang Abu Bakar |e degree supervisor |9 7411 | |
710 | 2 | |a International Islamic University Malaysia. |b Department of Computer Science |9 7413 | |
856 | 4 | |u http://studentrepo.iium.edu.my/handle/123456789/10996 | |
900 | |a sz-asbh | ||
942 | |2 lcc |c THESIS |n 0 | ||
999 | |c 502861 |d 534278 | ||
952 | |0 0 |1 0 |2 lcc |4 0 |6 T Q 00325.00073 M00266D 02021 |7 3 |8 IIUMTHESIS |9 982084 |a IIUM |b IIUM |c THESIS |d 2022-07-13 |g 0.00 |o t Q 325.73 M266D 2021 |p 11100437189 |r 1900-01-02 |t 1 |v 0.00 |y THESIS |
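
The abstract above describes splitting source code into character-based, word-based and token-based sequences before embedding them. The record does not give the thesis's actual preprocessing rules, so the following is only a minimal sketch under stated assumptions: the function names and the regular expression are hypothetical, whitespace is dropped from the character view, and "token" is approximated as an identifier, a number, or a single symbol. In the framework described, sequences like these would then be passed to Word2Vec or fastText models, a step not shown here.

```python
import re
from typing import List

# Hypothetical tokenisation rule (an assumption, not taken from the thesis):
# an identifier/keyword, a number, or any single non-whitespace symbol.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def char_sequence(source: str) -> List[str]:
    """Character-based view: one item per non-whitespace character."""
    return [ch for ch in source if not ch.isspace()]


def word_sequence(source: str) -> List[str]:
    """Word-based view: split on whitespace only, punctuation stays attached."""
    return source.split()


def token_sequence(source: str) -> List[str]:
    """Token-based view: identifiers, numbers and symbols as separate items."""
    return TOKEN_PATTERN.findall(source)


if __name__ == "__main__":
    snippet = "int sum = a + b;"
    print(char_sequence(snippet))
    print(word_sequence(snippet))
    print(token_sequence(snippet))
```

The three views differ mainly in granularity: the word-based split keeps `b;` as one item, while the token-based split separates `b` from `;`, which may matter when identifiers are renamed but surrounding syntax is unchanged.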