Complex word identification model for lexical simplification in the Malay language for non-native speakers

Text Simplification (TS) is the process of converting complex text into text that is easier to understand. Lexical Simplification (LS), a method within TS, replaces words with simpler alternatives. Past studies have shown weaknesses in the first LS task, Complex Word Identification (CWI),...

Bibliographic Details
Main Author:	Salehah, Omar
Format:	Thesis
Language:	eng
Published:	2023
Subjects:	T Technology (General)
Online Access:	https://etd.uum.edu.my/10852/1/permission%20to%20deposit-grant%20the%20permission-s825765.pdf https://etd.uum.edu.my/10852/2/s825765_01.pdf

id	my-uum-etd.10852
record_format	uketd_dc
institution	Universiti Utara Malaysia
collection	UUM ETD
language	eng
advisor	Abu Bakar, Juhaida; Mohd Nadzir, Maslinda
topic	T Technology (General)
description	Text Simplification (TS) is the process of converting complex text into text that is easier to understand. Lexical Simplification (LS), a method within TS, replaces words with simpler alternatives. Past studies have shown weaknesses in the first LS task, Complex Word Identification (CWI), where previous CWI models have misidentified simple and complex words. The main objective of this study is to produce a Malay CWI model, with three sub-objectives: i) to propose a dataset based on a state-of-the-art Malay corpus, ii) to produce a Malay CWI model, and iii) to evaluate it using standard statistical metrics: accuracy, precision, recall, F1-score, and G1-score. The model follows the CWI development process outlined by previous research. The study consists of three modules: i) a Malay CWI dataset, ii) Malay CWI features with new, enhanced stemmer rules, and iii) a CWI model based on the Gradient Boosted Tree (GB) algorithm. The model is evaluated on a state-of-the-art Malay corpus, which is divided into training and testing data using k-fold cross-validation with k=10. A series of tests was performed to arrive at the best model, covering feature selection, generation of an improved stemmer algorithm, data imbalance, and classifier testing. The best model, using the Gradient Boosted Tree algorithm, achieved an average accuracy of 92.55%, an F1-score of 92.09%, and a G1-score of 89.7%. Its F1-score exceeded the English standard baseline score by 16.3%. Three linguistic experts verified the results for 38 unseen sentences, and the model's output compared significantly positively with the experts' assessment. The proposed CWI model improves on the F1-score obtained in the second CWI shared task and benefits non-native speakers and researchers.
format	Thesis
qualification_name	other
qualification_level	Master's degree
author	Salehah, Omar
title	Complex word identification model for lexical simplification in the Malay language for non-native speakers
granting_institution	Universiti Utara Malaysia
granting_department	Awang Had Salleh Graduate School of Arts & Sciences
publishDate	2023
url	https://etd.uum.edu.my/10852/1/permission%20to%20deposit-grant%20the%20permission-s825765.pdf https://etd.uum.edu.my/10852/2/s825765_01.pdf
_version_	1794023773725261824
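
The evaluation protocol named in the abstract (10-fold cross-validation scored by accuracy, precision, recall, F1-score, and G1-score) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the function names `kfold_indices` and `binary_metrics` are hypothetical, the classifier itself is left out, and the G-score is assumed here to be the harmonic mean of accuracy and recall, since the abstract does not give its definition.

```python
import random


def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def harmonic_mean(a, b):
    """Harmonic mean of two scores; 0.0 when both are zero."""
    return 2 * a * b / (a + b) if (a + b) else 0.0


def binary_metrics(y_true, y_pred):
    """Score binary labels, where 1 = complex word and 0 = simple word."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": harmonic_mean(precision, recall),
        # Assumption: G-score as the harmonic mean of accuracy and recall.
        "g_score": harmonic_mean(accuracy, recall),
    }


if __name__ == "__main__":
    # Toy labels only; these are not data from the thesis.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(binary_metrics(y_true, y_pred))
    for train_idx, test_idx in kfold_indices(25, k=10):
        print(len(train_idx), len(test_idx))
```

In a full run, a classifier would be fitted on each `train_idx` split and scored on the matching `test_idx` split, and the per-fold metrics would then be averaged, which is presumably how the reported average accuracy was obtained.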
