Enhanced text stemmer for standard and non-standard word patterns in Malay texts

Text stemming is a useful language preprocessing tool in the field of information retrieval, text classification and natural language processing. A text stemmer is a computer program that removes affixes, clitics and particles to obtain the root words from the derived words. Over the past few years,...

Full description

Saved in:
Bibliographic Details
Main Author: Kassim, Mohamad Nizam
Format: Thesis
Language:English
Published: 2020
Subjects:
Online Access:http://eprints.utm.my/id/eprint/98431/1/MohamadNizamKassimPSC2020.pdf
Tags: Add Tag
No Tags, Be the first to tag this record!
id my-utm-ep.98431
record_format uketd_dc
spelling my-utm-ep.984312023-01-08T02:12:33Z Enhanced text stemmer for standard and non-standard word patterns in Malay texts 2020 Kassim, Mohamad Nizam QA75 Electronic computers. Computer science Text stemming is a useful language preprocessing tool in the field of information retrieval, text classification and natural language processing. A text stemmer is a computer program that removes affixes, clitics and particles to obtain the root words from the derived words. Over the past few years, few text stemmers have been developed for the Malay language but unfortunately, these text stemmers suffer from various stemming errors. It is due to the difficulty in dealing with the complexity of the Malay language morphological rules. These text stemmers are developed for text stemming against affixation words only whereas there are other affixation, reduplication and compounding words in the Malay language. Furthermore, none of these text stemmers has been developed for text stemming against social media texts which comprise of the non-standard derived words. Therefore, this research study aims to improve the existing text stemmers capability of stemming affixation, reduplication and compounding words while minimising the possible stemming errors. Moreover, this research study also aims to address text stemming process for non-standard derived words on the social media platforms by removing non-standard affixes, clitics and particles. This research study adopts a multiple text stemming approach that use affix removal method and dictionary lookup in specific arrangement order to correctly stem standard and non-standard affixation, reduplication and compounding words in the standard texts and social media texts. The proposed text stemmer is evaluated against various text documents using the direct evaluation method and the text classification is used as the indirect evaluation method to validate the effectiveness of the proposed enhanced text stemmer. In general, the proposed enhanced text stemmer outperforms the baseline text stemmer. The stemming accuracy of the proposed enhanced text stemmer achieves an average of 98.7% against the standard texts and an average of 73.7% against the social media texts. Meanwhile, the performance of the proposed enhanced text stemmer in the sports news classification application achieves an average of 85% accuracy and the illicit content classification application achieves an average of 75% accuracy. Meanwhile, the baseline text stemmer achieves an average of 63.5% stemming accuracy against the standard texts but unfortunately, it is unable to stem non-standard derived words in the social media texts. The baseline text stemmer performs poorly in sports news classification and illicit content classification with an average accuracy of 78% and 63% respectively. In short, the experimental results suggest that the proposed enhanced text stemmer has promising stemming accuracy for text stemming against the standard texts and social media texts. It also influences the performance of the text classification application. 2020 Thesis http://eprints.utm.my/id/eprint/98431/ http://eprints.utm.my/id/eprint/98431/1/MohamadNizamKassimPSC2020.pdf application/pdf en public http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:143711 phd doctoral Universiti Teknologi Malaysia, Faculty of Engineering - School of Computing Faculty of Engineering - School of Computing
institution Universiti Teknologi Malaysia
collection UTM Institutional Repository
language English
topic QA75 Electronic computers
Computer science
spellingShingle QA75 Electronic computers
Computer science
Kassim, Mohamad Nizam
Enhanced text stemmer for standard and non-standard word patterns in Malay texts
description Text stemming is a useful language preprocessing tool in the field of information retrieval, text classification and natural language processing. A text stemmer is a computer program that removes affixes, clitics and particles to obtain the root words from the derived words. Over the past few years, few text stemmers have been developed for the Malay language but unfortunately, these text stemmers suffer from various stemming errors. It is due to the difficulty in dealing with the complexity of the Malay language morphological rules. These text stemmers are developed for text stemming against affixation words only whereas there are other affixation, reduplication and compounding words in the Malay language. Furthermore, none of these text stemmers has been developed for text stemming against social media texts which comprise of the non-standard derived words. Therefore, this research study aims to improve the existing text stemmers capability of stemming affixation, reduplication and compounding words while minimising the possible stemming errors. Moreover, this research study also aims to address text stemming process for non-standard derived words on the social media platforms by removing non-standard affixes, clitics and particles. This research study adopts a multiple text stemming approach that use affix removal method and dictionary lookup in specific arrangement order to correctly stem standard and non-standard affixation, reduplication and compounding words in the standard texts and social media texts. The proposed text stemmer is evaluated against various text documents using the direct evaluation method and the text classification is used as the indirect evaluation method to validate the effectiveness of the proposed enhanced text stemmer. In general, the proposed enhanced text stemmer outperforms the baseline text stemmer. The stemming accuracy of the proposed enhanced text stemmer achieves an average of 98.7% against the standard texts and an average of 73.7% against the social media texts. Meanwhile, the performance of the proposed enhanced text stemmer in the sports news classification application achieves an average of 85% accuracy and the illicit content classification application achieves an average of 75% accuracy. Meanwhile, the baseline text stemmer achieves an average of 63.5% stemming accuracy against the standard texts but unfortunately, it is unable to stem non-standard derived words in the social media texts. The baseline text stemmer performs poorly in sports news classification and illicit content classification with an average accuracy of 78% and 63% respectively. In short, the experimental results suggest that the proposed enhanced text stemmer has promising stemming accuracy for text stemming against the standard texts and social media texts. It also influences the performance of the text classification application.
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Kassim, Mohamad Nizam
author_facet Kassim, Mohamad Nizam
author_sort Kassim, Mohamad Nizam
title Enhanced text stemmer for standard and non-standard word patterns in Malay texts
title_short Enhanced text stemmer for standard and non-standard word patterns in Malay texts
title_full Enhanced text stemmer for standard and non-standard word patterns in Malay texts
title_fullStr Enhanced text stemmer for standard and non-standard word patterns in Malay texts
title_full_unstemmed Enhanced text stemmer for standard and non-standard word patterns in Malay texts
title_sort enhanced text stemmer for standard and non-standard word patterns in malay texts
granting_institution Universiti Teknologi Malaysia, Faculty of Engineering - School of Computing
granting_department Faculty of Engineering - School of Computing
publishDate 2020
url http://eprints.utm.my/id/eprint/98431/1/MohamadNizamKassimPSC2020.pdf
_version_ 1776100586983260160