The performance of soft computing techniques on content-based SMS spam filtering
Content-based filtering is one of the most widely used methods to combat SMS (Short Message Service) spam. This method represents SMS text messages by a set of selected features extracted from data sets. Most available data sets have an imbalanced class distribution. However...
Main Author: | Hassan Saeed, Waddah Waheeb |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2015 |
Subjects: | QA76 Computer software |
Online Access: | http://eprints.uthm.edu.my/1496/2/WADDAH%20WAHEEB%20HASSAN%20SAEED%20COPYRIGHT%20DECLARATION.pdf (staff only) http://eprints.uthm.edu.my/1496/1/24p%20WADDAH%20WAHEEB%20HASSAN%20SAEED.pdf (public) http://eprints.uthm.edu.my/1496/3/WADDAH%20WAHEEB%20HASSAN%20SAEED%20WATERMARK.pdf (valid users only) |
id |
my-uthm-ep.1496 |
---|---|
record_format |
uketd_dc |
institution |
Universiti Tun Hussein Onn Malaysia |
collection |
UTHM Institutional Repository |
language |
English |
topic |
QA76 Computer software |
description |
Content-based filtering is one of the most widely used methods to combat SMS (Short
Message Service) spam. This method represents SMS text messages by a set of selected
features extracted from data sets. Most available data sets have an imbalanced class
distribution. However, little attention has been paid to handling this problem, which
affects the characteristics and size of the selected features and leads to undesired
performance. Soft computing approaches have been applied successfully in content-based
spam filtering. To enhance their performance, a suitable feature subset should be
selected. Therefore, this research investigates how well suited three soft computing
techniques, Fuzzy Similarity, Artificial Neural Network and Support Vector Machines
(SVM), are for content-based SMS spam filtering using an appropriately sized feature
set selected by the Gini Index metric, which is able to extract suitable features
from imbalanced data sets. The data sets used in this research were taken from three
sources: the UCI repository, Dublin Institute of Technology (DIT) and British English
SMS. The performance of each technique was compared in terms of True Positive Rate
against False Positive Rate, F1 score and Matthews Correlation Coefficient. The
results showed that SVM with 150 features outperformed the other techniques on all
comparison measures. The average time needed to classify an SMS text message was a
fraction of a millisecond. Another test using the NUS SMS corpus was conducted to
validate the SVM classifier with 150 features. The results again demonstrated the
efficiency of the SVM classifier with 150 features for SMS spam filtering, with an
accuracy of about 99.2%. |
format |
Thesis |
qualification_name |
Master of Philosophy (M.Phil.) |
qualification_level |
Master's degree |
author |
Hassan Saeed, Waddah Waheeb |
title |
The performance of soft computing techniques on content-based SMS spam filtering |
granting_institution |
Universiti Tun Hussein Onn Malaysia |
granting_department |
Faculty of Computer Science and Information Technology |
publishDate |
2015 |
url |
http://eprints.uthm.edu.my/1496/2/WADDAH%20WAHEEB%20HASSAN%20SAEED%20COPYRIGHT%20DECLARATION.pdf http://eprints.uthm.edu.my/1496/1/24p%20WADDAH%20WAHEEB%20HASSAN%20SAEED.pdf http://eprints.uthm.edu.my/1496/3/WADDAH%20WAHEEB%20HASSAN%20SAEED%20WATERMARK.pdf |
_version_ |
1747830803005440000 |
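
The description above compares classifiers by True Positive Rate, False Positive Rate, F1 score and Matthews Correlation Coefficient. As a reading aid, the following minimal Python sketch computes these four measures from the counts of a binary confusion matrix, treating spam as the positive class. The function name `binary_metrics` and the example counts are illustrative assumptions and do not come from the thesis; the formulas are the standard textbook definitions, not necessarily the author's exact implementation.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Return TPR, FPR, F1 and MCC for a binary confusion matrix.

    tp/fn count spam messages classified as spam/legitimate;
    fp/tn count legitimate messages classified as spam/legitimate.
    Any ratio with a zero denominator is reported as 0.0.
    """
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # recall on spam
    fpr = fp / (fp + tn) if (fp + tn) else 0.0        # legitimate flagged as spam
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)) if (precision + tpr) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"tpr": tpr, "fpr": fpr, "f1": f1, "mcc": mcc}


# Hypothetical counts for an imbalanced test set (100 spam, 900 legitimate).
metrics = binary_metrics(tp=90, fp=5, tn=895, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, which may be why it is a reasonable choice for imbalanced SMS data sets such as those the description mentions.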