An ensemble learning method for spam email detection system based on metaheuristic algorithms

Bibliographic Details
Main Author: Behjat, Amir Rajabi
Format: Thesis
Language: English
Published: 2015
Online Access: http://psasir.upm.edu.my/id/eprint/65264/1/FSKTM%202015%2049IR.pdf
Description
Summary: In email spam detection, not only the different parts and content of emails matter; their structural and special features also play an effective role in dimensionality reduction and classifier accuracy. For example, spammers change message patterns to produce spam, such as writing the message in JavaScript or using different advertising images and words to form features or attributes. Even savvy users may be unable to report an email as spam when the spammer tries to defraud them. The aim of data mining is to search for and find undetermined patterns in huge databases. A well-known task is classification, which automatically predicts the class of new instances using known features or attributes. Major problems in classification are the large amount of training data, the large number of features, and the varying behavior of data streams, which reduce accuracy and increase computational cost in the classifier training phase. Feature subset selection and classifier ensemble learning are familiar techniques well suited to mitigating these problems. Recently, various techniques based on different algorithms have been developed; however, their classification accuracy and computational cost are not satisfactory. To address these challenges, the first phase of this study proposes a novel architecture based on ensemble feature selection techniques, including the Modified Binary Bat Algorithm (NBBA), the Binary Quantum Particle Swarm Optimization (QBPSO) algorithm and the Binary Quantum Gravitational Search Algorithm (QBGSA), hybridized with the Multi-layer Perceptron (MLP) classifier in order to select relevant feature subsets and improve classification accuracy.
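As a rough illustration of ensemble feature selection, the sketch below combines the binary feature masks produced by several selectors into a single mask. The abstract does not say how the outputs of the three metaheuristic selectors are merged, so the majority-vote rule, the function names `vote_feature_masks` and `apply_mask`, and the `min_votes` parameter are all assumptions made for illustration, not the thesis's method.

```python
from typing import List, Sequence


def vote_feature_masks(masks: List[List[int]], min_votes: int = 2) -> List[int]:
    """Merge binary feature masks (1 = feature selected, 0 = dropped).

    A feature is kept when at least `min_votes` selectors chose it.
    The voting rule is an assumption; the source does not specify one.
    """
    if not masks:
        raise ValueError("at least one mask is required")
    n_features = len(masks[0])
    if any(len(mask) != n_features for mask in masks):
        raise ValueError("all masks must have the same length")
    # zip(*masks) walks the masks column by column, one feature at a time.
    return [1 if sum(column) >= min_votes else 0 for column in zip(*masks)]


def apply_mask(row: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Keep only the values of `row` whose mask bit is 1."""
    return [value for value, keep in zip(row, mask) if keep]


# Hypothetical masks, standing in for the output of three different selectors.
mask_a = [1, 0, 1, 0]
mask_b = [1, 1, 0, 0]
mask_c = [0, 1, 1, 0]

combined = vote_feature_masks([mask_a, mask_b, mask_c])
reduced_row = apply_mask([0.5, 0.1, 0.9, 0.3], combined)
```

The reduced rows would then be passed to a classifier such as an MLP; the last feature is dropped here because no selector chose it.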
In the second phase, a classifier ensemble learning model is proposed with two separate parts: (i) selecting a relevant subset of the original features using the Binary Quantum Gravitational Search Algorithm (QBGSA), and (ii) mining data streams using various data chunks to overcome the failure of single classifiers based on the SVM, MLP and K-NN algorithms. Several experiments were conducted to evaluate the performance of the proposed ensemble methods on four benchmark datasets: LingSpam, SpamAssassin, Spambase and CSDMC2010. Compared with single feature selection algorithms, the experimental results show that the proposed ensemble method reduces dimensionality and the number of irrelevant features while producing reasonable classifier accuracy. The experiments demonstrate that the ensemble classifier learning method achieves better accuracy than single classifiers when mining data streams and selecting subsets of relevant features. In addition, the experiments show that, compared with individual techniques, the ensemble algorithms select highly relevant features to feed the MLP, yielding better classifier performance in terms of a lower false positive rate, higher accuracy, and better CPU time.
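The idea of combining several base classifiers can be sketched with a simple majority vote over their predicted labels. This is only a minimal sketch: the abstract does not state which combination rule the ensemble uses, so majority voting, the function name `majority_vote`, and the example predictions are assumptions for illustration.

```python
from collections import Counter
from typing import List


def majority_vote(predictions: List[List[str]]) -> List[str]:
    """Combine per-classifier label lists into one label per email.

    `predictions[i][j]` is the label classifier i assigned to email j.
    Majority voting is an assumed rule, not one stated in the source.
    """
    if not predictions:
        raise ValueError("at least one classifier's predictions are required")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must label the same emails")
    # For each email, pick the label that most classifiers agreed on.
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


# Hypothetical outputs of three base classifiers on one chunk of three emails.
svm_labels = ["spam", "ham", "spam"]
mlp_labels = ["spam", "spam", "ham"]
knn_labels = ["ham", "ham", "ham"]

ensemble_labels = majority_vote([svm_labels, mlp_labels, knn_labels])
```

With an odd number of classifiers and two classes, a tie cannot occur, which is one reason three base learners are a convenient choice for this kind of vote.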