An improved framework for content and link-based web spam detection: a combined approach

In the modern digital era, the Web has been utilized for searching information by using different search engines (SE) as a tool. However, web spammers misuse the web for financial benefits by ranking the irrelevant and spam web pages higher than relevant pages in the search engine's results pag...

Bibliographic Details
Main Author: Shahzad, Asim
Format: Thesis
Language: English
Published: 2021
Subjects: QA76.75-76.765 Computer software
Online Access: http://eprints.uthm.edu.my/1777/2/ASIM%20SHAHZAD%20-%20declaration.pdf
http://eprints.uthm.edu.my/1777/1/ASIM%20SHAHZAD%20-%2024p.pdf
http://eprints.uthm.edu.my/1777/3/ASIM%20SHAHZAD%20-%20fulltext.pdf
id my-uthm-ep.1777
record_format uketd_dc
spelling my-uthm-ep.1777 2021-10-11T07:58:48Z An improved framework for content and link-based web spam detection: a combined approach 2021-05 Shahzad, Asim QA76.75-76.765 Computer software 2021-05 Thesis http://eprints.uthm.edu.my/1777/ http://eprints.uthm.edu.my/1777/2/ASIM%20SHAHZAD%20-%20declaration.pdf text en staffonly http://eprints.uthm.edu.my/1777/1/ASIM%20SHAHZAD%20-%2024p.pdf text en public http://eprints.uthm.edu.my/1777/3/ASIM%20SHAHZAD%20-%20fulltext.pdf text en validuser phd doctoral Universiti Tun Hussein Onn Malaysia Faculty of Computer Science and Information Technology
institution Universiti Tun Hussein Onn Malaysia
collection UTHM Institutional Repository
language English
topic QA76.75-76.765 Computer software
spellingShingle QA76.75-76.765 Computer software
Shahzad, Asim
An improved framework for content and link-based web spam detection: a combined approach
description In the modern digital era, people search for information on the Web using search engines (SEs). However, web spammers exploit search engines for financial gain, using web spamming techniques to rank irrelevant and spam pages above relevant pages in the search engine results pages (SERPs). These top-ranked, unrelated pages give the user insufficient or inappropriate information, and web spamming substantially degrades search engine quality. Researchers have introduced several web spam detection techniques, including content-based features, link-based features, label propagation, label refinement, click-based web spam detection, and real-time web spam detection. However, identifying all spam pages on the Web with high accuracy remains unsolved. This work proposes a content-based web spam detection framework, a link-based web spam detection framework, and a combined approach that identifies both types of web spam with high accuracy and can detect the newly evolved link pyramid. The content-based framework uses three proposed and two improved content-based algorithms. The link-based framework first exposes the relationship network behind link spamming and then applies the paid-links database algorithm, the spam signals algorithm, and the improved link farms algorithm to identify link-based web spam. Finally, combining the content-based and link-based frameworks enhances the accuracy of web spam detection. The performance of the proposed combined approach was evaluated and compared with the J48 classifier, the C4.5 decision tree classifier, the SVM classifier, and a heuristic combined approach. Experiments were conducted to obtain the threshold values using the proposed collection architecture on the well-known datasets WEBSPAM-UK2006 and WEBSPAM-UK2007. The results show that the proposed methods outperform the other methods, achieving 82.1% precision and an F-measure of 80.6%, which illustrates the proposed framework's effectiveness and applicability.
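The abstract reports precision and F-measure but not recall. As a rough reading aid, the sketch below shows how the two reported figures relate, assuming the F-measure is the balanced F1 score (the harmonic mean of precision and recall). That assumption is not confirmed by the record, and the function names are illustrative only. Under it, the reported values imply a recall of roughly 0.79.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for this precision/F1 pair")
    return f1 * precision / denominator


reported_precision = 0.821  # from the abstract
reported_f1 = 0.806         # from the abstract; ASSUMED to be balanced F1

recall = implied_recall(reported_precision, reported_f1)
print(f"implied recall: {recall:.3f}")
print(f"round-trip F1:  {f_measure(reported_precision, recall):.3f}")
```

If the thesis uses a weighted F-beta score instead, the implied recall would differ, so the full text should be checked before quoting this figure.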
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Shahzad, Asim
author_facet Shahzad, Asim
author_sort Shahzad, Asim
title An improved framework for content and link-based web spam detection: a combined approach
title_short An improved framework for content and link-based web spam detection: a combined approach
title_full An improved framework for content and link-based web spam detection: a combined approach
title_fullStr An improved framework for content and link-based web spam detection: a combined approach
title_full_unstemmed An improved framework for content and link-based web spam detection: a combined approach
title_sort improved framework for content and link-based web spam detection: a combined approach
granting_institution Universiti Tun Hussein Onn Malaysia
granting_department Faculty of Computer Science and Information Technology
publishDate 2021
url http://eprints.uthm.edu.my/1777/2/ASIM%20SHAHZAD%20-%20declaration.pdf
http://eprints.uthm.edu.my/1777/1/ASIM%20SHAHZAD%20-%2024p.pdf
http://eprints.uthm.edu.my/1777/3/ASIM%20SHAHZAD%20-%20fulltext.pdf
_version_ 1747830864999350272