Text Extraction Algorithm for Web Text Classification

The explosive growth of web pages on the World Wide Web makes it difficult for search engines and web directories to return results relevant to user requirements. Web pages therefore need automatic classification techniques with high classification accuracy. This study provides a text extraction algorithm for web...

Bibliographic Details
Main Author: Theab, Mustafa Muwafak
Format: Thesis
Language: English
Published: 2010
Subjects: QA71-90 Instruments and machines
Online Access: https://etd.uum.edu.my/2164/1/Mustafa_Muwafak_Theab.pdf
id my-uum-etd.2164
record_format uketd_dc
spelling my-uum-etd.2164 2013-07-24T12:14:42Z https://etd.uum.edu.my/2164/ application/pdf validuser http://lintas.uum.edu.my:8080/elmu/index.jsp?module=webopac-l&action=fullDisplayRetriever.jsp&szMaterialNo=0000757917
institution Universiti Utara Malaysia
collection UUM ETD
language eng
advisor Ku Mahamud, Ku Ruhana
Ahmad, Faudziah
topic QA71-90 Instruments and machines
description The explosive growth of web pages on the World Wide Web makes it difficult for search engines and web directories to return results relevant to user requirements. Web pages therefore need automatic classification techniques with high classification accuracy. This study provides a text extraction algorithm for web text classification. The extraction algorithm consists of three phases, namely web page extraction, rule formulation, and algorithm validation. A text extraction prototype was built in Visual C# 2008 to validate the algorithm; it is a Windows application that also uses a web connection protocol. The prototype can create a binary data set as well as a term frequency-inverse document frequency (tf-idf) data set. The experiment was conducted on five English-language educational websites. The created data sets were then classified using the Naive Bayes and C4.5 algorithms provided in the WEKA application. The experimental results show that the Naive Bayes classifier combined with the web text extraction algorithm proved to be the best of the tested methods for web text classification.
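The two data set types named in the description can be illustrated with a minimal Python sketch. This is not the thesis prototype (which was written in Visual C# 2008), and the record does not state which tf-idf variant the author used; the sketch assumes term frequency normalised by document length and a natural-log inverse document frequency. The names `tokenize` and `build_datasets` and the sample page texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of ASCII letters only (a simplifying assumption).
    return re.findall(r"[a-z]+", text.lower())


def build_datasets(documents):
    """Return (vocabulary, binary_rows, tfidf_rows) for a list of page texts.

    Assumed weighting, not taken from the thesis:
    tf = term count / document length, idf = ln(N / document frequency).
    """
    tokenized = [tokenize(d) for d in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized for term in set(doc))

    binary_rows, tfidf_rows = [], []
    for doc in tokenized:
        counts = Counter(doc)
        # Binary data set: 1 if the term occurs in the page, else 0.
        binary_rows.append([1 if term in counts else 0 for term in vocabulary])
        # tf-idf data set: weight each term by frequency and rarity.
        tfidf_rows.append([
            (counts[term] / len(doc)) * math.log(n_docs / df[term]) if doc else 0.0
            for term in vocabulary
        ])
    return vocabulary, binary_rows, tfidf_rows


if __name__ == "__main__":
    pages = [
        "Course syllabus and course schedule",
        "Library opening hours",
        "Course registration",
    ]
    vocab, binary, tfidf = build_datasets(pages)
    print(vocab)
    print(binary[0])
    print(tfidf[0])
```

Each row corresponds to one page and each column to one vocabulary term, which is the shape a classifier such as Naive Bayes or C4.5 expects once a class label column is appended.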
format Thesis
qualification_name masters
qualification_level Master's degree
author Theab, Mustafa Muwafak
title Text Extraction Algorithm for Web Text Classification
granting_institution Universiti Utara Malaysia
granting_department College of Arts and Sciences (CAS)
publishDate 2010
url https://etd.uum.edu.my/2164/1/Mustafa_Muwafak_Theab.pdf