Feature selection method of web page language identification

Globalization has led to a significant increase in the information flow between geographically remote locations with the realization of a common global market. When building a web site for use by various industries, developers need to deal with a wide range of users from different countries. Thus, a...


Bibliographic Details
Main Author:	Ng, Choon Ching
Format:	Thesis
Language:	English
Published:	2010
Subjects:	QA75 Electronic computers Computer science
Online Access:	http://eprints.utm.my/id/eprint/11497/1/NgChoonChingMFSKSM2010.pdf

id	my-utm-ep.11497
record_format	uketd_dc
spelling	my-utm-ep.11497 2018-08-26T04:53:03Z Feature selection method of web page language identification 2010-02 Ng, Choon Ching QA75 Electronic computers. Computer science 2010-02 Thesis http://eprints.utm.my/id/eprint/11497/ http://eprints.utm.my/id/eprint/11497/1/NgChoonChingMFSKSM2010.pdf application/pdf en public masters Universiti Teknologi Malaysia, Faculty of Computer Science and Information Systems Faculty of Computer Science and Information System
institution	Universiti Teknologi Malaysia
collection	UTM Institutional Repository
language	English
topic	QA75 Electronic computers Computer science
spellingShingle	QA75 Electronic computers Computer science Ng, Choon Ching Feature selection method of web page language identification
description	Globalization has led to a significant increase in the information flow between geographically remote locations with the realization of a common global market. When building a web site for use by various industries, developers need to deal with a wide range of users from different countries. Thus, a multilingual system must be implemented in order to provide the proper environment for those applications. Different languages can be written in the same script; English, Malay and Spanish, for example, all use Roman script. The issue is how to produce reliable features of a web page that is to undergo language identification. Incorrectly identifying the language will result in garbled translations and faulty or incomplete analyses. The aim of this study is to enhance the effectiveness of feature selection methods for web page language identification. Two approaches are proposed to identify the language of a web page: a letter weighting method as feature selection embedded with fuzzy Adaptive Resonance Theory Map (ARTMAP), and simplified entropy embedded with a decision tree. The methodology contains four major stages: data preparation, data preprocessing, feature selection and identification. Data is collected from news websites and then preprocessed to filter out noise. Feature selection reduces unnecessary attributes of the data to give a proper feature representation. Language identification determines which of the predefined languages the data belongs to. Languages in the Arabic, Hanzi, Roman, Indic and Cyrillic scripts were used for the performance evaluation of web page language identification. Standard measurements such as the t-test, f-fold cross-validation, precision, recall and F1 were applied to the results of the analysis. From the experimental analysis, it is observed that simplified entropy, with an accuracy of 98.90%, outperforms N-grams (81.35%), entropy (96.08%) and letter weighting (93.16%) feature selection. The findings indicate that the proposed letter weighting and simplified entropy feature selection methods give promising results in terms of accuracy and retrieval performance at the letter representation level of web pages.
format	Thesis
qualification_level	Master's degree
author	Ng, Choon Ching
author_facet	Ng, Choon Ching
author_sort	Ng, Choon Ching
title	Feature selection method of web page language identification
title_short	Feature selection method of web page language identification
title_full	Feature selection method of web page language identification
title_fullStr	Feature selection method of web page language identification
title_full_unstemmed	Feature selection method of web page language identification
title_sort	feature selection method of web page language identification
granting_institution	Universiti Teknologi Malaysia, Faculty of Computer Science and Information Systems
granting_department	Faculty of Computer Science and Information System
publishDate	2010
url	http://eprints.utm.my/id/eprint/11497/1/NgChoonChingMFSKSM2010.pdf
_version_	1747814863549235200
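
Illustrative note: the description in this record mentions entropy-based feature selection at the letter representation level but does not define the thesis's "simplified entropy" or letter weighting formulas. The sketch below is therefore not the author's method. It assumes plain Shannon entropy of each letter's relative frequency across languages, and the function names (letter_counts, letter_entropy, select_letters) are hypothetical. Letters with low entropy are concentrated in few languages and so are more discriminative.

```python
import math
from collections import Counter, defaultdict


def letter_counts(text):
    """Count alphabetic characters in text, case-folded."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def letter_entropy(samples):
    """Shannon entropy (in bits) of each letter's distribution over languages.

    samples: iterable of (language, text) pairs.
    Assumption: plain Shannon entropy, not the thesis's simplified variant.
    """
    per_lang = defaultdict(Counter)
    for lang, text in samples:
        per_lang[lang].update(letter_counts(text))

    # Use relative frequencies within each language so that a larger
    # corpus for one language does not dominate the distribution.
    rel = {}
    for lang, counts in per_lang.items():
        total = sum(counts.values())
        rel[lang] = {ch: n / total for ch, n in counts.items()}

    letters = set().union(*(freqs.keys() for freqs in rel.values()))
    entropy = {}
    for ch in letters:
        weights = [rel[lang].get(ch, 0.0) for lang in rel]
        mass = sum(weights)
        probs = [w / mass for w in weights if w > 0]
        entropy[ch] = sum(-p * math.log2(p) for p in probs)
    return entropy


def select_letters(samples, k):
    """Return the k letters with the lowest entropy (ties broken by letter)."""
    ent = letter_entropy(samples)
    return sorted(ent, key=lambda ch: (ent[ch], ch))[:k]


if __name__ == "__main__":
    toy = [("en", "the why"), ("es", "el niño")]
    print(select_letters(toy, 3))
```

The selected letters could then serve as the reduced attribute set passed to a classifier such as a decision tree; that step is omitted here because the record gives no details of it.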
