Using WordNet to enhance feature selection in automated text categorization
Main Author: | |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2004 |
Subjects: | |
Online Access: | http://ir.unimas.my/id/eprint/12604/1/Stephanie%20Chua%20Hui%20Li%20ft.pdf |
Summary: | In the field of automated text categorization, the large dimensionality of the feature
space is a major problem as it involves extensive computations. Feature selection is
one of the approaches to reduce the dimensionality of the feature space. This research
explores the use of WordNet (Miller et al., 1990), a lexical database, for performing
feature selection for an automated text categorization system. The WordNet-based
approach employs lexical and semantic information for feature selection. WordNet
allows the selection of terms that are lexically and semantically representative of a
category of documents, as opposed to statistical approaches traditionally used for
feature selection.
We proposed three WordNet-based approaches for feature selection. The first, the
WordNet nouns approach, selects as features all nouns in WordNet that occur in
each category. The second, the lexical semantics approach, selects synonymous
terms that co-occur in a category. The third combines the lexical semantics
approach with statistical feature selection methods.
The lexical semantics approach performed better than the WordNet nouns approach,
with more than a 40% reduction in feature space in the experiments using the
Reuters-21578 dataset. The lexical semantics approach also outperformed popular
statistical feature selection methods, namely Chi-Square (Chi2) and Information
Gain (IG). The combined approach improved the performance of the statistical
methods. WordNet was successfully used to enhance feature selection, which
highlights the possibility of determining semantic features automatically. The
limitations of the lexical semantics approach are also highlighted, and an
improved framework and an extension are proposed to overcome them. |
---|
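
The lexical semantics approach described in the summary can be illustrated with a minimal sketch. The record does not give the thesis's actual procedure, so this is only one possible reading of "selects synonymous terms that co-occur in a category": a term is kept when at least one of its synonyms also appears in documents of the same category. The `TOY_SYNSETS` list is a hypothetical stand-in for WordNet synonym sets, and the function name `select_synonym_features` and the sample documents are invented for illustration.

```python
# Hypothetical stand-in for WordNet synsets: each set groups synonymous terms.
# A real system would look these up in WordNet instead of hard-coding them.
TOY_SYNSETS = [
    {"car", "automobile", "auto"},
    {"profit", "earnings", "gain"},
    {"oil", "petroleum", "crude"},
]


def select_synonym_features(docs_by_category, synsets):
    """Return, per category, the terms that co-occur with a synonym.

    Assumed reading of the approach: a term is selected for a category
    only if another member of its synonym set also occurs somewhere in
    that category's documents.
    """
    selected = {}
    for category, docs in docs_by_category.items():
        # Collect the category vocabulary with naive whitespace tokenization.
        vocab = set()
        for doc in docs:
            vocab.update(doc.lower().split())

        features = set()
        for synset in synsets:
            present = vocab & synset
            # Keep the terms only when two or more synonyms co-occur.
            if len(present) >= 2:
                features |= present
        selected[category] = features
    return selected


if __name__ == "__main__":
    sample = {
        "earn": ["Profit rose while earnings beat forecasts", "the car sold"],
        "crude": ["oil and petroleum prices fell"],
    }
    for category, terms in select_synonym_features(sample, TOY_SYNSETS).items():
        print(category, sorted(terms))
```

In this sketch "car" is dropped from the "earn" category because none of its synonyms occur there, which is how the feature space shrinks relative to keeping every noun.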