An Improved K-Nearest Neighbors Approach Using Modified Term Weighting And Similarity Coefficient For Text Classification

Automatic text classification is important because of the growing number of digital documents and the resulting need to manage them. Current statistical modeling methods do not provide sufficient useful information about the topics for each feature and category. Furthermore, feature extraction using term frequency-...

Bibliographic Details
Main Author:	Kadhim, Ammar Ismael
Format:	Thesis
Language:	English
Published:	2016
Subjects:	QA75.5-76.95 Electronic computers. Computer science
Online Access:	http://eprints.usm.my/31479/1/AMMAR_ISMAEL_KADHIM_24.pdf

id	my-usm-ep.31479
record_format	uketd_dc
spelling	my-usm-ep.31479 2019-04-12T05:25:22Z An Improved K-Nearest Neighbors Approach Using Modified Term Weighting And Similarity Coefficient For Text Classification 2016-03 Kadhim, Ammar Ismael QA75.5-76.95 Electronic computers. Computer science 2016-03 Thesis http://eprints.usm.my/31479/ http://eprints.usm.my/31479/1/AMMAR_ISMAEL_KADHIM_24.pdf application/pdf en public phd doctoral Universiti Sains Malaysia Pusat Pengajian Sains Komputer (School of Computer Sciences)
institution	Universiti Sains Malaysia
collection	USM Institutional Repository
language	English
topic	QA75.5-76.95 Electronic computers. Computer science
spellingShingle	QA75.5-76.95 Electronic computers Computer science Kadhim, Ammar Ismael An Improved K-Nearest Neighbors Approach Using Modified Term Weighting And Similarity Coefficient For Text Classification
description	Automatic text classification is important because the growing availability of digital documents creates a need to organize them. Current state-of-the-art statistical modeling approaches do not provide sufficient useful information on the topics for each feature and category. Furthermore, feature extraction using traditional term frequency-inverse document frequency (TF-IDF) results in the identification of too many categories for a particular document. In terms of classification, current k-nearest neighbors (k-NN) approaches with Euclidean distance and cosine similarity score show a wide variance in performance. To address these issues, this study classifies topics for short and long texts using a new approach for the main stages of text classification (i.e., feature extraction and text classification). The study also introduces TF-IDF with logarithm and k-NN with a new cosine similarity score for feature extraction and classification, respectively. In addition, factors that affect the performance of supervised machine learning are identified.
format	Thesis
qualification_name	Doctor of Philosophy (PhD.)
qualification_level	Doctorate
author	Kadhim, Ammar Ismael
author_facet	Kadhim, Ammar Ismael
author_sort	Kadhim, Ammar Ismael
title	An Improved K-Nearest Neighbors Approach Using Modified Term Weighting And Similarity Coefficient For Text Classification
title_short	An Improved K-Nearest Neighbors Approach Using Modified Term Weighting And Similarity Coefficient For Text Classification
title_full	An Improved K-Nearest Neighbors Approach Using Modified Term Weighting And Similarity Coefficient For Text Classification
title_fullStr	An Improved K-Nearest Neighbors Approach Using Modified Term Weighting And Similarity Coefficient For Text Classification
title_full_unstemmed	An Improved K-Nearest Neighbors Approach Using Modified Term Weighting And Similarity Coefficient For Text Classification
title_sort	improved k-nearest neighbors approach using modified term weighting and similarity coefficient for text classification
granting_institution	Universiti Sains Malaysia
granting_department	Pusat Pengajian Sains Komputer (School of Computer Sciences)
publishDate	2016
url	http://eprints.usm.my/31479/1/AMMAR_ISMAEL_KADHIM_24.pdf
_version_	1747820433743282176

An Improved K-Nearest Neighbors Approach Using Modified Term Weighting And Similarity Coefficient For Text Classification

Similar Items