PGLDA: enhancing the precision of topic modelling using Poisson gamma (PG) and latent Dirichlet allocation (LDA) for text information retrieval

The Poisson document length distribution has been used extensively in the past for modeling topics, with the expectation that its effect will disintegrate at the end of the model definition. This procedure often leads to downplaying word correlation with topics and reducing retrieved documents'...

Bibliographic Details
Main Author: Bakari, Ibrahim Bala
Format: Thesis
Language: English
Published: 2021
Subjects: QA76 Computer software; T Technology (General)
Online Access: http://eprints.uthm.edu.my/4890/1/24p%20IBRAHIM%20BALA%20BAKARI.pdf
http://eprints.uthm.edu.my/4890/2/IBRAHIM%20BALA%20BAKARI%20COPYRIGHT%20DECLARATION.pdf
http://eprints.uthm.edu.my/4890/3/IBRAHIM%20BALA%20BAKARI%20WATERMARK.pdf
id my-uthm-ep.4890
record_format uketd_dc
spelling my-uthm-ep.4890 2022-02-03T03:08:46Z PGLDA: enhancing the precision of topic modelling using Poisson gamma (PG) and latent Dirichlet allocation (LDA) for text information retrieval 2021-07 Bakari, Ibrahim Bala QA76 Computer software T Technology (General) The Poisson document length distribution has been used extensively in the past for modeling topics, with the expectation that its effect will disintegrate at the end of the model definition. This procedure often leads to downplaying word correlation with topics and reducing the precision or accuracy of retrieved documents. Existing document models, such as the Latent Dirichlet Allocation (LDA) model, do not accommodate words' semantic representation. Therefore, in this thesis, the Poisson-Gamma Latent Dirichlet Allocation (PGLDA) model for modeling word dependencies in topic modeling is introduced. The PGLDA model relaxes the word independence assumption of the existing LDA model by introducing a Gamma distribution that captures the correlation between adjacent words in documents. PGLDA is hybridized with the distributed representations of documents (Doc2Vec) and topics (Topic2Vec) to form a new model named PGLDA2Vec. The hybridization was achieved by averaging the Doc2Vec and Topic2Vec vectors to form new word representation vectors, combined with the topics with the largest estimated probability under PGLDA. Model estimation for PGLDA and PGLDA2Vec was achieved by combining the Laplacian approximation of the log-likelihood for PGLDA with the Feed-Forward Neural Network (FFN) approaches of Doc2Vec and Topic2Vec. The proposed PGLDA and hybrid PGLDA2Vec models were assessed using precision, micro F1 score, perplexity, and coherence score. The empirical results on three real-world datasets (20 Newsgroups, AG News, and Reuters) showed that the hybrid PGLDA2Vec model, with an average precision of 86.6% and an average F1 score of 96.3% across the three datasets, outperforms the other competing models reviewed. 2021-07 Thesis http://eprints.uthm.edu.my/4890/ http://eprints.uthm.edu.my/4890/1/24p%20IBRAHIM%20BALA%20BAKARI.pdf text en public http://eprints.uthm.edu.my/4890/2/IBRAHIM%20BALA%20BAKARI%20COPYRIGHT%20DECLARATION.pdf text en staffonly http://eprints.uthm.edu.my/4890/3/IBRAHIM%20BALA%20BAKARI%20WATERMARK.pdf text en validuser phd doctoral Universiti Tun Hussein Onn Malaysia Fakulti Sains Komputer dan Teknologi Maklumat
institution Universiti Tun Hussein Onn Malaysia
collection UTHM Institutional Repository
language English
topic QA76 Computer software
T Technology (General)
spellingShingle QA76 Computer software
T Technology (General)
Bakari, Ibrahim Bala
PGLDA: enhancing the precision of topic modelling using Poisson gamma (PG) and latent Dirichlet allocation (LDA) for text information retrieval
description The Poisson document length distribution has been used extensively in the past for modeling topics, with the expectation that its effect will disintegrate at the end of the model definition. This procedure often leads to downplaying word correlation with topics and reducing the precision or accuracy of retrieved documents. Existing document models, such as the Latent Dirichlet Allocation (LDA) model, do not accommodate words' semantic representation. Therefore, in this thesis, the Poisson-Gamma Latent Dirichlet Allocation (PGLDA) model for modeling word dependencies in topic modeling is introduced. The PGLDA model relaxes the word independence assumption of the existing LDA model by introducing a Gamma distribution that captures the correlation between adjacent words in documents. PGLDA is hybridized with the distributed representations of documents (Doc2Vec) and topics (Topic2Vec) to form a new model named PGLDA2Vec. The hybridization was achieved by averaging the Doc2Vec and Topic2Vec vectors to form new word representation vectors, combined with the topics with the largest estimated probability under PGLDA. Model estimation for PGLDA and PGLDA2Vec was achieved by combining the Laplacian approximation of the log-likelihood for PGLDA with the Feed-Forward Neural Network (FFN) approaches of Doc2Vec and Topic2Vec. The proposed PGLDA and hybrid PGLDA2Vec models were assessed using precision, micro F1 score, perplexity, and coherence score. The empirical results on three real-world datasets (20 Newsgroups, AG News, and Reuters) showed that the hybrid PGLDA2Vec model, with an average precision of 86.6% and an average F1 score of 96.3% across the three datasets, outperforms the other competing models reviewed.
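The hybridization step in the description (averaging a Doc2Vec vector with a Topic2Vec vector and pairing the result with the most probable topic) can be illustrated with a minimal Python sketch. This is one possible reading of the abstract, not the thesis's implementation: the function names, the plain element-wise mean, and the toy vectors and probabilities are all assumptions made for illustration, and no trained Doc2Vec, Topic2Vec, or PGLDA model is involved.

```python
from typing import List, Sequence, Tuple


def average_vectors(doc_vec: Sequence[float], topic_vec: Sequence[float]) -> List[float]:
    """Element-wise mean of two equal-length vectors (assumed combination rule)."""
    if len(doc_vec) != len(topic_vec):
        raise ValueError("vectors must have the same dimensionality")
    return [(d + t) / 2.0 for d, t in zip(doc_vec, topic_vec)]


def most_probable_topic(topic_probs: Sequence[float]) -> int:
    """Index of the topic with the largest estimated probability."""
    if not topic_probs:
        raise ValueError("topic_probs must not be empty")
    return max(range(len(topic_probs)), key=lambda k: topic_probs[k])


def hybrid_representation(
    doc_vec: Sequence[float],
    topic_vecs: Sequence[Sequence[float]],
    topic_probs: Sequence[float],
) -> Tuple[int, List[float]]:
    """Pick the most probable topic, then average its vector with the document vector."""
    k = most_probable_topic(topic_probs)
    return k, average_vectors(doc_vec, topic_vecs[k])


# Toy inputs (hypothetical values, standing in for model outputs).
doc_vec = [0.5, 0.25]                    # stand-in for a Doc2Vec document vector
topic_vecs = [[0.0, 1.0], [1.0, 0.75]]   # stand-ins for Topic2Vec topic vectors
topic_probs = [0.25, 0.75]               # stand-in for per-topic probabilities from a topic model

topic_id, hybrid_vec = hybrid_representation(doc_vec, topic_vecs, topic_probs)
print(topic_id, hybrid_vec)  # prints: 1 [0.75, 0.5]
```

In practice the vectors would come from trained embedding models and the probabilities from the fitted topic model; the sketch only shows how the two pieces might be combined once those outputs exist.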
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Bakari, Ibrahim Bala
author_facet Bakari, Ibrahim Bala
author_sort Bakari, Ibrahim Bala
title PGLDA: enhancing the precision of topic modelling using Poisson gamma (PG) and latent Dirichlet allocation (LDA) for text information retrieval
title_short PGLDA: enhancing the precision of topic modelling using Poisson gamma (PG) and latent Dirichlet allocation (LDA) for text information retrieval
title_full PGLDA: enhancing the precision of topic modelling using Poisson gamma (PG) and latent Dirichlet allocation (LDA) for text information retrieval
title_fullStr PGLDA: enhancing the precision of topic modelling using Poisson gamma (PG) and latent Dirichlet allocation (LDA) for text information retrieval
title_full_unstemmed PGLDA: enhancing the precision of topic modelling using Poisson gamma (PG) and latent Dirichlet allocation (LDA) for text information retrieval
title_sort pglda: enhancing the precision of topic modelling using poisson gamma (pg) and latent dirichlet allocation (lda) for text information retrieval
granting_institution Universiti Tun Hussein Onn Malaysia
granting_department Fakulti Sains Komputer dan Teknologi Maklumat
publishDate 2021
url http://eprints.uthm.edu.my/4890/1/24p%20IBRAHIM%20BALA%20BAKARI.pdf
http://eprints.uthm.edu.my/4890/2/IBRAHIM%20BALA%20BAKARI%20COPYRIGHT%20DECLARATION.pdf
http://eprints.uthm.edu.my/4890/3/IBRAHIM%20BALA%20BAKARI%20WATERMARK.pdf
_version_ 1747831058317967360