Multi-document text summarization using text clustering for Arabic Language

The process of multi-document summarization is producing a single summary of a collection of related documents. In this work we focus on generic extractive Arabic multi-document summarizers. We also describe the cluster approach for multi-document summarization. The problem with multi-document text...

Full description

Saved in:
Bibliographic Details
Main Author: Waheeb, Samer Abdulateef
Format: Thesis
Language:eng
eng
Published: 2014
Subjects:
Online Access:https://etd.uum.edu.my/4373/1/s812273.pdf
https://etd.uum.edu.my/4373/7/s812273_abstract.pdf
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:The process of multi-document summarization is producing a single summary of a collection of related documents. In this work we focus on generic extractive Arabic multi-document summarizers. We also describe the cluster approach for multi-document summarization. The problem with multi-document text summarization is redundancy of sentences, and thus, redundancy must be eliminated to ensure coherence, and improve readability. Hence, we set out the main objective as to examine multi-document summarization salient information for text Arabic summarization task with noisy and redundancy information. In this research we used Essex Arabic Summaries Corpus (EASC) as data to test and achieve our main objective and of course its subsequent subobjectives. We used the token process to split the original text into words, and then removed all the stop words, and then we extract the root of each word, and then represented the text as bag of words by TFIDF without the noisy information. In the second step we applied the K-means algorithm with cosine similarity in our experimental to select the best cluster based on cluster ordering by distance performance. We applied SVM to order the sentences after selected the best cluster, then we selected the highest weight sentences for the final summary to reduce redundancy information. Finally, the final summary results for the ten categories of related documents are evaluated using Recall and Precision with the best Recall achieved is 0.6 and Precision is 0.6.