Multi-document text summarization using text clustering for Arabic Language

Multi-document summarization is the process of producing a single summary from a collection of related documents. This work focuses on generic extractive multi-document summarization for Arabic and describes a cluster-based approach to the task. The problem with multi-document text...

Bibliographic Details
Main Author: Waheeb, Samer Abdulateef
Format: Thesis
Language: eng
Published: 2014
Subjects:
Online Access:https://etd.uum.edu.my/4373/1/s812273.pdf
https://etd.uum.edu.my/4373/7/s812273_abstract.pdf
id my-uum-etd.4373
record_format uketd_dc
institution Universiti Utara Malaysia
collection UUM ETD
language eng
advisor Husni, Husniza
Ahmad, Faudziah
topic QA76 Computer software
description Multi-document summarization is the process of producing a single summary from a collection of related documents. This work focuses on generic extractive multi-document summarization for Arabic and describes a cluster-based approach to the task. The main problem in multi-document summarization is sentence redundancy, which must be eliminated to ensure coherence and improve readability. The main objective is therefore to examine how salient information can be identified for Arabic multi-document summarization in the presence of noisy and redundant information. The Essex Arabic Summaries Corpus (EASC) was used as the test data. In the first step, the original text was tokenized into words, stop words were removed, the root of each word was extracted, and the text was represented as a bag of words weighted by TF-IDF, without the noisy information. In the second step, the K-means algorithm with cosine similarity was applied, and the best cluster was selected based on cluster ordering by distance performance. After the best cluster was selected, an SVM was applied to order its sentences, and the highest-weighted sentences were chosen for the final summary to reduce redundant information. Finally, the summaries for the ten categories of related documents were evaluated using Recall and Precision; the best result achieved was a Recall of 0.6 and a Precision of 0.6.
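The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation, and it rests on several assumptions: the stop-word list is a small English placeholder, root extraction is skipped, the "best cluster" is taken to be the largest one, and sentences are ranked by cosine similarity to the cluster centroid in place of the SVM ordering step. All function names are hypothetical.

```python
import math
from collections import Counter

# Placeholder stop-word list (assumption). The thesis works on Arabic text
# and also reduces each word to its root, which this sketch skips.
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "and", "to"}

def tokenize(sentence):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (dict: term -> weight) per sentence."""
    counts = [Counter(tokenize(s)) for s in sentences]
    n = len(counts)
    df = Counter(term for c in counts for term in c)
    return [{t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in c.items()}
            for c in counts]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def kmeans_cosine(vectors, k, iterations=10):
    """K-means with cosine similarity; the first k vectors seed the centroids."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels, centroids

def summarize(sentences, k=2, top_n=1):
    """Pick the top_n sentences closest to the centroid of the largest cluster."""
    vectors = tfidf_vectors(sentences)
    labels, centroids = kmeans_cosine(vectors, k)
    best = max(range(k), key=labels.count)  # assumed "best cluster" rule
    ranked = sorted((i for i, lab in enumerate(labels) if lab == best),
                    key=lambda i: cosine(vectors[i], centroids[best]),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]

def precision_recall(system, reference):
    """Sentence-level precision and recall against a reference summary."""
    sys_set, ref_set = set(system), set(reference)
    overlap = len(sys_set & ref_set)
    precision = overlap / len(sys_set) if sys_set else 0.0
    recall = overlap / len(ref_set) if ref_set else 0.0
    return precision, recall
```

Calling `summarize(sentences, k=2, top_n=1)` on a list of sentences returns one extracted sentence, and `precision_recall(summary, reference)` compares that extract with a reference summary at the sentence level. Choosing a single cluster and keeping only its highest-ranked sentences is one simple way to limit redundancy, since near-duplicate sentences tend to fall into the same cluster.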
format Thesis
qualification_name masters
qualification_level Master's degree
author Waheeb, Samer Abdulateef
title Multi-document text summarization using text clustering for Arabic Language
granting_institution Universiti Utara Malaysia
granting_department Awang Had Salleh Graduate School of Arts & Sciences
publishDate 2014
url https://etd.uum.edu.my/4373/1/s812273.pdf
https://etd.uum.edu.my/4373/7/s812273_abstract.pdf
_version_ 1776103639911235584