An enhanced Malay named entity recognition using clustering and classification approach for crime textual data analysis

Bibliographic Details
Main Author: Salleh, Muhammad Sharilazlan
Format: Thesis
Language: English
Published: 2018
Subjects: Q Science (General); QA76 Computer software
Online Access: http://eprints.utem.edu.my/id/eprint/23326/1/An%20Enhanced%20Malay%20Named%20Entity%20Recognition%20Using%20Clustering%20and%20Classification%20Approach%20For%20Crime%20Textual%20Data%20Analysis.pdf
http://eprints.utem.edu.my/id/eprint/23326/2/An%20enhanced%20malay%20named%20entity%20recognition%20using%20clustering%20and%20classification%20approach%20for%20crime%20textual%20data%20analysis.pdf
id my-utem-ep.23326
record_format uketd_dc
institution Universiti Teknikal Malaysia Melaka
collection UTeM Repository
language English
topic Q Science (General)
QA76 Computer software
description Named Entity Recognition (NER) is an information extraction task. It extracts and classifies words or entities in the proper-noun category of text data, such as person names, locations, organizations and dates. Today, online sources such as web pages, blogs, Facebook, Twitter, Instagram and online newspapers are among the major sources of material for information extraction. These sources contain various types of unstructured data such as text. However, little work has been done on processing this type of data for Malay Named Entity Recognition (MNER). This lack of Malay text analytics makes it difficult to extract information for decision making. This research presents an MNER technique for crime data analysis in the Malay language, using text extracted from the Polis Diraja Malaysia (PDRM) news web page. The technique uses multiple stages of clustering and classification, namely Fuzzy C-Means and the K-Nearest Neighbors algorithm. These methods use multi-layer feature extraction to recognize entities such as person name, location, organization, date and crime type. The multi-staged technique achieved 95.24% accuracy in recognizing named entities for text analysis, particularly in Malay. The proposed technique can improve the accuracy of named entity recognition on crime data when features suited to the Malay language are selected.
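The combination of Fuzzy C-Means clustering with K-Nearest Neighbors classification named in the description can be illustrated with a minimal sketch. This is not the thesis implementation. The token features, the toy training data, the three-cluster setting, and the way the two stages are chained (cluster memberships appended to the feature vector before classification) are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random
from collections import Counter


def memberships(point, centers, m=2.0):
    """Fuzzy membership of one point in every cluster (values sum to 1)."""
    dists = [max(math.dist(point, c), 1e-12) for c in centers]
    inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=50, seed=0):
    """Plain Fuzzy C-Means; returns (centers, membership rows)."""
    rng = random.Random(seed)
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    dim = len(points[0])
    centers = []
    for _ in range(n_iter):
        # Centers are means of the points weighted by membership ** m.
        centers = []
        for c in range(n_clusters):
            w = [row[c] ** m for row in u]
            wsum = sum(w)
            centers.append(
                [sum(wi * p[d] for wi, p in zip(w, points)) / wsum for d in range(dim)]
            )
        # Memberships are then recomputed from distances to the new centers.
        u = [memberships(p, centers, m) for p in points]
    return centers, u


def knn_predict(train_x, train_y, query, k=3):
    """Majority label among the k training vectors closest to the query."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical binary token features (assumed, not taken from the thesis):
# [is_capitalised, follows_a_title_such_as_"Encik", follows_"di", contains_digit]
train_x = [
    [1, 1, 0, 0], [1, 1, 0, 0],  # person names
    [1, 0, 1, 0], [1, 0, 1, 0],  # locations
    [0, 0, 0, 1], [0, 0, 0, 1],  # dates
]
train_y = ["PERSON", "PERSON", "LOCATION", "LOCATION", "DATE", "DATE"]

# Stage 1: cluster the tokens; stage 2: classify on features plus memberships.
centers, u = fuzzy_c_means(train_x, n_clusters=3)
stage2_x = [x + row for x, row in zip(train_x, u)]

query = [1, 1, 0, 0]
predicted = knn_predict(stage2_x, train_y, query + memberships(query, centers), k=3)
print(predicted)
```

Appending the memberships lets the classifier see the soft cluster structure alongside the raw features. Other ways of chaining the two stages are equally plausible, and the description does not say which one the thesis uses.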
format Thesis
qualification_name Master of Philosophy (M.Phil.)
qualification_level Master's degree
author Salleh, Muhammad Sharilazlan
title An enhanced Malay named entity recognition using clustering and classification approach for crime textual data analysis
granting_institution Universiti Teknikal Malaysia Melaka
granting_department Faculty of Information and Communication Technology
publishDate 2018
url http://eprints.utem.edu.my/id/eprint/23326/1/An%20Enhanced%20Malay%20Named%20Entity%20Recognition%20Using%20Clustering%20and%20Classification%20Approach%20For%20Crime%20Textual%20Data%20Analysis.pdf
http://eprints.utem.edu.my/id/eprint/23326/2/An%20enhanced%20malay%20named%20entity%20recognition%20using%20clustering%20and%20classification%20approach%20for%20crime%20textual%20data%20analysis.pdf
_version_ 1747834035446480896
spelling my-utem-ep.23326 2022-04-20T12:25:25Z An enhanced Malay named entity recognition using clustering and classification approach for crime textual data analysis 2018 Salleh, Muhammad Sharilazlan Q Science (General) QA76 Computer software
2018 Thesis http://eprints.utem.edu.my/id/eprint/23326/ http://eprints.utem.edu.my/id/eprint/23326/1/An%20Enhanced%20Malay%20Named%20Entity%20Recognition%20Using%20Clustering%20and%20Classification%20Approach%20For%20Crime%20Textual%20Data%20Analysis.pdf text en public http://eprints.utem.edu.my/id/eprint/23326/2/An%20enhanced%20malay%20named%20entity%20recognition%20using%20clustering%20and%20classification%20approach%20for%20crime%20textual%20data%20analysis.pdf text en validuser https://plh.utem.edu.my/cgi-bin/koha/opac-detail.pl?biblionumber=112736 mphil masters Universiti Teknikal Malaysia Melaka Faculty of Information and Communication Technology