Improved random forest for feature selection in writer identification

Writer Identification (WI) is a process to determine the writer of a given handwriting sample. A handwriting sample consists of various types of features. These features are unique due to the writer’s characteristics and individuality, which challenges the identification process. Some features do no...

Bibliographic Details
Main Author: Sukor, Nooraziera Akmal
Format: Thesis
Language: English
Published: 2015
Subjects:
Online Access:http://eprints.utem.edu.my/id/eprint/16842/1/Improved%20Random%20Forest%20For%20Feature%20Selection%20In%20Writer%20Identification.pdf
http://eprints.utem.edu.my/id/eprint/16842/2/Improved%20random%20forest%20for%20feature%20selection%20in%20writer%20identification.pdf
id my-utem-ep.16842
record_format uketd_dc
institution Universiti Teknikal Malaysia Melaka
collection UTeM Repository
language English
advisor Draman @ Muda, Azah Kamilah
topic T Technology (General)
description Writer Identification (WI) is the process of determining the writer of a given handwriting sample. A handwriting sample contains various types of features. These features are unique to each writer’s characteristics and individuality, which makes the identification process challenging. Some features do not provide useful information and may degrade the performance of a classifier. Feature selection is therefore included in the WI process. Feature selection identifies and selects the most significant features from those present in handwriting documents and eliminates the irrelevant ones. Following the WI framework, discretization is applied before feature selection. Discretization has been shown to improve classification and identification performance in WI. An Improved Random Forest (IRF) tree algorithm and framework was applied for feature selection. An RF tree is a collection of tree predictors: an ensemble of decision tree models with a randomized selection of features at each split. The trees are built using Classification and Regression Tree (CART). Important features are measured using Variable Importance (VI), while Mean Absolute Error (MAE) values are used to identify the variance between writers. The VI value is used for the splitting process in the tree, and the MAE value is used to ensure that intra-class (same writer) invariance is lower than inter-class (different writer) invariance, because lower intra-class invariance indicates a closer match to the real author. The number of selected features and the classification accuracy are used to indicate the performance of the feature selection method. Experimental results show that, on the discretized dataset, the IRF tree identified the third feature (f3) as the most important feature, with an average classification accuracy of 99.19%. On the un-discretized dataset, the first feature (f1) and third feature (f3) were the most important features, with an average classification accuracy of 40.79%.
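
The intra-class versus inter-class MAE check mentioned in the description can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes MAE is the mean absolute difference between two feature vectors, averaged over sample pairs, and the function names and toy data below are hypothetical.

```python
from itertools import combinations, product
from statistics import mean


def mae(a, b):
    """Mean absolute error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return mean(abs(x - y) for x, y in zip(a, b))


def intra_class_mae(samples_by_writer):
    """Average MAE over all pairs of samples from the same writer."""
    return mean(
        mae(a, b)
        for samples in samples_by_writer.values()
        for a, b in combinations(samples, 2)
    )


def inter_class_mae(samples_by_writer):
    """Average MAE over all pairs of samples from different writers."""
    return mean(
        mae(a, b)
        for w1, w2 in combinations(samples_by_writer, 2)
        for a, b in product(samples_by_writer[w1], samples_by_writer[w2])
    )


# Hypothetical toy data: two writers, two samples each, three features (f1..f3).
toy = {
    "writer_a": [[1.0, 2.0, 3.0], [1.0, 2.0, 4.0]],
    "writer_b": [[5.0, 6.0, 7.0], [5.0, 7.0, 7.0]],
}

intra = intra_class_mae(toy)
inter = inter_class_mae(toy)
# The property described in the abstract: same-writer samples should be
# closer to each other than samples from different writers.
print(intra < inter)
```

On this toy data the intra-class value comes out smaller than the inter-class value, which is the condition the description says the MAE is used to verify.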
format Thesis
qualification_name Master of Philosophy (M.Phil.)
qualification_level Master's degree
author Sukor, Nooraziera Akmal
title Improved random forest for feature selection in writer identification
granting_institution Universiti Teknikal Malaysia Melaka
granting_department Faculty Of Information And Communication Technology
publishDate 2015
_version_ 1747833900459098112
spelling my-utem-ep.16842 2022-06-07T13:30:20Z Improved random forest for feature selection in writer identification 2015 Sukor, Nooraziera Akmal T Technology (General) TA Engineering (General). Civil engineering (General) 2015 Thesis http://eprints.utem.edu.my/id/eprint/16842/ https://plh.utem.edu.my/cgi-bin/koha/opac-detail.pl?biblionumber=96166 mphil masters Universiti Teknikal Malaysia Melaka Faculty Of Information And Communication Technology Draman @ Muda, Azah Kamilah