Improved feature extraction and lexicon reduction methods classified by support vector machine for Farsi handwritten word recognition system

Automatic word recognition has been an intensive research subject for many languages in recent decades, but it remains far from solved for some languages. Word recognition is divided into two types: online and offline. The current research is focused on the offline handwritten w...

Bibliographic Details
Main Author: Akbarpour, Shahin
Format: Thesis
Language: English
Published: 2011
Subjects: Support vector machines; Persian language - Written Persian; APT (Computer program language)
Online Access: http://psasir.upm.edu.my/id/eprint/26987/1/FSKTM%202011%2021R.pdf
id my-upm-ir.26987
record_format uketd_dc
spelling my-upm-ir.26987 2015-05-14T07:24:38Z 2011-08 Thesis http://psasir.upm.edu.my/id/eprint/26987/ application/pdf en public phd doctoral
institution Universiti Putra Malaysia
collection PSAS Institutional Repository
language English
topic Support vector machines
Persian language - Written Persian
APT (Computer program language)
description Automatic word recognition has been an intensive research subject for many languages in recent decades, but it remains far from solved for some languages. Word recognition is divided into two types: online and offline. The current research focuses on offline Farsi handwritten word recognition (FHWR). An offline handwritten word recognition system includes many stages, all of which should be improved to enhance the accuracy of the system. In addition, reducing the lexicon size is one of the most actively discussed ways of improving the accuracy of handwritten word recognition. Many studies have been carried out so far, but FHWR has not been researched as thoroughly as Latin or Chinese handwriting recognition. Several attempts have been made to address FHWR, most of which focus on image preprocessing and segmentation. Some studies have also addressed feature extraction, classification, and lexicon reduction methods. The latest and most successful prior studies used a feature extraction method, a lexicon reduction method, and a hidden Markov model (HMM). However, their recognition rate is limited because the feature extraction method could not adequately describe the Farsi word. Moreover, HMMs have some limitations, and several segmentation errors occurred in their lexicon reduction. The current research addresses these problems by improving the recognition rate of FHWR through new feature extraction and lexicon reduction methods and a suitable classifier. In this regard, some special attributes of Farsi manuscripts are considered, such as the stroke directions, the non-unique distribution of black pixels in the binary image of the word, and the number of sub-words and dots in the word.
In addition, several classification methods other than HMM are tested to determine which one yields the best recognition rate. We developed two word recognizer systems to cater for applications with different lexicon sizes. For small lexicons, the word recognizer consists of a new feature extraction method and a classifier; for medium and large lexicons, it also includes a lexicon reduction method. To evaluate the proposed methods, we use four Farsi handwritten datasets, namely Farshids' Legal amount, 198-Cities, Iranshahr, and IFN-AUT, which contain 45, 198, 503, and 1080 class-words, respectively. To compare the results with previous work, we need the datasets used by prior researchers, who applied AUT and IFN-AUT. The AUT dataset, which included 198 class-words, was not available, so a similar dataset, 198-Cities, was created by randomly selecting 198 class-words from the Iranshahr dataset. To conduct more experiments with different lexicon sizes, the proposed methods were also run on the Farshids' Legal amount and Iranshahr datasets. Moreover, we re-implemented the existing word recognizer and lexicon reduction method so that they could be compared on the same datasets, 198-Cities and IFN-AUT. The results suggest that our methods, which consist of new feature extraction and lexicon reduction methods and the classifier, perform better than the latest works.
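The abstract names the number of sub-words and dots among the word attributes considered, and describes a lexicon reduction step for medium and large lexicons. The sketch below is only an illustration of how such counts could prune a lexicon before classification; it is not the thesis's actual method. The `LexiconEntry` type, the `reduce_lexicon` function, the `tolerance` parameter, and the toy labels and counts are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    # Hypothetical record: counts assumed to be precomputed per class-word.
    label: str
    subword_count: int
    dot_count: int


def reduce_lexicon(lexicon, subword_count, dot_count, tolerance=0):
    """Keep entries whose counts lie within `tolerance` of the observed counts.

    `tolerance` is an assumed knob for absorbing counting errors in the
    word image; it is not a parameter taken from the thesis.
    """
    return [
        entry
        for entry in lexicon
        if abs(entry.subword_count - subword_count) <= tolerance
        and abs(entry.dot_count - dot_count) <= tolerance
    ]


# Toy lexicon with placeholder labels and made-up counts.
lexicon = [
    LexiconEntry("word_a", 1, 0),
    LexiconEntry("word_b", 2, 1),
    LexiconEntry("word_c", 2, 3),
    LexiconEntry("word_d", 3, 1),
]

# Suppose an input word image was measured as 2 sub-words and 1 dot.
candidates = reduce_lexicon(lexicon, subword_count=2, dot_count=1)
print([entry.label for entry in candidates])
```

Only the surviving candidates would then be passed to the classifier. A larger `tolerance` trades a bigger candidate set for robustness against miscounted dots or sub-words.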
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Akbarpour, Shahin
title Improved feature extraction and lexicon reduction methods classified by support vector machine for Farsi handwritten word recognition system
granting_institution Universiti Putra Malaysia
granting_department Faculty of Computer Science and Information Technology
publishDate 2011
url http://psasir.upm.edu.my/id/eprint/26987/1/FSKTM%202011%2021R.pdf