Compound Binarization for Degraded Document Image And Feature Point Extraction For Handwritten Arabic Optical Character Recognition

Optical character recognition (OCR) is a system that aims to improve human-machine interaction and is widely used in many areas. Recognition of Arabic characters is difficult due to the cursive nature of Arabic script. The Arabic OCR system consists of five components: image acquisition, pre-processing,...

Bibliographic Details
Main Author: Arwa Mahmoud Yousef Al-Khatatneh
Format: Thesis
Language: en_US
Subjects: Optical character recognition (OCR); Arabic character sets (Data processing); Arabic scripts
id my-usim-ddms-13152
record_format uketd_dc
spelling my-usim-ddms-13152 2024-05-29T05:43:31Z 2016-11 Thesis en_US https://oarep.usim.edu.my/handle/123456789/13152 https://oarep.usim.edu.my/bitstreams/de079f30-e2e6-400b-ac96-42fe29b8cd35/download 8a4605be74aa9ea9d79846c1fba20a33
institution Universiti Sains Islam Malaysia
collection USIM Institutional Repository
language en_US
topic Optical character recognition (OCR)
Arabic character sets (Data processing)
Arabic scripts.
description Optical character recognition (OCR) is a system that aims to improve human-machine interaction and is widely used in many areas. Recognition of Arabic characters is difficult due to the cursive nature of Arabic script. The Arabic OCR system consists of five components: image acquisition, pre-processing, segmentation, feature extraction and classification. Binarization is the main pre-processing step and relies on existing local and global thresholding methods. However, those methods fail on many binarization problems, especially for degraded document images. Baseline estimation is another pre-processing step that aims to extract the virtual horizontal line on which all characters lie and join at a specific part of each character. The existing method is inaccurate owing to irregular sub-word alignment and the wide variety of free writing styles. The third component of an OCR system is segmentation of text into characters. However, the cursive form, ligatures and overlapping characters distinguish Arabic script from other languages, so an Arabic OCR system requires a highly sophisticated segmentation method. This work proposes three methods and a framework for OCR. First, a compound binarization method combines the advantages of local and global thresholding; it was tested on the DIBCO 2009, 2011 and 2013 benchmarks. Based on experimental results, the F-measure of the proposed binarization method is 88% for printed document images and 78% for handwritten ones, while the PSNR is 15.99 for printed document images and 16.34 for handwritten ones. Second, a baseline estimation method for binary images, based on feature point detection, was tested on the IFN/ENIT dataset; when the estimated pixel error is less than 15 pixels, the accuracy of the method is 87.3%. Third, a segmentation method based on baseline estimation and structural rules was tested on the IFN/ENIT dataset, with an accuracy of 87.09%. The developed methods achieved better accuracy than state-of-the-art methods under quantitative measurements. It is able to recover document images of Arabic text.
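The abstract reports F-measure, PSNR and a 15-pixel baseline tolerance without defining them. The following is a minimal sketch of how such figures are conventionally computed for binarization benchmarks, assuming the usual pixel-level definitions. The function names, the convention that 1 marks a text pixel, and the per-word baseline rows are illustrative assumptions and are not taken from the thesis.

```python
import math

def f_measure(pred, truth):
    """Pixel-level F-measure in percent. Images are nested lists, 1 = text pixel."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1          # text pixel correctly detected
            elif p and not t:
                fp += 1          # background marked as text
            elif t and not p:
                fn += 1          # text pixel missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)

def psnr(pred, truth):
    """PSNR in dB between two binary images with pixel values in {0, 1}."""
    total = 0
    sq_err = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            sq_err += (p - t) ** 2
            total += 1
    mse = sq_err / total
    if mse == 0:
        return float("inf")      # identical images
    return 10.0 * math.log10(1.0 / mse)  # peak value is 1 for binary images

def baseline_accuracy(estimated_rows, true_rows, tolerance=15):
    """Percentage of words whose baseline row error is below `tolerance` pixels."""
    hits = sum(1 for e, t in zip(estimated_rows, true_rows) if abs(e - t) < tolerance)
    return 100.0 * hits / len(true_rows)

if __name__ == "__main__":
    predicted = [[1, 1], [0, 0]]     # toy 2x2 binarization result
    ground_truth = [[1, 0], [0, 0]]  # toy 2x2 ground truth
    print(f_measure(predicted, ground_truth))
    print(psnr(predicted, ground_truth))
    print(baseline_accuracy([100, 120, 90], [105, 140, 90]))
```

Under these definitions a higher F-measure and a higher PSNR both indicate a binarized image closer to the ground truth. The thesis may use variants of these measures, so the sketch should be read as an aid to interpreting the reported numbers rather than as the evaluation code behind them.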
format Thesis
author Arwa Mahmoud Yousef Al-Khatatneh
title Compound Binarization for Degraded Document Image And Feature Point Extraction For Handwritten Arabic Optical Character Recognition
_version_ 1812444671521062912