Multi-font Arabic Optical Character Recognition Using HMM And Decision Tree

Posted on:2016-07-24

Degree:Doctor

Type:Dissertation

Country:China

Candidate:T F Mohammed Lutf

GTID:1108330467498404

Subject:Communication and Information System

Abstract/Summary:

Optical character recognition (OCR) is well established for many languages, especially English and Chinese, but for Arabic it is still in its early stages. Recently, much attention has been paid to Arabic for both handwritten and machine-printed text recognition. Most publications on the subject agree on one point: processing Arabic text images is a very challenging task compared with other languages. This is due to characteristics of the Arabic writing system that make the recognition task more complex: the text runs from right to left; writing is cursive in both handwritten and machine-printed text; each character takes different shapes depending on its position in a word; dots and diacritical signs appear above and below the characters; the connecting lines between characters are elongated by a variable length; ligatures may be vertical or horizontal; and characters differ in size (height and width). All of these characteristics influence the processing and recognition of Arabic characters in different ways and rule out a simple adaptation of Latin character-based processing.

The main issue with the currently proposed methods is that none of them addresses Arabic OCR by treating the characteristics of Arabic text as advantages that may simplify the problem; they describe these characteristics only as a source of complexity. In this thesis, we describe how the characteristics of Arabic writing can be used to make the recognition task simpler and to build a very robust multi-font OCR system for machine-printed Arabic. These characteristics are cursive writing, position-dependent character shape, and diacritics.

In addition to character recognition, font recognition is a fundamental issue in an OCR system, because taking the font type into account makes it possible to increase the system's efficiency. Automatic document processing (ADP) techniques tackle font recognition on the basis of two main aspects. The first generalizes the font type for all characters in the document. Used alone, this approach only reduces the number of alternative forms of each class of a font family, which clearly leads to the recognition of only one kind of font. The second aspect is the identification of the font types used within the document, which is usually neglected in spite of its importance.

Diacritics are a unique characteristic of the Arabic writing system. They were introduced after the adoption of the Arabic writing system into languages other than Arabic, such as Persian, Urdu, and Pashto. In this thesis we show how important diacritics are and how they can be used to increase the accuracy and reliability of an Arabic OCR system. First, we use diacritics for font recognition; then we build an OCR system that also uses diacritics to refine the recognition result.

We have implemented a multi-font Arabic OCR system that includes document preprocessing, feature extraction, and classification. The system has been tested using two different databases, one for font recognition and the other for character recognition. The main research work is as follows:

1. Diacritic segmentation: Three different algorithms have been developed for diacritic segmentation. Depending on the amount and the complexity of the document image, we can segment all diacritics and use them for font recognition. After all diacritics are segmented, the remaining text body is used for character recognition.

2. Feature extraction: Depending on the task, two different types of feature extraction procedures are introduced: composite central and ring projection features for font recognition, and multi-layer separation features for character recognition.

3. Classification: We use normalized cross correlation for font classification and hidden Markov models (HMMs) for character classification. The output of the HMMs is then fed to a decision tree, which combines the original text image with the HMM outputs to assign the most appropriate diacritic class to each character.

The experiments show that our approach is valid for Arabic font and character recognition. Compared with other methods, the most obvious advantage of our approach is the suppression of diacritic ambiguity, which is the main error source for any Arabic text processing technique. Another main advantage is the combination of font and character recognition: many processing blocks can be shared between the two tasks, which in turn reduces the system's processing time.
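
As a rough illustration of the font classification idea in the abstract (projection features compared by normalized cross correlation), the following Python sketch may help. It is not the thesis's implementation: the abstract does not define the "composite central and ring projection" features, so the sketch assumes a plain ring projection (foreground pixel counts in concentric rings around the centroid). The function names, the font labels and the template vectors are all hypothetical.

```python
import math


def ring_projection(image, n_rings):
    """Count foreground pixels in concentric rings around the centroid.

    `image` is a list of rows of 0/1 values. This is a plain ring
    projection, assumed here for illustration only.
    """
    points = [(r, c) for r, row in enumerate(image)
              for c, v in enumerate(row) if v]
    if not points:
        return [0.0] * n_rings
    cy = sum(r for r, _ in points) / len(points)
    cx = sum(c for _, c in points) / len(points)
    max_dist = max(math.hypot(r - cy, c - cx) for r, c in points)
    hist = [0.0] * n_rings
    for r, c in points:
        if max_dist == 0:
            idx = 0  # a single pixel sits in the innermost ring
        else:
            d = math.hypot(r - cy, c - cx)
            # Scale distance to a ring index; the farthest pixels are
            # clamped into the last ring.
            idx = min(int(d / max_dist * n_rings), n_rings - 1)
        hist[idx] += 1.0
    return hist


def ncc(a, b):
    """Zero-mean normalized cross correlation of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0  # a constant vector carries no shape information
    return sum(x * y for x, y in zip(da, db)) / denom


def classify_font(features, templates):
    """Return the template label whose vector correlates best with `features`."""
    return max(templates, key=lambda name: ncc(features, templates[name]))


if __name__ == "__main__":
    # Hypothetical 3x3 diacritic image and made-up font templates.
    diacritic = [[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 1]]
    feats = ring_projection(diacritic, 2)
    templates = {"font_a": [2.0, 8.0], "font_b": [6.0, 1.0]}
    print(feats, classify_font(feats, templates))
```

Because the correlation is normalized, a template that differs from the input only by a constant scale still scores close to 1, which is one reason this measure is convenient for comparing feature vectors taken from text of different sizes.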

Keywords/Search Tags:

Arabic, diacritics, composite central and ring projection, hidden Markov models, normalized cross correlation, decision tree

Related items

1	Offline Arabic Handwriting Identification Using Language Diacritics
2	Research On Color-to-gray Conversion Based On Normalized Cross Correlation
3	Studies On Off-line Handwriting Arabic Characters Recognition Key Technology
4	Vision Sensor And Its Image Registration Algorithm
5	Research On Cross-media Retrieval Based On Hidden Layer Semantic Correlation
6	Study On The Normalized Cross-correlation Wavefront Gradient Processor Based On CPU
7	Research On Text And Image Cross-media Retrieval Based On Decision Tree Hash
8	Research On Image Registration Based On Normalized Cross-correlation Matching
9	Research On Multiresolution Hidden Markov Model For Image Denoising
10	The Contourlet-based Statistical Models For SAR Images Denoising