
Deep Learning Based Urdu Optical Character Recognition

Posted on: 2018-01-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Ibrar
Full Text: PDF
GTID: 1318330518995981
Subject: Computer Science and Technology
Abstract/Summary:
Owing to unprecedented developments in machine learning, coupled with pattern recognition and computer vision algorithms, highly successful Optical Character Recognition (OCR) systems can be seen in every part of daily life.
OCR is a field of research in pattern recognition, artificial intelligence and computer vision. It converts typewritten, handwritten or printed text into machine-encoded text, and its use for recognizing text documents is widely applicable in both research and industry. OCR systems for many languages, such as Mandarin (Chinese), Spanish, English, Arabic, Japanese and Russian, are considerably more accurate and have numerous applications in daily life. However, some Arabic-script languages, such as Urdu, Persian, Pashto, Balochi and Sindhi, still need much more advancement and improvement in the OCR field. All these languages pose difficulties for researchers and developers because of the wide variability of character shapes and the cursiveness of the script. Devising and developing OCR systems for such languages is therefore an uphill task. Although research in this field has gained some momentum over the last decade, the scarcity of resources and researchers remains a problem.
Deep Neural Networks (DNNs) outperform other approaches on classification and recognition tasks. This better performance is primarily due to automatic feature learning: learned features often lead to better performance than human-engineered features. Moreover, neither feature-extraction expertise nor domain knowledge is required, and features can be extracted from different domains with the same algorithm. Autoencoders, stacked autoencoders, Long Short-Term Memory (LSTM) networks and Bidirectional LSTM (BLSTM) networks are forms of DNN that incorporate multi-layered feature processing and learning.
The objective of this thesis is to improve Urdu OCR by using state-of-the-art machine learning techniques.
Firstly, segmentation is improved by proposing line and ligature segmentation algorithms. These algorithms achieve better accuracy by combining a thresholding method with a curved-line split algorithm and better allocation of dots/diacritics. Secondly, recognition of Urdu text is performed at the ligature and line levels using deep learning methods. Autoencoders are employed for ligature feature extraction in place of hand-crafted features. For ligature classification, softmax and SVM are employed at the output layer, achieving an accuracy of 98%. LSTM networks have been successfully employed for context-based Urdu sentence recognition in existing contributions. A Gated BLSTM (GBLSTM) network is introduced and evaluated on the UPTI datasets, yielding better results than other prevalent OCR techniques. The Gated BLSTM takes sentences labelled with ligatures as input and, with softmax at the output layer, recognizes sentences with 96% accuracy. All prevalent context-based sentence recognition contributions rely on character-based LSTMs.
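As a rough illustration of the threshold-based line segmentation idea mentioned above, the sketch below splits a binarized page into text lines using a horizontal projection profile. This is not the thesis's algorithm: it assumes straight, non-overlapping lines and omits the curved-line split and dot/diacritic allocation steps. The function name `segment_lines` and the `min_ink` threshold are hypothetical names chosen for this example.

```python
def segment_lines(image, min_ink=1):
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    image: list of rows, each a list of 0 (background) / 1 (ink) pixels.
    min_ink: hypothetical threshold; rows with fewer ink pixels are gaps.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # first inked row of a new line
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # a gap row closes the current line
            start = None
    if start is not None:                  # line that runs to the bottom edge
        lines.append((start, len(profile)))
    return lines


# Tiny synthetic page: two text lines separated by a blank row.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(segment_lines(page))  # [(1, 3), (4, 5)]
```

In Nastaleeq text, adjacent lines often overlap vertically, so a plain projection profile like this would merge them; that is presumably the case a curved-line split is meant to handle.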
Keywords/Search Tags: Urdu text segmentation, Nastaleeq script segmentation, line and ligature segmentation, Urdu Nastaleeq ligature recognition, offline printed ligature recognition, Arabic script, Denoising Autoencoder, Deep learning Network, Classification