Probabilistic methods for searching OCR-degraded Arabic text

Posted on: 2004-11-27
Degree: Ph.D.
Type: Dissertation
University: University of Maryland College Park
Candidate: Darwish, Kareem M
Full Text: PDF
GTID: 1468390011470855
Subject: Computer Science
Abstract/Summary:
Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an important problem. This dissertation addresses OCR-based retrieval of Arabic document images, with emphasis on probabilistic methods that improve retrieval effectiveness. Arabic's rich morphology (word construction) and complex orthography (writing system) present unique challenges for OCR and Information Retrieval (IR) systems.

New probabilistic structured query methods that leverage replacement probabilities were developed in this research to improve retrieval effectiveness on OCR-degraded text, and their generality was demonstrated in cross-language information retrieval. For OCR-degraded text retrieval, the probabilistic structured query methods were applied using the most effective index terms for OCR-degraded text, with replacement probabilities estimated from an OCR degradation model. Overlapping character n-grams, and combinations of character n-grams with terms obtained through morphological analysis, were found to be the most effective index terms for Arabic collections of varying sizes, genres, and degradation levels. For index terms requiring morphological analysis, existing automated Arabic morphological analysis techniques were adapted to make them more suitable for IR applications. Several OCR degradation models were developed to account for the complex Arabic orthography, and a set of tests was designed to verify that modeled and real OCR degradation have similar effects on IR. One of the models was also used to synthetically degrade a large document collection.

The techniques presented in this dissertation offer the potential to unlock access to printed documents in Arabic and in other languages with similar orthographic characteristics.
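
As a rough illustration of two ideas named in the abstract, the sketch below extracts overlapping character n-grams from a word and estimates a query term's frequency in a degraded document by weighting the counts of its possible OCR outputs with replacement probabilities. The abstract gives no formulas, so this is only one plausible reading and not the dissertation's actual formulation; the function names `char_ngrams` and `expected_tf`, the transliterated words, and the probability values are hypothetical toy choices.

```python
from collections import Counter


def char_ngrams(word, n):
    """Return the overlapping character n-grams of a word.

    Words no longer than n are returned whole, as a single-item list.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def expected_tf(doc_counts, replacements):
    """Probability-weighted term frequency for one clean query term.

    doc_counts   maps each token seen in the (degraded) document to its count.
    replacements maps each possible degraded form of the query term to an
                 assumed probability that OCR produced that form.
    """
    return sum(p * doc_counts.get(form, 0) for form, p in replacements.items())


# Toy example with transliterated text; the probabilities are made up.
trigrams = char_ngrams("kitab", 3)
doc = Counter(["kitab", "kitab", "kilab", "qalam"])
tf = expected_tf(doc, {"kitab": 0.75, "kilab": 0.25})
```

Here `trigrams` holds the three overlapping 3-grams of "kitab", and `tf` counts the misrecognized form "kilab" as a quarter of an occurrence of the query term instead of ignoring it.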
Keywords/Search Tags: Arabic, OCR, Retrieval, Probabilistic, Methods, Document, OCR-degraded, Text