Utilizing big data in identification and correction of OCR errors

Posted on: 2014-02-13
Degree: M.S.C.S
Type: Thesis
University: University of Nevada, Las Vegas
Candidate: Agarwal, Shivam
Full Text: PDF
GTID: 2450390008461015
Subject: Computer Science
Abstract/Summary:
In this thesis, we report on experiments in detecting and correcting OCR errors using web data. Specifically, we use Google search to draw on the big data resources available on the web and identify possible candidates for correction. We then combine Longest Common Subsequence (LCS) matching with Bayesian estimates to pick the proper candidate automatically. Our experimental results on a small set of historical newspaper data show a recall of 51% and a precision of 100%. The thesis also provides a detailed classification and analysis of all errors. In particular, we point out where our approach falls short in suggesting proper candidates to correct the remaining errors.
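
The abstract does not say how the LCS and Bayesian estimates are combined, so the following is only a minimal sketch under stated assumptions, not the thesis's actual method. It assumes each candidate comes with a hypothetical occurrence count (for example, how often it appeared in search results), uses the relative count as a prior, and uses the LCS length normalized by the longer string as a stand-in likelihood. The names `lcs_length`, `pick_candidate`, and `candidate_counts` are illustrative.

```python
from typing import Dict


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def pick_candidate(ocr_token: str, candidate_counts: Dict[str, int]) -> str:
    """Return the candidate with the highest prior * likelihood score.

    Assumptions (not from the thesis):
      prior      = candidate count / total count
      likelihood = LCS length / length of the longer string
    """
    total = sum(candidate_counts.values())
    if total <= 0:
        raise ValueError("candidate_counts must contain positive counts")

    best_candidate = ""
    best_score = -1.0
    for candidate, count in candidate_counts.items():
        prior = count / total
        longer = max(len(ocr_token), len(candidate))
        likelihood = lcs_length(ocr_token, candidate) / longer if longer else 0.0
        score = prior * likelihood
        if score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(pick_candidate("tbe", {"the": 900, "tube": 60, "tie": 40}))
```

With these made-up counts, "tube" shares the longest subsequence with "tbe", but the much larger prior of "the" outweighs it. That is the kind of trade-off a Bayesian weighting is meant to capture.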
Keywords/Search Tags: Errors, Data, Correction