Research In Chinese Text Proofreading Based On OCR

Posted on: 2012-02-14
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Huan
GTID: 2178330332992385
Subject: Computer application technology
Abstract/Summary:
Optical character recognition (OCR) is a process in which electronic equipment such as a scanner or digital camera determines the shapes of the characters in a paper document by detecting patterns of light and dark, and then converts those shapes into computer text through character recognition. OCR has become one of the most important methods of converting paper documents into digital documents. However, current OCR technology cannot guarantee that the content of the resulting digital document is completely correct, so the recognized document needs to be checked and proofread.

The subject comes from the "Eleventh Five-Year" National Key Project Support Platform - "The Development of Reading Aids for the Visually Impaired". The reading aids convert printed text to speech output through OCR and speech synthesis. The proofreading target of this research is therefore OCR recognition errors.

In this research, we collect statistics on the characteristics of OCR recognition errors, analyze them, and reclassify the errors. We then study current Chinese text proofreading algorithms and propose an algorithm named "An Improved Automatic Chinese Text Error Correction Approach Based on Window Technology". The improved algorithm takes full account of the characteristics of OCR recognition errors and of the application platform.
Compared with the basic algorithm, the new algorithm makes improvements in the following respects. In the pre-processing stage of proofreading, we choose a better and more mature Chinese word segmentation system, the Institute of Computing Technology Chinese Lexical Analysis System (ICTCLAS). In the automatic error detection stage, we make full use of the "san string" to improve the efficiency of error detection. In the automatic error correction stage, we abandon the correction method of the basic algorithm, because it constructs its confusion sets from homophones, which does not fit the characteristics of OCR recognition errors; instead, the improved algorithm provides correction suggestions by combining the original text under proofreading with a character-driven two-way dictionary.

Finally, we implemented an automatic Chinese text proofreading system for Windows in C++, using Visual C++ 6.0. System tests show that the system with the improved algorithm achieves a better recall rate and a better precision rate, but its error-correction rate is poor: it frequently requires user interaction to obtain candidate correction suggestions from the user.

After analyzing the test results, the final part of the thesis summarizes the difficulties and problems encountered during the research, the experimental system design, and the writing of the thesis, and looks ahead to the future development of Chinese text proofreading...
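The detection step and the two detection metrics named in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not confirm: that a "san string" is a run of consecutive single-character tokens left over after word segmentation, that such runs are what the detector flags, and that recall and precision are computed over token positions. The function names, the `min_run` parameter, and the sample tokens are hypothetical; this is not the thesis's implementation and does not call ICTCLAS.

```python
from typing import List, Set, Tuple


def find_single_char_runs(tokens: List[str], min_run: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges, end exclusive, of runs of
    single-character tokens that are at least min_run tokens long.

    Assumption: such runs stand in for the "san strings" of the abstract.
    """
    runs: List[Tuple[int, int]] = []
    start = None  # index where the current run began, if any
    for i, tok in enumerate(tokens):
        if len(tok) == 1:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    # Close a run that reaches the end of the token list.
    if start is not None and len(tokens) - start >= min_run:
        runs.append((start, len(tokens)))
    return runs


def detection_scores(flagged: Set[int], actual: Set[int]) -> Tuple[float, float]:
    """Return (recall, precision) over token positions.

    recall    = correctly flagged positions / positions that are truly wrong
    precision = correctly flagged positions / all flagged positions
    """
    hits = len(flagged & actual)
    recall = hits / len(actual) if actual else 0.0
    precision = hits / len(flagged) if flagged else 0.0
    return recall, precision


# Hypothetical segmenter output in which the word 研究 was misrecognized
# as 研宄, so the segmenter leaves two stray single characters.
tokens = ["我们", "研", "宄", "文本", "校对"]
runs = find_single_char_runs(tokens)
flagged = {i for start, end in runs for i in range(start, end)}
recall, precision = detection_scores(flagged, actual={2})
```

In this example the run covers token positions 1 and 2, while only position 2 holds a wrong character, so every true error is flagged but half of the flagged positions are false alarms. A real detector would need further evidence, such as word or character co-occurrence statistics, to decide which characters inside a run are actually wrong.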
Keywords/Search Tags: OCR, Chinese Text Automatic Proofreading, Recall Rate, Precision Rate, Error-correction Rate