Research In Chinese Text Proofreading Based On OCR

Posted on:2012-02-14

Degree:Master

Type:Thesis

Country:China

Candidate:Z Y Huan

Full Text:PDF

GTID:2178330332992385

Subject:Computer application technology

Abstract/Summary:

Optical character recognition (OCR) is a process in which electronic devices such as scanners or digital cameras determine the shapes of the characters in a paper document by detecting patterns of light and dark, and then translate those shapes into computer text through character recognition. OCR has become one of the most important methods of converting paper documents into digital documents. Because current OCR technology cannot guarantee that the content of the resulting digital document is completely correct, the recognized document needs to be checked and proofread.

This subject comes from the "Eleventh Five-Year" National Key Project Support Platform, "The development of Reading Aids for the Visually Impaired". The reading aids convert printed text to speech output through OCR and speech synthesis. The proofreading object of this research is therefore OCR recognition errors.

In this research, we count and analyze the characteristics of OCR recognition errors and re-classify the errors. We then study the existing Chinese text proofreading algorithms and put forward an algorithm named "An Improved Automatic Chinese Text Error Correction Approach Based on Window Technology". The improved algorithm takes full account of the characteristics of OCR recognition errors and of the project's application platform.

Compared with the basic algorithm, the new algorithm makes improvements in the following respects. In the pre-processing stage of proofreading, we choose a better and more mature Chinese word segmentation system, the Institute of Computing Technology Chinese Lexical Analysis System (ICTCLAS). In the automatic error detection stage, we make full use of the "san string" to improve the efficiency of error detection. In the automatic error correction stage, we abandon the correction method of the basic algorithm, because it constructs its confusion sets from homophone characteristics, which do not match the characteristics of OCR recognition errors; instead, the improved algorithm provides correction suggestions by combining the original text under proofreading with a character-driven two-way dictionary.

We used C++ in the Visual C++ 6.0 environment to implement an automatic Chinese text proofreading system on the Windows platform. System tests show that the system with the improved algorithm achieves a better recall rate and a better precision rate, but its error-correction rate is poor: it frequently needs user interaction to obtain correction candidates from the user. After analyzing the test results, at the end of the paper we summarize the difficulties and problems encountered in the research, the experimental system design, and the writing of the paper. Finally, we look ahead to the future development of Chinese text proofreading...
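
The error detection stage described above can be illustrated with a minimal sketch. It rests on an interpretation that the abstract does not confirm: that a "san string" is a run of consecutive single-character tokens left behind by the word segmenter, which may hint that a misrecognized character has broken a word apart. The names `SuspectSpan`, `findScatteredRuns`, and `minRun` are hypothetical and are not taken from the thesis; the thesis's actual detection rules are not given in the abstract.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A half-open range [begin, end) of token indices flagged as suspicious.
struct SuspectSpan {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical helper (not from the thesis): scans segmenter output and
// returns every run of at least `minRun` consecutive single-character tokens.
// Tokens are UTF-32 strings so that size() counts characters, not bytes.
std::vector<SuspectSpan> findScatteredRuns(
        const std::vector<std::u32string>& tokens,
        std::size_t minRun = 2) {
    std::vector<SuspectSpan> spans;
    std::size_t runStart = 0;
    bool inRun = false;
    // Iterate one step past the end so a run that reaches the last token
    // is closed by the same branch as any other run.
    for (std::size_t i = 0; i <= tokens.size(); ++i) {
        const bool single = i < tokens.size() && tokens[i].size() == 1;
        if (single && !inRun) {
            runStart = i;
            inRun = true;
        } else if (!single && inRun) {
            if (i - runStart >= minRun) {
                spans.push_back({runStart, i});
            }
            inRun = false;
        }
    }
    return spans;
}
```

Only the flagged spans would then need to be passed to a correction step, which is one plausible reading of how restricting attention to such strings could improve detection efficiency. A naive rule like this also flags legitimate single-character words, so a real system would need further filtering.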

Keywords/Search Tags:

OCR, Chinese Text Automatic Proofreading, Recall Rate, Precision Rate, Error-correction Rate

Related items

1	The Realization Of Statistic Software Of Typed Error Rate In Chinese Text
2	Research On Feature Selection Algorithm And Classification Algorithm In Chinese Text Categorization
3	Natural Language Processing Of Chinese Text Automatic Proofreading
4	Bit error rate locked loops using log-likelihood error correction decoders
5	Research And Implementation Of Error Detection And Error Correction Efficiency Optimization Of Chinese Text
6	Research And Realization Of Non-word Error Automatic Proofreading System In Chinese Text
7	Research On Automatic Generation Technology Of Chinese Text Proofreading Corpora
8	Rate-Distortion Model And Optimization For Wireless Low-Bit-Rate Video Applications
9	The Design And Realization Of Low-rate Bit Error Tester
10	Fault-tolerant Error Correction For Video Encoding Transmission Optimization Method