OCR Error Post-correction Based On Chinese Character-level Features And Language Model

Posted on: 2022-06-06
Degree: Master
Type: Thesis
Country: China
Candidate: S L Liu
Full Text: PDF
GTID: 2518306551953889
Subject: Master of Engineering
Abstract/Summary:
OCR technology based on deep neural networks (DNNs) now achieves high accuracy on standardized data sets. However, when it is applied to real-world scenarios, image distortion, rotation, and poor image quality can cause a DNN-based OCR system to fail to work correctly. To cope with these problems, many OCR error post-correction techniques have emerged. Most of this work, however, deals with English or similar languages that have a small number of basic characters. Because the English character set is small and its characters are usually dissimilar to one another, error correction is relatively easy. Post-correcting OCR errors is much harder for Asian languages with large character sets, such as Chinese and Japanese. For Chinese in particular, the 21003 basic characters in GBK lead to a large candidate set of similar characters when correcting an error. Even the 3755 level-1 commonly used characters in GB2312 far outnumber the 52 basic characters of English.

Furthermore, most current OCR error post-correction work treats characters as the basic units. Language models consider only the associations between characters and do not use the information inside the characters. For Chinese OCR error post-correction, approaches that rely only on language models can therefore still be improved by exploiting the complex stroke and layout information in Chinese characters.

To address these problems, this thesis models the stroke structure of Chinese characters and proposes JSWE (Joint-Structure-Word-Embedding), a method for generating an error correction candidate set based on Chinese character stroke structure, and shows empirically that stroke structure is useful for error correction. We also propose an error correction candidate generation method based on the Hamming distance between Chinese character bitmaps, to address the poor quality of the candidates generated by a language model when context information is unavailable.

The main contributions are as follows:

1) We train an OCR model based on CTPN+CTC, perform data augmentation with this model, and generate an augmented data set with various OCR error styles for subsequent experiments.
2) We propose an error correction candidate generation method based on Chinese character stroke structure (JSWE). We first introduce a word vector training method based on stroke structure and use it to generate stroke structure-aware word vectors. We then compute the similarity between the embedding of the erroneous character and the embeddings of the candidates to generate the error correction candidate set.
3) We propose a method based on the Hamming distance between Chinese character bitmaps, using that distance as a character similarity measure, to address the poor performance of the BERT language model when context information is unavailable.

To validate the proposed methods, extensive experiments and analyses are carried out on public Chinese data sets and on a batch of augmented data containing OCR errors. The experimental results show that the JSWE method effectively improves the quality of the generated error correction candidate set, and that the Hamming distance-based candidate generation method is also effective when context information is unavailable. Finally, experiments on similarity and semantic tasks are carried out, validating the generalizability of the word vectors generated by JSWE in general tasks, ...
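The idea of using the Hamming distance between character bitmaps as a similarity measure can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the 5x5 grids below are hand-drawn toy stand-ins for rendered glyphs, and the names `GLYPHS`, `hamming_distance`, and `rank_candidates` are hypothetical. A real system would presumably rasterize each character from a font at a fixed size before comparing.

```python
from typing import Dict, List, Tuple

# A bitmap is a tuple of equal-length rows made of '0' (blank) and '1' (ink).
Bitmap = Tuple[str, ...]

# Assumption: toy 5x5 approximations of four characters, drawn by hand
# for illustration only. They are not rendered from any real font.
GLYPHS: Dict[str, Bitmap] = {
    "土": ("00100", "01110", "00100", "00100", "11111"),
    "士": ("00100", "11111", "00100", "00100", "01110"),
    "工": ("11111", "00100", "00100", "00100", "11111"),
    "口": ("11111", "10001", "10001", "10001", "11111"),
}


def hamming_distance(a: Bitmap, b: Bitmap) -> int:
    """Count the pixel positions at which two same-shaped bitmaps differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same shape")
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def rank_candidates(
    target: str, glyphs: Dict[str, Bitmap], k: int = 3
) -> List[Tuple[str, int]]:
    """Return the k characters whose bitmaps are closest to the target's.

    A smaller Hamming distance means the glyphs look more alike, so the
    character is a more plausible correction candidate for an OCR error.
    """
    scored = [
        (ch, hamming_distance(glyphs[target], bitmap))
        for ch, bitmap in glyphs.items()
        if ch != target
    ]
    return sorted(scored, key=lambda item: item[1])[:k]


if __name__ == "__main__":
    # Candidates for a character recognized as 土, nearest first.
    print(rank_candidates("土", GLYPHS, k=2))
```

With these toy glyphs, 士 differs from 土 in 4 pixels and 工 in 6, so 士 ranks first. Because the ranking uses only glyph shape, it needs no surrounding text, which is the situation the abstract describes for the case where context information is unavailable.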
Keywords/Search Tags:Chinese Text, Chinese Character Error Correction, Word Vector, Chinese Character Similarity, Pre-trained Language Model