The mass digitization of textual resources, such as books, newspaper articles, and cultural archives, has been underway for decades, making these valuable resources publicly available for research. Optical character recognition (OCR) is among the most widely used techniques for converting printed documents into machine-readable formats. It provides a practical means of exploring large document collections with automated tools such as text indexing, search, and machine translation. Although OCR engines perform well on modern text, their performance degrades significantly on historical materials. Furthermore, a significant portion of the text has been processed using outdated digital technologies. Deep learning has made remarkable progress in OCR in recent years, yet OCR applications still often produce recognition errors. Misrecognition not only makes the text difficult to read and understand but also diminishes its informational value. In certain fields, such as finance, it can have significant financial implications. As a result, reducing the error rate of OCR has become a major concern for both academia and industry.

Existing OCR text correction solutions face two major challenges. First, they primarily focus on correcting OCR errors in pure-text images. For non-structured document images such as invoices or bank statements, they struggle to use the semantic information of the documents themselves for error correction. Second, publicly available OCR correction datasets are scarce, and those that exist contain a limited number of samples, which poses significant challenges for model training.

To address the first issue, this thesis proposes an OCR text error correction method based on LayoutLM, a pre-trained document
understanding model. The method uses a multimodal encoder with a spatial-aware self-attention mechanism, which enables the model to deeply integrate text, visual, and layout information from document images and yields a fine-grained understanding of the documents. Regarding the second issue, inspired by the task of grammatical error correction, this thesis designs a Bidirectional Inference Network with a Critic (Break-It-Fix-It, BIFI) architecture. Despite limited data availability, the BIFI architecture has achieved impressive results in grammatical error correction. Based on LayoutLMv2, a pre-trained model for multimodal document understanding developed by a Microsoft team, this thesis designs layoutlm-critic, a discriminator that evaluates the alignment between OCR text, images, and bounding boxes. The BIFI architecture is trained on top of layoutlm-critic and achieves good results on both the SROIE and CORD datasets: the F0.5 score improves by 9% in the unsupervised setting and by 12% in the supervised setting.
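The F0.5 score used to report these results weights precision more heavily than recall, so a spurious correction costs more than a missed one. The following is a minimal sketch assuming the standard F-beta definition computed from counts of correct, spurious, and missed corrections. The function name `f_beta` and the example counts are illustrative assumptions and are not taken from the thesis or its evaluation scripts.

```python
def f_beta(true_pos: int, false_pos: int, false_neg: int, beta: float = 0.5) -> float:
    """Standard F-beta score from correction counts; beta < 1 favours precision."""
    if true_pos == 0:
        # No correct corrections: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 8 correct corrections, 2 spurious, 8 missed,
# giving precision 0.8 and recall 0.5.
score = f_beta(true_pos=8, false_pos=2, false_neg=8)
f1 = f_beta(true_pos=8, false_pos=2, false_neg=8, beta=1.0)
print(f"F0.5 = {score:.4f}, F1 = {f1:.4f}")
```

With these hypothetical counts, the F0.5 value is higher than the F1 value because precision exceeds recall, which illustrates why the metric suits correction tasks where introducing new errors is worse than leaving some uncorrected.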