With the arrival of the big-data era, test paper document images have enriched the diversity of educational resources and are widely used in scenarios such as test paper composition by teachers and the construction of question banks. At present, information is extracted from test paper document images mainly by manual entry, which consumes considerable human resources. To save resources, it is necessary to recognize test paper document images and extract their information automatically. In recent years, deep learning methods have brought great progress to research on document image recognition. However, because of structural characteristics such as the complex layout of test papers and their variable document components, converting test paper document images into documents with high precision remains a major challenge. This paper studies three deep learning tasks applied to test paper document images: document component detection, OCR, and NLP. It proposes a method that stacks multiple models and outputs results layer by layer to convert test paper document images into test paper documents. The main work of the paper is as follows:

(1) A layout dismantling strategy based on PaddleOCR is proposed, which simplifies data processing for document image recognition. A total of 3550 test paper document images were collected from the Internet, and a semi-automatic labeling and training method is proposed. Compared with manual labeling, this method increases the labeling rate by 195% and shortens the time spent on labeling by 86.47%. On this basis, the performance of various lightweight object detection models applied to document component detection is analyzed. When LCNet replaces the backbone network of PicoDet-S-ESNet, on small-sample data the model training speed increases by 150% and mAP increases by 3.9%.

(2) Document component disassembly and OCR are studied, and a column component disassembly algorithm based on PaddleOCR is proposed. In addition, a text area limitation algorithm is proposed to adapt PaddleOCR to test paper document images. The PaddleOCR output forms a test paper text dataset, which fills the gap in test paper text data and provides a data basis for subsequent semantic analysis.

(3) An NLP model is applied to analyze the semantics of the test paper text dataset. A text classification algorithm based on the first-line label is proposed, the "first line" of a question text is defined, and 10,000 text samples in the test paper text dataset are manually labeled. Experiments show that the DBCNN algorithm, applied to test paper text, achieves an accuracy of 98.70% in locating the first line of a question, so test questions are located accurately and effectively. Finally, a hybrid model structure is designed that integrates multi-modal algorithms to realize a test paper document image recognition system with a typesetting recognition function.
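
To illustrate how a first-line label can be used to locate questions, the following is a minimal Python sketch, not the paper's implementation. It assumes the OCR stage yields text lines in reading order. The names `is_first_line` and `group_questions` are hypothetical, and the regular expression is only a simple stand-in for the trained DBCNN classifier described above.

```python
import re
from typing import Callable, List

# Assumption: a regex stands in for the trained first-line classifier.
# It treats lines that begin with numbering such as "12." or "(3)" as
# the first line of a question.
_FIRST_LINE = re.compile(r"^\s*(\d+\s*[.、)]|\(\d+\))")


def is_first_line(line: str) -> bool:
    """Hypothetical first-line predicate (placeholder for a learned model)."""
    return bool(_FIRST_LINE.match(line))


def group_questions(
    lines: List[str],
    classify: Callable[[str], bool] = is_first_line,
) -> List[List[str]]:
    """Split OCR text lines into groups, starting a new group at each first line.

    Lines that appear before the first detected question (for example a
    page header) are collected into their own leading group.
    """
    groups: List[List[str]] = []
    for line in lines:
        if classify(line) or not groups:
            groups.append([line])      # start a new group
        else:
            groups[-1].append(line)    # continuation of the current group
    return groups


if __name__ == "__main__":
    ocr_lines = [
        "Midterm Exam",
        "1. What is 2+2?",
        "A. 3  B. 4",
        "2. Name a prime.",
        "A. 4  B. 5",
    ]
    for group in group_questions(ocr_lines):
        print(group)
```

Passing the classifier as a parameter keeps the grouping logic independent of the model, so a learned predicate could be substituted for the regex without changing `group_questions`.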