| Documents are usually composed of structured regions such as paragraphs, titles, graphics, tables, and lists, and they carry rich semantic information and structural features. Document layout analysis aims to extract and recognize these structured regions, a critical and challenging step for many downstream document understanding tasks. In recent years, with the rapid development of deep learning and convolutional neural networks, computer vision-based methods have been widely applied in many fields. Although vision-based single-modal detection methods have achieved some success in document layout analysis, two problems remain: the semantic information in documents is underused, and the contextual position and structural features of document components are difficult to capture. Both affect the recognition and localization of document components. To address these two problems, this paper proposes a text feature extractor module to model document semantic features, integrates visual and semantic features, and uses an attention mechanism at the output stage of the network to learn the structural features between document components, thereby exploiting the structural information in the document to improve recognition and localization accuracy. Additionally, this paper designs and implements prototypes for three document layout analysis scenarios. The main optimization and improvement work consists of the following three points. First, the paper proposes a method for modeling document semantic features: based on supervised cross-modal representation learning, a text feature extractor is built and trained to learn semantic feature representations of text information from document images. A dual-stream convolutional network is implemented to fuse visual and semantic features effectively, enhancing the network's feature extraction capability. Second, this paper uses the attention mechanism to model structural features between document components. This is achieved by designing an attention-based classification and regression network structure, which allows the network to learn structural feature representations of documents and refines the layout analysis results. Third, based on the above results, this paper conducts ablation experiments on PubLayNet, Article Regions, and the benchmark dataset. The results show that, compared with the baseline method, the proposed improvements raise class-average accuracy to varying degrees. Finally, based on the network model implemented in this paper and combined with application scenarios, the paper designs and implements a multi-modal instructional layout detection system covering three scenarios: general English document layout detection, exercise book layout detection, and examination answer card layout detection. The prototype system puts the research work of this paper into practice, provides visualization and editing functions for layout detection results, and is simple and user-friendly. |
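
The two ideas summarized above, fusing visual and semantic features and then applying attention across document components, can be illustrated with a minimal sketch. This is not the paper's network: the function names (`fuse`, `self_attention`), the toy feature values, fusion by plain concatenation, and the use of identity projections in place of learned query/key/value weights are all assumptions made only for illustration.

```python
import math


def fuse(visual, semantic):
    """Concatenate each component's visual and semantic feature vectors.

    Assumption: fusion by concatenation; the paper's dual-stream network
    may combine the two streams differently.
    """
    return [v + s for v, s in zip(visual, semantic)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention across components.

    Assumption: queries, keys, and values are the input features
    themselves (identity projections), not learned projections.
    """
    dim = len(features[0])
    contextual = []
    for query in features:
        # Similarity of this component to every component, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        weights = softmax(scores)
        # Weighted average of all component features.
        contextual.append(
            [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
        )
    return contextual


# Hypothetical toy features for three detected components.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic = [[0.5], [0.2], [0.9]]

fused = fuse(visual, semantic)
context = self_attention(fused)
```

Each row of `context` is a weighted average of all fused component features, so a component's representation now reflects the other components on the page. A classification and regression head would then consume these context-aware features.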