| Document image layout analysis plays a crucial role in recognizing information in document images, and detecting high-level semantic objects is central to downstream tasks such as intelligent document editing and understanding. The task is challenging because document image categories are diverse, page layouts are complex, and target object sizes are unevenly distributed. Existing detection algorithms rarely consider multimodal information or global dependencies. This paper therefore proposes document image layout analysis methods based on visual and textual features. Integrating textual features into visual features exploits the complementary information between the two modalities, compensates for the shortcomings of single-modality features, and enriches the feature representation. The fused multimodal features are then processed so that feature information at different levels propagates across channels and the features of target regions are enhanced. To make full use of visual and textual information, this paper proposes two methods. (1) To fuse single-granularity text features with visual features, this paper proposes a visual and textual single-granularity document image layout analysis method. The method consists of three parts: a feature extraction module, a feature fusion module, and a feature enhancement module. First, to use single-granularity textual features at the row level, the text sequence information in the image is converted into a two-dimensional representation. Second, the text and visual features are fed into the backbone network to extract multi-scale features, and the text features are fused repeatedly during extraction to achieve deep fusion of the visual and text features. Finally, the fused multimodal features are fed into the feature enhancement module, which propagates high-level semantic information and low-level feature information, thereby enriching the fused multimodal features. Comparative experiments show that this method effectively improves layout analysis performance, achieving 95.86%, 96.31%, and 96.06% on three public datasets, respectively. Ablation experiments show that each module contributes positively to overall network performance and confirm the effectiveness of the visual and textual feature fusion strategy. (2) To use multi-granularity text features, this paper proposes a visual and textual multi-granularity document image layout analysis method. The method consists of four parts: a visual feature extraction module, which extracts multi-scale features from the image; a multi-granularity text embedding module, which applies channel and spatial attention to text features at different levels to capture salient text feature information; a feature fusion module, which uses a feature pyramid network to fuse multi-granularity text features with visual features; and a feature enhancement module, which propagates feature information across channels, thereby enriching the feature representation. In addition, to process multi-granularity text information, a hierarchical version of the PubLayNet dataset is constructed, which includes inserting semi-structured elements and extending the annotations. Comparative and ablation experiments show that adding multi-granularity text features and adopting appropriate feature fusion and enhancement strategies effectively improves the overall performance of document image layout analysis. Overall, this paper studies document image layout analysis based on visual and textual features, aiming to integrate high-dimensional visual and textual features, exploit the complementary information between modalities, enrich the feature representation, and improve the accuracy of document image layout analysis. |
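
The abstract states that row-level text sequence information is converted into a two-dimensional representation before fusion with visual features, but it does not say how. The sketch below shows one plausible way to do this and is not the authors' implementation. Every name in it is hypothetical: `embed_line` is a toy hashed character histogram standing in for a real sentence encoder, `text_to_grid` paints each line's vector over the grid cells its bounding box covers, and the `stride` parameter and the box format `(x0, y0, x1, y1)` are assumptions.

```python
from typing import List, Tuple

# Assumed box format: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def embed_line(text: str, dim: int = 4) -> List[float]:
    """Toy stand-in for a sentence encoder: a normalized hashed character histogram."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    total = sum(vec) or 1.0  # avoid division by zero for empty strings
    return [v / total for v in vec]


def text_to_grid(
    lines: List[Tuple[str, Box]],
    height: int,
    width: int,
    stride: int = 8,
    dim: int = 4,
) -> List[List[List[float]]]:
    """Build a (height/stride) x (width/stride) x dim grid of text features.

    Each grid cell overlapped by a text line's box receives that line's
    embedding; cells with no text stay at zero. Later lines overwrite
    earlier ones where boxes overlap.
    """
    grid_h = (height + stride - 1) // stride
    grid_w = (width + stride - 1) // stride
    grid = [[[0.0] * dim for _ in range(grid_w)] for _ in range(grid_h)]
    for text, (x0, y0, x1, y1) in lines:
        vec = embed_line(text, dim)
        row_start = max(0, y0 // stride)
        row_end = min(grid_h, (y1 + stride - 1) // stride)
        col_start = max(0, x0 // stride)
        col_end = min(grid_w, (x1 + stride - 1) // stride)
        for gy in range(row_start, row_end):
            for gx in range(col_start, col_end):
                grid[gy][gx] = list(vec)
    return grid
```

Because the grid is spatially aligned with the page, a map like this could in principle be combined with a visual feature map of the same stride, for example by channel-wise concatenation or addition. Which fusion operator the paper actually uses is not stated in the abstract.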