Text in natural scenes is a common source of information and plays an important role in helping computer vision systems understand scenes. Using scene text detection and recognition to obtain text from images and videos has become a research hotspot in computer vision and document analysis, and many results have been widely applied in autonomous driving, scene parsing, image retrieval, and other fields. However, many state-of-the-art text detection and recognition methods are optimized for publicly available datasets in which Chinese text instances are sparse and limited. In dense Chinese text scenes, the accuracy and latency of deep-learning-based text detection and recognition methods still need to be improved. Based on a Chinese text detection dataset of reading scenes and a synthetic Chinese text recognition dataset, this thesis optimizes the performance of text detection and recognition algorithms in dense text scenes. The specific work is as follows:

(1) Existing text detection methods are weak at extracting features of dense text objects. To address this, this thesis proposes a text detection method based on multi-scale features. It designs a multi-scale feature fusion module and a grouping spatial attention feature enhancement module, which reduce information loss during feature sampling, enhance important multi-scale features, and suppress noise. The method makes full use of both high-level semantic features and low-level detail features, thereby strengthening the model’s feature representation ability and effectively improving detection accuracy on dense text objects.

(2) Because the multi-scale feature based text detection method has poor real-time performance, this thesis proposes model compression methods based on lightweight structure design and structured pruning. First, a lightweight text detection model is obtained through structural design, manually reducing the model’s parameter count and computational complexity. Then, a channel attention module is proposed to compute the channel weights of the feature map, which serve as the criterion for pruning convolution kernels. Finally, the lightweight text detection model is compressed by structured pruning, yielding a scene text detector that balances accuracy and speed.

(3) In the typical text recognition framework, sequence modeling operations may lose high-level semantic structure information. To solve this problem, this thesis proposes a text recognition model based on shifted-window multi-head self-attention. The model removes the convolutional neural network of the classic framework and uses a Transformer both to extract spatial features and to perform sequence modeling. Finally, a CTC decoder aligns the output prediction sequence. The model simplifies the typical text recognition framework and avoids the information loss caused by converting feature maps into feature sequences.
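
To illustrate the CTC alignment step mentioned in (3), the following is a minimal sketch of greedy (best-path) CTC decoding. It is not the thesis's implementation: the function name `ctc_greedy_decode`, the `charset` list, and the convention that index 0 is the blank label are all assumptions made for this example.

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, charset, blank=0):
    """Greedy CTC decoding (illustrative sketch).

    frame_scores: list of per-frame score lists, one score per class.
    charset: list mapping class index to character; charset[blank] is
             a placeholder for the blank label (assumed to be index 0).
    """
    # Step 1: take the highest-scoring class for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Step 2: merge consecutive repeats of the same class.
    collapsed = [idx for idx, _ in groupby(best_path)]
    # Step 3: drop blanks; a blank between two equal labels keeps both.
    return "".join(charset[idx] for idx in collapsed if idx != blank)
```

With a hypothetical three-class charset `["-", "a", "b"]`, a frame-wise best path of `a a - a b b -` decodes to `aab`: the repeated frames merge, and the blank between the second and third frame keeps the two `a` characters distinct.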