| With the development of deep learning, text detection technology has made significant progress, has been widely applied in various fields, and has become a current research hotspot. However, because natural scene images have complex backgrounds and diverse fonts, deep learning-based scene text detection algorithms often face the following problems: (1) poor detection performance for closely connected text instances; (2) inadequate feature extraction ability when lightweight backbone networks are used; (3) a trade-off between accuracy and speed, in which more accurate algorithms tend to sacrifice detection speed. To address these issues, this paper proposes a scene text detection algorithm based on non-local attention and feature enhancement. The specific research content is as follows: (1) To solve the problem of poor detection performance for closely connected text instances, this paper builds on the differentiable binarization of DBNet and uses the lightweight ResNet-18 as the backbone network. Furthermore, a Global Context Network (GCNet) is incorporated into the feature extraction structure to expand the model's receptive field, which not only captures contextual information in the region but also reduces computational complexity, preserving the portability of the network. (2) To solve the problem of inadequate feature extraction ability when lightweight backbone networks are used, this paper replaces the original feature pyramid structure with a feature pyramid enhancement module and a feature pyramid fusion module. The feature pyramid enhancement module not only propagates high-level semantic features from top to bottom, enhancing the semantic information of the entire feature pyramid, but also propagates position information from bottom to top, allowing better localization of small targets in the image. The feature fusion module integrates feature information from different levels to improve feature representation, enabling the model to better distinguish between 
different samples. Additionally, the regular convolutions in the feature pyramid enhancement module are replaced with depthwise separable convolutions to reduce network complexity while maintaining the accuracy of the model. (3) In scene text detection, text regions occupy a small proportion of the image and negative samples dominate, so assigning equal weights to all classes in the binary cross-entropy loss leads to low training efficiency and fails to achieve the expected optimization effect. To solve this problem, this paper replaces the binary cross-entropy loss with Focal Loss. Focal Loss not only adjusts the weights of positive and negative samples but also dynamically reduces the weights of easily classified samples through a modulating factor during training, so that training quickly focuses on hard samples, improving the training efficiency and accuracy of the model. |
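
The role of Focal Loss in point (3) can be illustrated with a minimal sketch of the standard binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). This is not the paper's implementation: the function names, the plain-Python formulation, and the values alpha = 0.25 and gamma = 2.0 are assumptions chosen for illustration, since the abstract does not state which settings were used.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)      # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p       # probability assigned to the true class
        total += -math.log(p_t)
    return total / len(probs)


def binary_focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss (alpha and gamma are assumed, illustrative values)."""
    if len(probs) != len(targets):
        raise ValueError("probs and targets must have the same length")
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        p_t = p if y == 1 else 1.0 - p
        # alpha_t rebalances positive (text) against negative (background) samples.
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t) ** gamma is the modulating factor: it shrinks toward 0 for
        # well-classified samples, so hard samples dominate the total loss.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


if __name__ == "__main__":
    easy = binary_focal_loss([0.9], [1])   # confident, correct prediction
    hard = binary_focal_loss([0.3], [1])   # poorly classified text pixel
    print(f"easy sample loss: {easy:.6f}")
    print(f"hard sample loss: {hard:.6f}")
```

With gamma set to 0 the modulating factor disappears and the loss reduces to an alpha-weighted binary cross-entropy, which makes the contrast with the loss being replaced easy to check.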