In recent years, with the rapid development of Internet and mobile technologies, the amount of information stored on computers has grown exponentially. As a carrier of information, scene text images are also growing rapidly in number. Information in natural scenes is captured and stored by computers as two-dimensional RGB images. Automatic recognition of the text in such images has a wide range of applications, such as autonomous driving, bill recognition, and human-computer interaction.

In recent years, the models that have achieved the best results have mostly combined vision and semantics. These methods typically first use a feature extractor to obtain visual features from the two-dimensional image, then apply a semantics model (also known as a language model) to encode the resulting feature maps into semantic features, and finally combine the visual and semantic features to produce the recognition result. In this design, the semantics model is highly dependent on the visual features. Coupling semantic features to visual features in this way has two disadvantages. First, the semantics model degenerates into a corrector for the vision model: it is used only to revise the results produced by the vision model. Second, although such semantic correction can greatly improve accuracy on valid text, natural scene applications contain a wide range of genuinely incorrect text, for example in handwritten text recognition and marking. When the model recognizes such text and silently "corrects" it, it departs from the purpose of the task. Moreover, because the vision and semantics models are coupled in series, the overall model is bloated and difficult to train.

To solve the above problems, a novel Semantics Independence Network is proposed in this thesis, which separates the semantics module from the vision model and makes it an equal, parallel component, so that the vision model focuses on two-dimensional visual features and the semantics model focuses on one-dimensional semantic features. In addition, a vision-semantics fusion module is proposed to let the visual and semantic features interact fully. Through these two changes, the semantics module can process semantic information independently, the vision and semantics modules are fully decoupled, and the features of both parts are fully exploited.

In addition, a pruning method for analyzing the parameter redundancy of scene text recognition models is proposed for the first time. It provides a way to check whether a module should be used when designing a scene text recognition network: after the trained model is pruned module by module, the number of parameters is effectively reduced and the function of each module can be verified. A redundant-parameter pruning method is proposed, and a layer-aware pruning rate setting is introduced. By applying this post-training pruning to the proposed semantics-independent scene text recognition method, the validity of the proposed semantics and fusion modules and the parameter redundancy of the Transformer network are analyzed.

The main contributions of this thesis are summarized as follows:
(1) A text recognition network based on semantics independence is proposed. Unlike previous models, which decouple the vision and semantics models by truncating gradients, it adjusts the model structure to achieve complete structural decoupling.
(2) A new fusion module for visual and semantic features is designed, which lets the two kinds of features interact fully and makes full use of both.
(3) A new pruning method for redundant parameters is designed and applied to the proposed text recognition network. It prunes each module of the network and analyzes the redundancy of each module.
(4) A layer-aware pruning rate setting is introduced into the above pruning method. By taking the differences between layers into account, different layers are pruned to different degrees.
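The vision-semantics fusion module is described above only at a high level. As a minimal sketch of one common way such a fusion could work, the snippet below uses a learned gate to mix aligned visual and semantic feature vectors; the scalar gate parameters `w_v`, `w_s`, and `b` are hypothetical stand-ins for learned weights, and this is not necessarily the exact formulation used in the thesis.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(visual, semantic, w_v, w_s, b):
    """Fuse two aligned feature vectors with an element-wise learned gate.

    For each position i:
        gate_i  = sigmoid(w_v * visual_i + w_s * semantic_i + b)
        fused_i = gate_i * visual_i + (1 - gate_i) * semantic_i
    so the gate decides, per element, how much each branch contributes.
    """
    fused = []
    for v, s in zip(visual, semantic):
        g = sigmoid(w_v * v + w_s * s + b)
        fused.append(g * v + (1.0 - g) * s)
    return fused
```

With all gate parameters at zero the gate is 0.5 everywhere, so the output is simply the average of the two branches; during training the parameters would move away from this neutral point so that each feature dimension leans on whichever branch is more reliable.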
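The layer-aware pruning rate setting can be illustrated with simple magnitude pruning, where each layer zeroes out its smallest-magnitude weights at its own rate instead of one global rate. This is only a sketch under that assumption; the layer names and rates below are illustrative and the thesis's actual pruning criterion may differ.

```python
def prune_layer(weights, rate):
    """Zero out the smallest-magnitude fraction `rate` of a layer's weights.

    Weights tied at the cutoff magnitude are all pruned.
    """
    k = int(len(weights) * rate)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

def layer_aware_prune(model, rates):
    """Prune each named layer with its own rate (default 0.0: keep all)."""
    return {name: prune_layer(w, rates.get(name, 0.0))
            for name, w in model.items()}

# Illustrative toy model: prune half of the "encoder" weights, none of the rest.
model = {"encoder": [0.1, -0.5, 0.9, 0.2], "decoder": [1.0, -0.05]}
pruned = layer_aware_prune(model, {"encoder": 0.5})
```

Because the rate is looked up per layer, layers found to be more redundant can be cut aggressively while sensitive layers are left nearly intact, which is the motivation for a layer-aware setting over a single global rate.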