Research on Scene Text Recognition Based on Encoder-Decoder Architecture

Posted on: 2022-11-21 | Degree: Master | Type: Thesis
Country: China | Candidate: X C Du | Full Text: PDF
GTID: 2518306752454134 | Subject: Computer technology
Abstract/Summary:
With the development of big data and deep learning, text image recognition has come to have an important impact on people's daily lives. This paper focuses on the scene text recognition (STR) task and improves an encoder-decoder text image recognition model. Specifically, in the encoding stage it adopts an attention-based feature extraction module to extract visual features and a temporal convolutional network to model the feature sequence; a multi-level feature aggregation mechanism aggregates information from different levels; and in the decoding stage a heuristic local attention mechanism is adopted to decode the character sequence. Experiments show that the proposed model achieves superior performance.
Firstly, the visual features of text images play an essential role in STR. Therefore, this paper extracts visual features with a feature extraction module based on channel and spatial attention. The channel attention module and the spatial attention module enhance the features at the channel level and the spatial level, respectively. Extensive evaluations show that this module obtains more robust features, which helps improve the performance of the model.
Secondly, this paper adopts a Temporal Convolutional Network (TCN) to model the feature sequence. Compared with an RNN, a TCN can not only process sequence features in parallel but also mitigate vanishing and exploding gradients through its residual structure. The parameters of the TCN are shared within each layer, and the information of each time step does not need to be stored. More importantly, the TCN has a more flexible receptive field: the number of layers, the convolution kernel size, and the dilation factor can be chosen to suit different scenarios.
Thirdly, a multi-level aggregation mechanism is proposed to extend the standard encoder-decoder architecture by capturing visual features at different levels. The standard architecture uses only the deepest visual features for sequence modeling, which leads to degenerate feature vectors because of the ever-expanding receptive field. Therefore, the multi-level aggregation mechanism proposed in this paper aggregates the visual features of different layers to improve the performance of the model.
Finally, a decoder based on a heuristic local attention mechanism is applied to decode the character sequence. For scene text recognition, it is important to obtain the features most relevant to the character at the current time step. Therefore, this paper explores a variety of existing local attention methods and provides complete comparison results. In addition, inspired by existing local attention mechanisms, this paper introduces two heuristic-based local attention mechanisms. Extensive experiments show that the heuristic-based monotonic local attention mechanism achieves the best results.
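The abstract's point about the TCN's receptive field depending on the number of layers, the kernel size and the dilation (expansion) factor can be made concrete with a small sketch. The code below is not the thesis implementation. It assumes one causal (left-padded) dilated convolution per residual block, a ReLU activation, and equal input and output channel counts. The function names `receptive_field`, `dilated_causal_conv1d` and `tcn_block` are hypothetical, chosen only for illustration.

```python
import numpy as np


def receptive_field(kernel_size, dilations):
    """Receptive field, in time steps, of stacked dilated 1-D convolutions.

    Assumes one convolution per layer; each layer adds (kernel_size - 1) * dilation.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def dilated_causal_conv1d(x, w, dilation):
    """Causal dilated convolution over a sequence x of shape (T, C_in).

    w has shape (K, C_in, C_out). The input is zero-padded on the left only,
    so output step t depends on input steps <= t.
    """
    k, _, c_out = w.shape
    t = x.shape[0]
    pad = (k - 1) * dilation
    xp = np.pad(x, ((pad, 0), (0, 0)))
    out = np.zeros((t, c_out))
    for i in range(k):
        # Tap i reads the input shifted by (k - 1 - i) * dilation steps into the past.
        out += xp[i * dilation : i * dilation + t] @ w[i]
    return out


def tcn_block(x, w, dilation):
    """Residual block (illustrative): ReLU(conv(x)) + x, requires C_in == C_out."""
    return np.maximum(dilated_causal_conv1d(x, w, dilation), 0.0) + x


# Illustrative usage: 26 time steps, 8 channels, four layers with doubling dilation.
rng = np.random.default_rng(0)
features = rng.normal(size=(26, 8))
dilations = [1, 2, 4, 8]
y = features
for d in dilations:
    weights = rng.normal(size=(3, 8, 8)) * 0.1  # separate weights per layer
    y = tcn_block(y, weights, d)
print(y.shape, receptive_field(3, dilations))
```

All time steps of a layer are computed from the same weight tensor in one pass, which is what allows parallel processing, and the identity shortcut in `tcn_block` is the residual path the abstract credits with easing gradient problems. Doubling the dilation per layer grows the receptive field roughly exponentially with depth while the parameter count grows only linearly.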
Keywords/Search Tags:Scene Text Recognition, Encoder-Decoder, Channel-Spatial Attention, Temporal Convolution, Feature Aggregation, Heuristic Local Mechanism