| With the continuing modernization of education, the demand for large-scale education and personalized training has become increasingly urgent. To meet learners’ personalized needs, it is necessary not only to integrate massive educational resources into educational platforms, but also to gain a deep understanding of the objects in different subjects, such as mathematical expressions, chemical molecular expressions, and music scores. Because these subject objects are complex and diverse, accurately understanding the various types of educational resources is a very challenging task for computers. Objects in most subjects typically take the form of structured images. In educational scenarios, therefore, recognizing structured images not only provides technical support for intelligent education but also benefits research in related fields such as computer vision and natural language processing. Mathematical expression images are the most representative structured images: they contain a large number of letters, digits, and operators, and they also have complex layout structures. Compared with traditional OCR, mathematical expression recognition must not only identify every symbol in the image correctly but also determine the two-dimensional structural relationships between symbols. In real educational scenarios, handwritten mathematical expressions are widely used. Because of the randomness of handwriting and the differences between writers, handwritten mathematical expression recognition is an even more challenging task. With the development of deep learning, a growing number of studies use encoder-decoder models to recognize mathematical expressions. However, deep learning methods for mathematical expression recognition face five major challenges: obtaining fine-grained features from images, extracting complex positional relationship features, attention mechanism
drift, ambiguity in the decoding process, and generalization errors in the decoder. Based on the encoder-decoder architecture, this paper proposes four models for recognizing printed and handwritten mathematical expressions. The main contributions of this paper are summarized as follows: · To extract fine-grained features from images and the position-dependent relationships between adjacent symbols, this paper proposes an Encoder-Decoder architecture with Symbol-Level features for printed mathematical expression recognition (EDSL). The model employs an unsupervised symbol segmentation method to divide printed expressions into symbol blocks, thereby extracting symbol-level fine-grained features from the image. In addition, the symbol-level encoder uses a position correction attention mechanism to reconstruct the positional relationships between symbols. Extensive experiments verify the superior performance of EDSL in printed mathematical expression recognition. · To improve the robustness of fine-grained feature extraction and the representation of symbol position features in two-dimensional space, this paper proposes a Dynamic fEAture seLection module for printed mathematical expression recognition (DEAL). Because unsupervised symbol segmentation methods are not robust, the dynamic feature encoder first uses a convolutional neural network with a small receptive field to generate large-scale feature maps that retain fine-grained symbol features. It then dynamically selects features to remove invalid features from the feature map, which reduces the computational cost of the model without affecting its recognition performance. In addition, the dynamic feature encoder supplements the position encoding from three aspects: absolute position in two-dimensional space, relative position relationships in two-dimensional space, and two-dimensional position environment features. Extensive experiments verify that DEAL not only further
improves the performance of printed expression recognition but also achieves good accuracy in music recognition tasks. · To address the attention drift and decoding ambiguity issues, this paper proposes a Symbol Location-Aware Network for handwritten mathematical expression recognition (SLAN). The model introduces a counting method based on symbol relationships to identify the symbols in the feature map in a weakly supervised manner. After determining the symbol positions, the model rearranges the symbols in the feature map into a draft sequence that provides global context for the decoder. In addition, the model uses the symbols in the feature map as pseudo-labels and aligns the decoder outputs with the image using a dynamic programming algorithm, so that the alignment between the attention mechanism and the feature map can be trained in a supervised manner. Experimental results show that SLAN significantly improves the accuracy of handwritten expression recognition. · To address the generalization errors of the decoder, this paper proposes a Semi-Autoregressive tree decoder for handwritten Mathematical Expression Recognition (SAMER). From a global perspective, the decoder generates the symbol layout tree layer by layer in an autoregressive manner. Within each layer, the decoder generates all the nodes in a non-autoregressive manner. To address the ambiguity issue in tree decoders and enhance the model’s generalization performance, SAMER integrates a global masking task and a reverse generation task. The global masking task provides global context information to the decoder through mutual learning. Experimental results demonstrate that SAMER improves the tree decoder’s generalization performance, solves the sub-node locating problem in depth-first traversal, and achieves the best recognition performance in handwritten expression recognition tasks. In summary, this paper proposes mathematical expression recognition methods based on the encoder-decoder architecture to address the
challenges of obtaining fine-grained features, extracting complex positional relationship features, attention drift, and decoding ambiguity in expression recognition tasks. Compared with existing methods, the proposed methods achieve significant improvements in accuracy on both printed and handwritten expression recognition tasks. Furthermore, the proposed methods are transferable and can be applied to other structured image recognition tasks, such as music recognition and chemical expression recognition. |
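
The layer-by-layer generation order described for the semi-autoregressive tree decoder can be illustrated with a minimal sketch. It only shows how the nodes of a symbol layout tree group into layers, which would be emitted one layer per decoding step; it is not the SAMER model itself. The `Node` class, the `layers` function, the relation names (`sup`, `right`, `root`), and the example tree for x^2 + 1 are all assumptions introduced for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """One symbol in a symbol layout tree (hypothetical structure)."""
    symbol: str
    # Each child is stored with the spatial relation linking it to this node.
    children: list[tuple[str, "Node"]] = field(default_factory=list)


def layers(root: Node) -> list[list[tuple[str, str]]]:
    """Group (relation, symbol) pairs by tree depth, breadth-first.

    Each inner list is one layer: a semi-autoregressive decoder would
    produce the layers one after another, and all nodes inside a layer
    in a single parallel step.
    """
    result = []
    frontier = deque([("root", root)])
    while frontier:
        layer = []
        # Consume exactly the nodes that belong to the current depth.
        for _ in range(len(frontier)):
            relation, node = frontier.popleft()
            layer.append((relation, node.symbol))
            frontier.extend(node.children)
        result.append(layer)
    return result


# Assumed layout tree for x^2 + 1: "2" is a superscript of "x",
# "+" lies to the right of "x", and "1" lies to the right of "+".
tree = Node("x", [("sup", Node("2")),
                  ("right", Node("+", [("right", Node("1"))]))])

for depth, layer in enumerate(layers(tree)):
    print(depth, layer)
```

In this toy tree the decoder would need three steps (one per layer) rather than four (one per symbol), which is the kind of saving a layer-wise scheme offers over fully autoregressive, depth-first generation.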