As one of the key carriers of information exchange and knowledge transfer, images occupy an irreplaceable position on the Internet. With the accelerating informatization of society, the volume of image data keeps growing, and image classification algorithms help people process images quickly and efficiently. Image classification is an important task in computer vision and is widely applied in fields such as transportation, security, and satellite imagery. In particular, handwritten Chinese character image recognition and classification underpins applications such as document entry, logistics sorting, and test paper correction, and therefore has significant research value. Offline handwritten Chinese characters have complex structures, a large number of character classes, and many similar characters that are easily confused; these characteristics make handwritten Chinese character recognition and classification a challenging and long-standing research problem. In recent years, Transformer models, which rely only on attention mechanisms, have been applied to vision tasks such as image classification with notable success, showing great application potential in artificial intelligence. However, Transformer-based models suffer from large parameter counts, heavy computation, high memory consumption, and slow training. To address these problems, this paper focuses on applying Transformer models to image classification by compressing and optimizing Transformer-based model structures, and applies the resulting models to the handwritten Chinese character image recognition and classification task. The research work in this paper mainly contains the following parts:

(1) First, a parallel fast vision transformer (PF-ViT) network is designed to improve the model's training speed. The PF-ViT model contains three parallel Vision Transformer structures: two-way parallel, four-way parallel, and seven-way parallel. In each structure, transformer encoders are arranged in parallel so that multiple encoders process the image patch sequence simultaneously through the multi-head attention mechanism, which improves image-processing efficiency (a minimal code sketch of this parallel-encoder design is given after part (2)). The two-way parallel Vision Transformer model, with six encoder layers per branch, achieves 98.6% accuracy on the dataset, with 8.52 G FLOPs and 85.62 million parameters in total.

(2) Then, a simplified Swin Transformer (S-Swin Transformer) network is proposed to enlarge the receptive field of window attention and reduce the number of model parameters. The number of layers is reduced by appropriately simplifying the structure of the Swin Transformer model. Most importantly, the window size in window attention is increased: each window then covers more image patches, so more patches can interact with each other through the attention mechanism (see the window-attention sketch below). Experiments demonstrate that enlarging the window size in window attention improves model performance. Moreover, the parameter count of the proposed S-Swin Transformer network is greatly reduced, to only 8.69 million, while high accuracy is maintained on the dataset.
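To make the parallel-encoder idea of part (1) concrete, the following minimal PyTorch sketch builds a two-way parallel ViT. It is a sketch under stated assumptions, not the PF-ViT implementation: the convolutional patch embedding, the averaging used to merge branch outputs, and the 3,755-class output head are illustrative choices that the abstract does not specify.

```python
import torch
import torch.nn as nn

class ParallelViTSketch(nn.Module):
    """Two-way parallel ViT sketch: independent encoder stacks process
    the same patch-token sequence side by side.

    Hypothetical details (not from the paper): conv patch embedding,
    branch outputs merged by averaging, 3,755 output classes (a common
    size for level-1 Chinese character sets)."""

    def __init__(self, img_size=224, patch=16, dim=768, depth=6,
                 heads=12, branches=2, num_classes=3755):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_tokens = (img_size // patch) ** 2 + 1  # patches + class token
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens, dim))
        # Each branch is an independent stack of `depth` encoder layers;
        # the branches can run concurrently on the same token sequence.
        self.branches = nn.ModuleList(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True),
                num_layers=depth)
            for _ in range(branches))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        outs = [branch(tokens) for branch in self.branches]
        merged = torch.stack(outs).mean(dim=0)   # assumed merge: average
        return self.head(merged[:, 0])           # classify from class token
```

For example, `ParallelViTSketch()(torch.randn(2, 3, 224, 224))` yields a `(2, 3755)` logit tensor. Because the branches are mutually independent, their forward passes can be dispatched in parallel, which is the source of the efficiency gain the design targets.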
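The enlarged-window idea in part (2) can be illustrated with a standard Swin-style window partition. The sketch below omits the relative position bias and shifted windows of the full Swin design, and its dimensions are placeholders rather than the actual S-Swin configuration; it only shows how `window_size` controls how many patch tokens attend to one another.

```python
import torch
import torch.nn as nn

class WindowAttentionSketch(nn.Module):
    """Swin-style window attention sketch: tokens attend only within
    non-overlapping windows, so window_size sets how many patch tokens
    interact. Enlarging window_size is the S-Swin idea illustrated here;
    relative position bias and shifted windows are omitted."""

    def __init__(self, dim=96, heads=3, window_size=14):
        super().__init__()
        self.window = window_size
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C) grid of patch tokens; H and W divisible by window
        B, H, W, C = x.shape
        w = self.window
        # Partition the grid into (num_windows * B, w * w, C) token groups.
        xw = (x.view(B, H // w, w, W // w, w, C)
               .permute(0, 1, 3, 2, 4, 5)
               .reshape(-1, w * w, C))
        out, _ = self.attn(xw, xw, xw)  # self-attention inside each window
        # Reverse the partition back to the (B, H, W, C) grid.
        out = (out.reshape(B, H // w, W // w, w, w, C)
                  .permute(0, 1, 3, 2, 4, 5)
                  .reshape(B, H, W, C))
        return out
```

On a 56 x 56 grid of patch tokens, `window_size=7` (the common Swin default) lets 49 tokens interact per window, while `window_size=14` raises that to 196, which is the enlargement effect the S-Swin design exploits.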
(3) Finally, a lightweight Vision Transformer (LW-ViT) model is proposed to reduce the model's complexity and memory footprint and make it easier to deploy on small devices. The LW-ViT model is based on the framework of the MobileViT model, with appropriate optimization of its structure: the number of MV2 layers and LW-ViT blocks is reduced, where the MV2 layer is the inverted residual block (sketched below). As a result, the LW-ViT model has a significantly lower parameter count of only 0.48 million and 0.22 G FLOPs, while still achieving good accuracy.
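For reference, the MV2 layer mentioned above is the MobileNetV2-style inverted residual block that MobileViT-style models interleave with transformer blocks. A generic PyTorch sketch follows; the channel counts and expansion ratio are placeholders, not the LW-ViT configuration.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual ('MV2') block:
    1x1 pointwise expansion -> 3x3 depthwise conv -> 1x1 pointwise
    projection, with a residual skip when the stride is 1 and the
    channel count is unchanged."""

    def __init__(self, c_in, c_out, stride=1, expand=4):
        super().__init__()
        hidden = c_in * expand
        self.use_skip = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),           # expand
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),             # depthwise
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, c_out, 1, bias=False),          # project
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_skip else y

# e.g. InvertedResidual(32, 32)(torch.randn(1, 32, 64, 64)) keeps shape
```

The depthwise 3x3 convolution is what keeps this block cheap: its cost grows with the channel count rather than its square, which is why trimming the number of MV2 layers and transformer blocks can push a MobileViT-style model down to the sub-million parameter range reported for LW-ViT.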