The Transformer is a deep neural network built on the self-attention mechanism and was first widely used in natural language processing. In recent years, researchers inspired by the Transformer have begun applying it to computer vision. Thanks to their strong global context modeling ability, Transformer-based deep vision models have achieved excellent performance on a range of computer vision tasks compared with traditional convolutional neural networks. However, the Transformer lacks a mechanism for analyzing local information, so it can overlook important local features in an image. In addition, the architecture models sparse and redundant features, which leads to a large amount of unnecessary computation and an excessive number of parameters. This thesis therefore analyzes and addresses these problems by combining the Transformer with convolutional neural networks. The main research contributions are as follows:

(1) An MFTNet network model is designed. First, to compensate for the Transformer's missing local inductive bias, a multi-scale convolution module is designed that extracts multi-scale features from the image before those features take part in context modeling, so that the modeled features are both local and rich, which effectively improves the utilization of image information. Second, because the optimization signal cannot be backpropagated effectively to the multi-scale convolution module when the model is too large, a multi-scale loss function is designed that lets the loss of the multi-scale convolution module participate in optimizing the whole model, aligning the optimization direction of the multi-scale convolution module with that of the Transformer module. Finally, tests on the Caltech-256 and ImageNet-100 datasets verify the feasibility of the MFTNet network model; compared with other network models, MFTNet also needs fewer training iterations to converge. (Sketches of the multi-scale convolution module and the multi-scale loss follow this abstract.)

(2) A Kmeans-DETR network model is designed. First, because DETR models sparse and redundant features in object detection tasks, causing a large amount of wasted computation and excessive model parameters, a local K-means clustering method is designed, based on the working principle of convolutional neural networks, to reduce the number of sparse and redundant features. Second, a multi-scale local K-means clustering method is designed to resolve the information inconsistency of local K-means clustering while introducing multi-scale image information. Then, to accommodate the changes in the spatial positions of features after multi-scale K-means clustering, a matching position-encoding scheme is designed. Finally, experiments on the COCO dataset demonstrate the object detection ability of the proposed model along with a reduction in the number of model parameters. (Sketches of the local K-means clustering step and of the adapted position encoding also follow.)

(3) An object detection system based on Kmeans-DETR is designed and implemented. First, the Kmeans-DETR network model for object detection is constructed; then the individual modules of the system are introduced; finally, a case study shows that the system is efficient and feasible.
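The abstract does not give the exact structure of the multi-scale convolution module, so the following PyTorch sketch shows only one minimal way to realize the idea: parallel convolution branches with 3x3, 5x5, and 7x7 kernels whose outputs are concatenated channel-wise before being flattened into tokens for the Transformer. All module names, channel counts, and kernel sizes here are illustrative assumptions, not the thesis' design.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Parallel convolution branches with different kernel sizes; their
    outputs are concatenated so each spatial location carries features
    from several receptive fields before attention-based modeling.
    (Illustrative sketch, not the thesis' exact module.)"""

    def __init__(self, in_ch: int, branch_ch: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, k, padding=k // 2),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch keeps the spatial size; concatenate along channels.
        return torch.cat([b(x) for b in self.branches], dim=1)

# Example: turn the multi-scale feature map into a token sequence
# that a Transformer encoder can consume.
x = torch.randn(2, 3, 224, 224)            # a batch of RGB images
msconv = MultiScaleConv(3, 32)             # 3 branches -> 96 channels
feats = msconv(x)                          # (2, 96, 224, 224)
tokens = feats.flatten(2).transpose(1, 2)  # (2, 224*224, 96)
```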
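Likewise, the exact form of the multi-scale loss is not specified in the abstract. A common way to let an early module's loss participate in whole-model optimization is an auxiliary head on the multi-scale features whose loss is added to the main loss as a weighted sum; the sketch below assumes a classification setting, and `aux_weight` is a hypothetical hyperparameter rather than a value from the thesis.

```python
import torch
import torch.nn.functional as F

def multiscale_loss(main_logits, aux_logits, targets, aux_weight=0.3):
    """Total loss = main Transformer head loss + weighted auxiliary loss
    computed on the multi-scale convolution features. The auxiliary term
    gives the early convolution module a short gradient path, so its
    parameters are optimized in the same direction as the Transformer.
    (`aux_weight` is a hypothetical hyperparameter.)"""
    main = F.cross_entropy(main_logits, targets)
    aux = F.cross_entropy(aux_logits, targets)
    return main + aux_weight * aux

# Usage: logits from the Transformer head and from a small auxiliary
# head (e.g. pooled multi-scale features followed by a linear layer).
main_logits = torch.randn(8, 100)          # e.g. 100 ImageNet-100 classes
aux_logits = torch.randn(8, 100)
targets = torch.randint(0, 100, (8,))
loss = multiscale_loss(main_logits, aux_logits, targets)
```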
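For the local K-means clustering in Kmeans-DETR, the following sketch illustrates one plausible reading of the abstract: partition the backbone feature map into non-overlapping local windows, run a few Lloyd iterations inside each window, and keep only the k centroids per window, so the token sequence entering the DETR encoder shrinks. The window partitioning, centroid initialization, and iteration count are assumptions for illustration, not the thesis' algorithm.

```python
import torch

def local_kmeans_tokens(feat, window=4, k=4, iters=5):
    """Cluster tokens inside each non-overlapping window of the feature
    map and keep only k centroids per window, shrinking the sequence fed
    to the DETR encoder from H*W to (H//window)*(W//window)*k tokens.

    feat: (B, C, H, W); H and W must be divisible by `window`.
    Returns: (B, N, C) reduced token sequence. (Illustrative sketch.)"""
    B, C, H, W = feat.shape
    # Split the map into local windows: (B*nWin, window*window, C).
    t = feat.reshape(B, C, H // window, window, W // window, window)
    t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, window * window, C)

    # Initialize centroids with the first k tokens of each window.
    cent = t[:, :k, :].clone()
    for _ in range(iters):
        # Assign each token to its nearest centroid (Lloyd step).
        d = torch.cdist(t, cent)                  # (B*nWin, w*w, k)
        assign = d.argmin(dim=2)
        onehot = torch.nn.functional.one_hot(assign, k).float()
        counts = onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)
        # Recompute each centroid as the mean of its assigned tokens.
        cent = torch.einsum("nwk,nwc->nkc", onehot, t) / counts

    return cent.reshape(B, -1, C)                 # (B, nWin*k, C)

# Usage: a 32x32 backbone map of 1024 tokens shrinks to 256 tokens.
feat = torch.randn(2, 256, 32, 32)                # e.g. a DETR backbone output
tokens = local_kmeans_tokens(feat)                # (2, 256, 256)
```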
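Finally, because a centroid no longer sits on the original pixel grid, the clustered tokens need adapted position encodings. One natural reconstruction, shown below, is to average the coordinates of each centroid's member tokens and feed the resulting fractional positions into a DETR-style sinusoidal embedding; this averaging scheme is again only an illustrative assumption about what the thesis' position-encoding design might look like.

```python
import torch

def clustered_pos_encoding(coords, assign, k, dim=256, temperature=10000.0):
    """Position encoding for cluster centroids: each centroid's position
    is the mean (x, y) of its assigned tokens, embedded with the usual
    sinusoidal scheme. `coords` is (N, P, 2) pixel coordinates of the
    original tokens and `assign` is (N, P) cluster ids, as produced by a
    local K-means step. (Illustrative sketch, not the thesis' scheme.)"""
    onehot = torch.nn.functional.one_hot(assign, k).float()      # (N, P, k)
    counts = onehot.sum(dim=1).clamp(min=1).unsqueeze(-1)        # (N, k, 1)
    cpos = torch.einsum("npk,npc->nkc", onehot, coords) / counts # (N, k, 2)

    half = dim // 2
    freqs = temperature ** (2 * torch.arange(half // 2).float() / half)

    def embed(v):
        # Sinusoidal embedding of one coordinate axis: (N, k) -> (N, k, half).
        a = v.unsqueeze(-1) / freqs
        return torch.cat([a.sin(), a.cos()], dim=-1)

    # Concatenate x- and y-axis embeddings into a dim-channel encoding.
    return torch.cat([embed(cpos[..., 0]), embed(cpos[..., 1])], dim=-1)

# Usage: 16 tokens per window, 4 clusters -> one encoding per centroid.
coords = torch.rand(64, 16, 2) * 32        # token (x, y) positions per window
assign = torch.randint(0, 4, (64, 16))     # cluster ids from the K-means step
pos = clustered_pos_encoding(coords, assign, k=4)   # (64, 4, 256)
```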