| Image-text cross-modal retrieval uses data from one modality as a query to retrieve related data from the other modality. Semantic representation learning encodes and maps the semantic information contained in the data into high-dimensional vectors. In practice, a retrieval request provides one query and multiple candidates. The retrieval model first encodes the query and the candidates into vectors in the same feature space, then uses the distance between the query vector and each candidate vector as the relevance score, and finally returns the candidates sorted by that score. Existing research improves the retrieval model mainly in two parts: image-text feature extraction and image-text semantic alignment. For image-text feature extraction, existing work mainly uses two encoding frameworks: the dual-encoder and the cross-encoder. The dual-encoder encodes images and text independently, with no interaction between the two modalities, which makes it the most direct and fastest encoding method. The cross-encoder encodes the image and text of a pair together, with dense interaction between the two modalities, which makes it slow but highly accurate. However, because of its high computational cost, the cross-encoder cannot encode a large number of samples at once under limited computational resources, and the small number of observable samples hampers model optimization. The dual-encoder, lacking interaction between modalities, suffers from low accuracy. This thesis aims to construct a better encoder framework for feature extraction. For image-text semantic alignment, the supervision for model training comes from a large number of annotated, matched image-text pairs. These labeled pairs can be used to construct a large number of positive and negative samples
for model training. The loss function aims to pull positive image-text pairs close together in the feature space and push negative pairs apart. However, existing annotations are binary labels: an image and a text are either completely related or completely irrelevant, with no intermediate case. Such training labels are not accurate enough, which affects model optimization. This thesis aims to construct a better optimization objective for image-text semantic alignment. Based on these observations, this thesis proposes three image-text retrieval algorithms: 1. An image-text retrieval algorithm based on an asymmetric dual-encoder. This algorithm improves image-text feature extraction; the proposed asymmetric dual-encoder is an improvement on the cross-encoder. It uses a dual-encoder to assist the cross-encoder during encoding and achieves higher retrieval accuracy than the cross-encoder alone. Its effectiveness is verified on a public dataset. 2. An image-text retrieval algorithm based on knowledge distillation. This algorithm explores the application of knowledge distillation to image-text retrieval, with the cross-encoder as the teacher model and the dual-encoder as the student model. Knowledge distillation significantly improves the dual-encoder. Furthermore, an interactive knowledge distillation method is proposed, which further strengthens the effect of distillation by designing multiple representations for each image. Systematic experiments on public datasets show that interactive knowledge distillation is effective in improving image-text retrieval accuracy. 3. An image-text retrieval algorithm based on soft label learning. This algorithm designs a soft label generation method that generates soft labels containing semantic knowledge for image-text pairs with almost no increase in
computation, and these soft labels can replace the original binary labels. Systematic experiments demonstrate that soft labels improve the model's retrieval performance. In summary, this thesis proposes three algorithms that address challenges in existing retrieval tasks. Compared with the baseline model, they achieve better retrieval accuracy. This research will advance work on cross-modal image-text retrieval and provide support for industrial applications. |
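
The retrieval step summarized in the abstract (encode the query and the candidates into a shared feature space, score each candidate against the query, return the candidates in sorted order) can be illustrated with a minimal sketch. The sketch assumes cosine similarity as the relevance score and uses hand-written toy vectors in place of real encoder outputs; the function names `l2_normalize`, `cosine_similarity`, and `rank_candidates` are hypothetical and are not taken from the thesis.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity, assumed here as the relevance score."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_candidates(query_vec, candidate_vecs):
    """Return (candidate_index, score) pairs, most relevant first."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Toy embeddings standing in for encoder outputs (assumed values).
text_query = [1.0, 0.0, 1.0]
image_candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_candidates(text_query, image_candidates)
for index, score in ranking:
    print(f"candidate {index}: score {score:.3f}")
```

In a dual-encoder setting the candidate vectors can be computed once and reused across queries, which is why that framework is fast; a cross-encoder would instead have to score every query-candidate pair jointly.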