
Design And Implementation Of Cross-modal Retrieval For Video

Posted on: 2021-01-17    Degree: Master    Type: Thesis
Country: China    Candidate: H Y Zhu    Full Text: PDF
GTID: 2428330623468549    Subject: Engineering
Abstract/Summary:
In traditional data retrieval systems, queries and keys are computed from the same type of media file. In the mobile Internet era, the diversity of media has grown greatly, and a retrieval system built on a single media form can no longer meet actual market needs. TikTok and Kwai, for example, provide video search functions for their users, and the key problem is how to confirm that the text provided by video uploaders is semantically correct. The video search function in TikTok is essentially a text-database search, so it cannot retrieve semantic information from the video content itself. This thesis studies cross-modal retrieval in the video scene, in which the video itself, individual video frames, and text can retrieve one another directly. Because cross-modal research is interdisciplinary, the interaction among the dynamic information of video, the static information of video, and the text information raises many challenges, and several novel approaches are therefore introduced. The main work and innovations of this thesis are as follows.

(1) Collecting video-related cross-modal data is expensive, and the existing datasets are not sufficient to train a cross-modal retrieval model. This thesis therefore proposes to transfer general knowledge to the cross-modal retrieval model from the datasets and models used in video classification and video retrieval tasks. Besides transferring knowledge through the parameters of related models, the relevant datasets are modified to satisfy the requirements of cross-modal retrieval in the video scene (see the first sketch after this abstract).

(2) Because of the discrepancy between data of different modalities, they are hard to compare directly. For video retrieval and cross-modal retrieval in the video scene, the key problem is how to establish a common subspace in which the different modalities can be matched against each other. This thesis applies a transfer-learning method from a source domain to a target domain, originally proposed for domain adaptation: the different modalities in the common subspace are regarded as datasets with similar semantics collected from different environments, and transfer learning is carried out on this basis to improve the semantic compatibility between modalities (see the second sketch below).

(3) A video contains many kinds of information. Its dynamic and static information is important for video-related tasks, but some of the information interferes with the performance of the retrieval task, so selecting the critical information is one of the key issues for video-related tasks. To improve the effectiveness of the information used by the retrieval system, this thesis modifies the attention mechanism and applies it to selecting critical static frames and critical dynamic segments of videos, so as to suppress the interfering information (see the third sketch below).

(4) To improve retrieval speed, the final output feature vectors are hashed, but hashing inevitably loses information and affects model performance. To keep the critical information through the hashing process, this thesis uses a dictionary-learning-based method with self-learning technology to protect the critical information, improving retrieval accuracy as much as possible under the efficiency requirement (see the fourth sketch below).

Experiments are carried out on the model designed in this thesis. The experimental results show the effectiveness of the proposed methods, and they are competitive with the latest research methods on the benchmark test sets.
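Sketch 1. The thesis transfers general knowledge through the parameters of models trained on related tasks. The minimal sketch below illustrates that pattern with an ImageNet-pretrained ResNet-50 as a frame encoder; the backbone choice, the `embed_dim` value, and the freezing policy are illustrative assumptions, not details from the thesis.

```python
import torch
import torchvision.models as models

embed_dim = 256  # assumed size of the cross-modal embedding

# Initialize the frame encoder from a classification-pretrained model,
# then attach a new projection head for the cross-modal retrieval task.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Linear(backbone.fc.in_features, embed_dim)

# Freeze the transferred layers; train only the new head at first.
for name, param in backbone.named_parameters():
    param.requires_grad = name.startswith("fc")
```

After the new head converges, the transferred layers can be unfrozen with a smaller learning rate, a common fine-tuning schedule for this kind of parameter transfer.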
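Sketch 2. Treating the two modalities as two "domains" with similar semantics suggests pairing a shared-subspace projection with a domain-alignment penalty. The sketch below uses a simple linear-kernel maximum mean discrepancy (MMD) as the alignment term; the feature dimensions and the choice of MMD are assumptions standing in for the thesis's domain-adaptation method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSubspace(nn.Module):
    """Project video and text features into one shared space."""
    def __init__(self, video_dim=2048, text_dim=768, common_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, common_dim)
        self.text_proj = nn.Linear(text_dim, common_dim)

    def forward(self, video_feat, text_feat):
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return v, t

def mmd_loss(x, y):
    """Linear-kernel MMD: penalizes the distance between the mean
    embeddings of the two modalities, pulling their distributions
    together in the common subspace."""
    return (x.mean(dim=0) - y.mean(dim=0)).pow(2).sum()
```

In training, this alignment term would be added to the usual retrieval loss (e.g., a triplet or contrastive loss) so that semantic matching and distribution alignment are optimized jointly.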
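Sketch 3. For selecting critical static frames, the general mechanism is to score each frame and pool the sequence by its attention weights, so that uninformative frames contribute little to the video embedding. The sketch below shows a basic learned-scoring version; the thesis's modified attention method is not specified here, so this is only the generic pattern it builds on.

```python
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    """Score each frame, softmax the scores over time, and pool the
    sequence by the resulting weights."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frames):
        # frames: (batch, num_frames, feat_dim)
        weights = torch.softmax(self.score(frames), dim=1)  # (batch, num_frames, 1)
        return (weights * frames).sum(dim=1)                # (batch, feat_dim)
```

The same weighting idea extends to dynamic segments by scoring clip-level motion features instead of single-frame features.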
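Sketch 4. The thesis preserves critical information during hashing with a dictionary-learning-based method; that method is not reproduced here. The sketch below shows only the generic hashing step it wraps: binarizing the output vectors to ±1 codes with a straight-through estimator, plus a quantization penalty that limits the information lost at binarization.

```python
import torch

def binarize(embeddings):
    """Hash real-valued embeddings to ±1 codes. The straight-through
    trick lets gradients flow back to the encoder during training."""
    codes = torch.sign(embeddings)
    return embeddings + (codes - embeddings).detach()

def quantization_loss(embeddings):
    """Penalize embeddings far from the ±1 vertices, so the continuous
    vectors lose less information when they are hashed."""
    return (embeddings.abs() - 1.0).pow(2).mean()
```

At retrieval time the binary codes are compared by Hamming distance, which is what delivers the speedup that motivates hashing in the first place.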
Keywords/Search Tags:video retrieval, cross-modal retrieval, deep learning, transfer learning