
Research On Video-Speech Retrieval Based On Multimodal Feature Memory Library

Posted on: 2022-12-16  Degree: Master  Type: Thesis
Country: China  Candidate: J B Li  Full Text: PDF
GTID: 2518306779496614  Subject: Computer Software and Computer Applications
Abstract/Summary:
With the rapid development of China's digital economy and the steady improvement of living standards among urban and rural residents, smart devices have become inseparable from many people's daily lives and consequently generate huge amounts of multimedia data. However, common search engines work in a unimodal manner, which makes information retrieval inefficient and difficult in multimodal scenarios. Cross-modal search is therefore not only an urgent need for convenient information retrieval, but also in line with how the Internet is developing in the new era. Nevertheless, few cross-modal retrieval methods build a semantic bridge between video and speech information to enable cross-modal retrieval of video and speech.

To address this gap, this thesis conducts an in-depth study of video-speech cross-modal retrieval. Inspired by humans' ability to learn and to retain memories over the long term, it innovatively proposes the concept of a feature memory library and designs a video-speech retrieval framework based on it. By functionality, the framework is divided into a feature extraction module, a feature mapping and fusion module, and a feature memory library module. Within this framework, two models, CRSML and ILCSML, are proposed; their feature mapping and fusion modules are implemented and optimized with different methods.

The feature extraction module processes the video and speech information input to the model and obtains the corresponding original features through a dual-stream I3D network and a Bi-LSTM, respectively. The feature mapping and fusion module first maps the two modalities' features into a common semantic space, then fuses the video and speech features in that space to produce a preliminary judgment score. Finally, in the feature memory library module, the original features of the two modalities are matched against the corresponding feature memory libraries, which are updated according to certain strategies, to obtain memory vectors; the video and speech memory vectors then correct the initial scores to yield the final judgment scores that measure the similarity between video and speech information.

To support the experiments, this thesis also extends the MPII Cooking 2 dataset with a large number of speech recordings. Comparative experiments and analysis show that the proposed feature-memory-based video-speech retrieval framework not only achieves better bidirectional video-speech retrieval, but also, through its outstanding performance relative to similar methods on a test set containing samples of unknown categories, further validates the effectiveness of the feature memory library. Meanwhile, the improved and optimized feature mapping and fusion modules bring significant performance improvements to the models.
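The three-module pipeline described above can be summarized in code. The following PyTorch sketch is illustrative only: the module sizes, the attention-based memory read, the momentum slot update, and the additive score correction are assumptions made for exposition, not the thesis's actual CRSML/ILCSML implementations.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MappingFusion(nn.Module):
        """Feature mapping and fusion: project both modalities into a common
        semantic space, then fuse them into a preliminary judgment score."""
        def __init__(self, video_dim=1024, speech_dim=512, common_dim=256):
            super().__init__()
            self.video_proj = nn.Linear(video_dim, common_dim)
            self.speech_proj = nn.Linear(speech_dim, common_dim)
            self.scorer = nn.Sequential(
                nn.Linear(common_dim * 2, common_dim), nn.ReLU(),
                nn.Linear(common_dim, 1))

        def forward(self, video_feat, speech_feat):
            v = F.normalize(self.video_proj(video_feat), dim=-1)
            s = F.normalize(self.speech_proj(speech_feat), dim=-1)
            return self.scorer(torch.cat([v, s], dim=-1)).squeeze(-1)

    class FeatureMemoryLibrary:
        """One library per modality: a bank of stored feature slots. Attention
        over the bank yields a memory vector; a momentum rule updates slots."""
        def __init__(self, num_slots=128, feat_dim=1024, momentum=0.9):
            self.memory = F.normalize(torch.randn(num_slots, feat_dim), dim=-1)
            self.momentum = momentum

        def read(self, feat):                        # feat: (B, feat_dim)
            attn = torch.softmax(feat @ self.memory.t(), dim=-1)
            return attn @ self.memory                # memory vectors, (B, feat_dim)

        def update(self, feat):
            # One plausible update strategy: pull each feature's best-matching
            # slot toward that feature with momentum.
            idx = (feat @ self.memory.t()).argmax(dim=-1)
            for i, j in enumerate(idx):
                blended = (self.momentum * self.memory[j]
                           + (1 - self.momentum) * feat[i])
                self.memory[j] = F.normalize(blended, dim=0)

    class ScoreCorrector(nn.Module):
        """Corrects the preliminary score using the two modalities' memory
        vectors to produce the final judgment score."""
        def __init__(self, video_dim=1024, speech_dim=512):
            super().__init__()
            self.head = nn.Sequential(
                nn.Linear(video_dim + speech_dim, 128), nn.ReLU(),
                nn.Linear(128, 1))

        def forward(self, prelim, video_mem, speech_mem):
            delta = self.head(torch.cat([video_mem, speech_mem], dim=-1)).squeeze(-1)
            return prelim + delta

    # Usage sketch: original features would come from the dual-stream I3D
    # (video) and Bi-LSTM (speech) extractors; random tensors stand in here.
    video_feat, speech_feat = torch.randn(4, 1024), torch.randn(4, 512)
    fusion, corrector = MappingFusion(), ScoreCorrector()
    video_lib = FeatureMemoryLibrary(feat_dim=1024)
    speech_lib = FeatureMemoryLibrary(feat_dim=512)
    prelim = fusion(video_feat, speech_feat)
    final = corrector(prelim, video_lib.read(video_feat), speech_lib.read(speech_feat))

Under these assumptions, the memory library acts as a long-term store decoupled from the scoring network, which is one way to explain the reported robustness on unknown-category test samples: even when the fusion module has never seen a category, the memory read can still relate its features to previously stored ones.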
Keywords/Search Tags:deep learning, cross-modal retrieval, feature memory library