
Research On Cross-modal Retrieval Of Speech And Image Based On Deep Neural Network

Posted on: 2020-05-31
Degree: Master
Type: Thesis
Country: China
Candidate: M Guo
Full Text: PDF
GTID: 2428330623455825
Subject: Signal and Information Processing

Abstract/Summary:
With the rapid development of information technology, large volumes of multimodal data such as text, images, audio, and video are published and transmitted over information networks every day. These huge and complex data bring rich content, but they also pose great challenges to information retrieval, especially mutual retrieval across different modalities. Mainstream search engines generally perform single-modality retrieval, which captures only part of the characteristics of a cognitive object and therefore limits data analysis. Because multimodal information is correlated, multimodal processing makes comprehensive use of all the kinds of information contained in the data and alleviates the low accuracy and poor adaptability of single-modality processing. How to directly exploit the rich content of multimodal data, construct multimodal feature extraction and correlation representations, and realize mutual retrieval of multimodal information is therefore an urgent problem in current applications.

Among multimodal data, "seeing" and "listening" are the two most important ways in which human beings understand the world and obtain information. With the development of speech technology, speech input devices have become popular in many areas, and speech is widely used in modern information networks through mobile phones, computers, household appliances, the Internet of Things, and so on. Speech-image retrieval is also of great practical value in daily life, for example in early childhood education, assistance for the visually impaired, intelligent speech interaction, and translation between different languages without bilingual annotations. Speech-image cross-modal retrieval is thus an issue of great research value and technical challenge, so this paper focuses on two common and abundant modalities: image and audio.

To achieve cross-modal retrieval of speech and images, each modality must first be represented by a feature vector; that is, speech is mapped into a speech feature space and images into an image feature space. These two feature spaces are not directly related, and associating the two kinds of heterogeneous data in feature space is the key to image-speech cross-modal retrieval. Conventional machine learning methods struggle to do this, whereas the deep learning methods developed in recent years provide a feasible technical route. Unlike conventional methods, deep networks trained on a large number of speech-image sample pairs can map the two feature spaces into the same multimodal space, so that features from different modalities can be computed within one framework and their correlation can be established.

Following these ideas, and aiming at cross-modal retrieval of speech and images, this paper carries out research on three aspects. The main work is as follows:

1) Multimodal data similarity measurement. In cross-modal retrieval there is a "heterogeneity" gap between data of different modalities, and measuring the similarity between them is a key issue. This paper proposes a similarity measurement method for multimodal data: a neural network extracts the high-level semantic information of speech and images, the speech features and image features are fused, and the similarity of the fused multimodal features is then measured, which improves the performance of cross-modal retrieval. A minimal illustrative sketch follows.
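As an illustration of the fused-feature similarity measurement in 1), the sketch below concatenates a speech embedding with an image embedding and maps the result to a scalar match score. The layer sizes, the 512-dimensional inputs, and the concatenation-based fusion are assumptions made for illustration, not the exact architecture of the thesis.

```python
import torch
import torch.nn as nn

class FusedSimilarity(nn.Module):
    """Fuse a speech embedding with an image embedding and score the pair."""
    def __init__(self, speech_dim=512, image_dim=512, hidden_dim=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(speech_dim + image_dim, hidden_dim),  # joint projection
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),                       # scalar match score
        )

    def forward(self, speech_emb, image_emb):
        # Concatenate the two modality embeddings and map them to a score;
        # higher means a better speech-image match.
        fused = torch.cat([speech_emb, image_emb], dim=-1)
        return self.fuse(fused).squeeze(-1)

# Usage: score a toy batch of 4 candidate speech-image pairs.
model = FusedSimilarity()
speech = torch.randn(4, 512)  # stand-in speech embeddings
image = torch.randn(4, 512)   # stand-in image embeddings
print(model(speech, image))   # tensor of 4 match scores
```

Scoring fused features lets a single network attend to both modalities at once, at the cost of re-running the scorer for every candidate pair at retrieval time.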
2) Key information screening and redundancy elimination. Audio descriptions contain both key and redundant information, and recognizing the key information in speech and extracting effective features is one of the central issues in speech-image cross-modal retrieval. This paper proposes a method for screening key speech information and eliminating redundancy: the audio features are extracted with a one-dimensional convolutional neural network over Mel-frequency cepstrum coefficients, which effectively improves the accuracy of cross-modal retrieval (see the first sketch after this abstract).

3) Image-speech cross-retrieval modeling. For speech and image data, modeling the non-linear correlation between the two modalities is a key and difficult problem. This paper proposes an image-speech cross-retrieval modeling method: a deep neural network fits the non-linear correlation between speech and images, and through paired training on speech-image data a correlation model between the two modalities is established directly, realizing cross-retrieval between them (see the second sketch after this abstract).

Experiments are carried out on remote sensing image-speech datasets and natural scene image-speech datasets. The results show that the average accuracy of the proposed algorithm is 5.54% and 3.71% higher than that of the traditional algorithm on the MIRFlickr-25K and MSCOCO datasets, respectively. The proposed speech-image cross-modal retrieval algorithm makes human-computer interaction more convenient and can even support emotional interaction; it is an efficient, practical, and fast direction for information retrieval.
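The first sketch, referenced in 2), shows a plausible MFCC-plus-one-dimensional-CNN speech front end. The layer widths, kernel sizes, and the use of global max pooling as a crude form of redundancy elimination are illustrative assumptions, not the thesis's exact network.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """1-D convolutions over MFCC frames -> fixed-length speech embedding."""
    def __init__(self, n_mfcc=13, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2),  # local spectral patterns
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Global max pooling keeps the strongest activation per channel, a
        # crude way to retain salient frames and discard redundant ones.
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, mfcc):                  # mfcc: (batch, n_mfcc, frames)
        h = self.pool(self.conv(mfcc))        # (batch, 128, 1)
        return self.proj(h.squeeze(-1))       # (batch, embed_dim)

# Extract MFCCs from one second of (synthetic) 16 kHz audio and encode it.
y = np.random.randn(16000).astype(np.float32)           # stand-in waveform
mfcc = librosa.feature.mfcc(y=y, sr=16000, n_mfcc=13)   # (13, frames)
x = torch.from_numpy(mfcc).float().unsqueeze(0)         # (1, 13, frames)
print(SpeechEncoder()(x).shape)                         # torch.Size([1, 512])
```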
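The second sketch, referenced in 3), illustrates paired speech-image training: two encoders map their modalities into a shared space, and an in-batch triplet ranking loss pulls matched pairs together while pushing mismatched ones apart. The stand-in linear encoders, the margin value, and the loss form are assumptions; the thesis's exact model and objective may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in encoders with hypothetical sizes; the real ones are deep networks.
speech_encoder = nn.Linear(13 * 100, 512)   # e.g. flattened MFCC frames
image_encoder = nn.Linear(2048, 512)        # e.g. CNN image features

def triplet_ranking_loss(s, v, margin=0.2):
    """In-batch triplet loss: each matched speech-image pair must outscore
    every mismatched pair in its row/column by at least `margin`."""
    s = F.normalize(s, dim=-1)
    v = F.normalize(v, dim=-1)
    scores = s @ v.t()                        # cosine similarities, all pairs
    pos = scores.diag().unsqueeze(1)          # matched-pair scores
    cost_s = (margin + scores - pos).clamp(min=0)        # speech -> image
    cost_v = (margin + scores - pos.t()).clamp(min=0)    # image -> speech
    mask = torch.eye(scores.size(0), dtype=torch.bool)   # ignore the diagonal
    return cost_s.masked_fill(mask, 0).mean() + cost_v.masked_fill(mask, 0).mean()

# One optimization step on a toy batch of 8 paired samples.
params = list(speech_encoder.parameters()) + list(image_encoder.parameters())
opt = torch.optim.Adam(params, lr=1e-4)
s = speech_encoder(torch.randn(8, 13 * 100))
v = image_encoder(torch.randn(8, 2048))
loss = triplet_ranking_loss(s, v)
loss.backward()
opt.step()
print(float(loss))
```

After training, retrieval reduces to a nearest-neighbor search: embed the query in one modality and rank the candidates of the other modality by cosine similarity in the shared space.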
Keywords/Search Tags:Cross-Modal Retrieval, Convolutional Neural Network, Multi-Modal Learning, Mel-Frequency Cepstrum Coefficients, Speech Signal Processing