A variety of sensory information (such as vision, hearing, and touch) helps humans sense their surroundings and perform tasks more accurately and efficiently, and machines equipped with multiple sensing devices should possess comparable capabilities. Achieving this is also a vital step towards general artificial intelligence. Conventional multimodal machine learning methods are limited by factors such as representation learning, fusion mechanisms, and data scale. They are notably deficient in joint learning across the typical image, sound, and text modalities, and have therefore focused on simple laboratory conditions with specific task objectives. At present, with the improved semantic representation ability of machine models and the growth of data scale, the requirements of machine multimodal perception are characterized by semantic fusion, realistic scenarios, and diversified tasks. These bring many crucial problems that need to be solved, such as proposing more effective intermodal fusion mechanisms, exploring new data-driven learning paradigms, and expanding real-world applications of machine multimodal perception. Inspired by the multi-channel perception theory of the brain, this thesis studies machine multimodal perception theory and related applications, and obtains several new methods and results. The main contributions of this thesis are as follows:

1. A recurrent temporal model based on the multimodal restricted Boltzmann machine is developed. This model extends existing probabilistic deep networks to sequence modeling, where the shared hidden layers of the Boltzmann machines at each time step are connected backward to model the entire multimodal sequence. Taking maximization of the joint probability over the multimodal sequence as the learning objective, specific methods for representation inference and parameter learning are derived. Since the Boltzmann machines at different time steps exchange information only through the shared-layer connections, inference and learning are simpler and more effective than in traditional models. Experiments confirm the validity of its temporal modeling and show that it can overcome audio noise interference to some extent.

2. A semantic similarity objective for multimodal learning is proposed. The objective exploits the essential property of semantic similarity between multimodal data to develop a joint learning criterion for multimodal models. By making the high-level representations of different modalities have similar activations, the effects of unimodal noise can be reduced, thereby enhancing the effectiveness of the joint representation. Experimental results show that the semantic similarity objective can be used in a variety of deep fusion networks and related tasks, and yields consistent improvement over traditional maximum likelihood learning.

3. A dense multimodal fusion method for hierarchical joint representation learning is developed. To combine the advantages of early and intermediate deep fusion, this method hierarchically fuses the networks of different modalities. The high-level joint representation depends not only on the fusion in the same layer but also on the lower-level ones. Analysis indicates that the resulting multiple learning paths benefit both multimodal correlation learning and cross-modal supervised learning. Experiments show that this fusion strategy achieves faster convergence, lower training error, and higher precision.

4. A new task, image2song retrieval, and related implementations are proposed. The task is motivated by the requirements of multimedia presentation. To accomplish it, a series of methods are proposed, including constructing a large-scale image-lyrics database, modeling the correlation between lyric sequences and image contents, and narrowing the content gap with a tag attention mechanism. On the constructed test database, the model returns suitable songs whose lyrics fit the image query, which meets the practical requirements of multimedia presentation to some extent.

5. A self-supervised audiovisual learning method based on deep multimodal clustering is proposed. To address audiovisual learning in complex scenarios, this method thoroughly mines audio and visual components and performs fine-grained correspondence learning among them, where multiple sets of clustering over the multimodal vectors of convolutional maps are performed synchronously in different shared spaces. The integrated multimodal clustering network can be trained effectively with a max-margin loss in an end-to-end fashion. Experiments show that the model has semantic discriminative ability for unimodal data and can also be used in various practical audiovisual tasks.

6. Machine cross-modal perception models for the blind audiovisual environment are proposed. Inspired by auditory sensory substitution devices, two distinct cross-modal perception models are proposed for the late-blind and congenitally-blind conditions, which aim to generate concrete visual contents from the translated sound. Through sets of experiments on a variety of improved visual-to-auditory encoding schemes, the machine models are validated to provide assessment results that are reliable compared with those of humans. As such machine models are efficient, economical, and convenient, they can greatly promote the upgrading of encoding schemes and thereby help blind people improve their visual perception ability.
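
The idea behind the semantic similarity objective in contribution 2 can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the mean squared difference used as the similarity term, the `weight` parameter, and the hypothetical names `similarity_penalty` and `joint_objective`. `task_loss` stands in for whatever base loss the fusion network already uses.

```python
from typing import Sequence


def similarity_penalty(h_a: Sequence[float], h_b: Sequence[float]) -> float:
    """Mean squared difference between two high-level activation vectors.

    Assumption: a squared-difference penalty is only one possible way to
    encourage "similar activations"; the thesis may use a different measure.
    """
    if len(h_a) != len(h_b) or len(h_a) == 0:
        raise ValueError("activation vectors must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(h_a, h_b)) / len(h_a)


def joint_objective(
    task_loss: float,
    h_audio: Sequence[float],
    h_visual: Sequence[float],
    weight: float = 0.5,
) -> float:
    """Base loss plus a weighted penalty on cross-modal activation mismatch.

    Assumption: `weight` is a hypothetical trade-off hyperparameter.
    """
    return task_loss + weight * similarity_penalty(h_audio, h_visual)


if __name__ == "__main__":
    # Two toy high-level representations of the same (audio, visual) sample.
    h_audio = [1.0, 0.0]
    h_visual = [0.0, 0.0]
    print(similarity_penalty(h_audio, h_visual))
    print(joint_objective(1.0, h_audio, h_visual, weight=0.5))
```

Under this sketch, a noisy modality whose high-level activations drift away from those of its paired modality increases the penalty, which pushes the two representations back toward agreement during training.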