With the development of multimedia technology and the advent of the digital era, it is imperative to introduce modalities such as vision, touch, hearing, and smell into the virtual digital space, guiding users toward a comprehensive, diversified, and immersive experience. Research on object recognition and perceptual understanding from a multimodal perspective has therefore become an urgent issue. However, existing object recognition and data restoration schemes are usually implemented with machine learning or deep learning methods in a single modality, and their essential drawback is the inability to fully exploit the complementary nature of heterogeneous multimodal information. To address these problems, this thesis first combines the visual and haptic modalities to study object recognition: it learns and estimates the mechanical representation of object surfaces by combining the content characteristics of the two modalities and deeply explores the correlations between them. The thesis then realizes the joint restoration and enhancement of visual images and haptic signals, effectively improving the user's sensory experience. The main research results are as follows:

(1) This thesis first summarizes the problems and opportunities faced by object recognition and perceptual understanding from a multimodal perspective. Through an analysis of Digital Twin and bionic technology, it establishes the significance of object recognition and perceptual understanding for strengthening the relationship between the virtual and real worlds. Through a study of deep learning and multimodal theory, it further verifies that semantic understanding and knowledge representation synthesized from multimodal data can enhance the user's sensory experience. On this basis, the thesis takes multimodal-fusion-enabled object recognition and enhancement technology as the starting point of its research and designs schemes to solve the problems above.

(2) This thesis proposes a multimodal recognition method that automatically extracts multiscale image and haptic features for multimodal fusion and uses maximum voting to obtain the predicted label of each input individual. The method deeply mines modal semantics, completes the data mapping between modalities, and establishes the correlations between them. In addition, it applies the idea of transfer learning at the data level to strengthen the commonality between heterogeneous data, accelerate the network training process, and improve the performance of the deep model. Experimental evaluations on publicly available datasets show that the proposed method achieves superior classification accuracy with reduced time complexity compared to existing object recognition methods. A minimal sketch of this pipeline is given below.
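The sketch below makes the recognition pipeline of (2) concrete; it is an illustration under stated assumptions, not the thesis's actual implementation. It assumes PyTorch, and the module names (VisualBranch, HapticBranch, FusionClassifier, vote_label), branch widths, and kernel sizes are hypothetical. It shows late fusion of multiscale visual features with haptic features, followed by maximum voting over the per-sample predictions belonging to one object.

```python
import torch
import torch.nn as nn

class VisualBranch(nn.Module):
    """Extracts multiscale visual features via parallel convolutions (hypothetical design)."""
    def __init__(self, out_dim=128):
        super().__init__()
        # Two scales: 3x3 and 5x5 receptive fields, pooled to fixed-size vectors.
        self.scale_a = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1))
        self.scale_b = nn.Sequential(nn.Conv2d(3, 16, 5, padding=2), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1))
        self.proj = nn.Linear(32, out_dim)

    def forward(self, x):                          # x: (B, 3, H, W)
        f = torch.cat([self.scale_a(x).flatten(1),
                       self.scale_b(x).flatten(1)], dim=1)
        return self.proj(f)                        # (B, out_dim)

class HapticBranch(nn.Module):
    """Extracts features from a 1-D haptic signal (e.g., force or acceleration)."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(1, 16, 7, padding=3), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1))
        self.proj = nn.Linear(16, out_dim)

    def forward(self, x):                          # x: (B, 1, T)
        return self.proj(self.net(x).flatten(1))

class FusionClassifier(nn.Module):
    """Late fusion: concatenate the two modality embeddings, then classify."""
    def __init__(self, num_classes, dim=128):
        super().__init__()
        self.visual = VisualBranch(dim)
        self.haptic = HapticBranch(dim)
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, img, sig):
        fused = torch.cat([self.visual(img), self.haptic(sig)], dim=1)
        return self.head(fused)                    # (B, num_classes) logits

def vote_label(model, imgs, sigs):
    """Maximum voting: predict each sample of one object, return the most frequent class."""
    with torch.no_grad():
        preds = model(imgs, sigs).argmax(dim=1)    # one prediction per sample
    return torch.mode(preds).values.item()         # majority class wins
```

Here the fusion is a simple concatenation of modality embeddings; the thesis's network may fuse features differently, but the voting step, taking the most frequent class across an object's samples, is as described in (2).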
(3) This thesis presents a cross-modal inpainting scheme. First, the mapping of object material parameters is completed through the multimodal fusion network proposed above. The visual inpainting network then standardizes the encoding distribution during training, ensuring that the hidden space has an interpretable and exploitable structure, and gradually reconstructs the detailed features of the visual image. At the same time, the haptic inpainting network constructs a generator and a discriminator for adversarial training and gradually fits the distribution of the haptic signal. Finally, the joint inpainting of the visual image and the haptic signal is realized by the cross-modal inpainting network. Experimental evaluations on publicly available datasets show that the proposed scheme achieves better performance at favorable speed compared to existing visual and haptic inpainting schemes.
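The "standardized encoding distribution" with an interpretable hidden space in (3) is consistent with a variational-autoencoder-style objective. The sketch below works under that assumption and is not the thesis's actual network: PyTorch, 64x64 inputs, and the names VisualInpainter and vae_loss, the latent size, and the KL weight are all hypothetical. A KL-divergence term pulls the encoding distribution toward a standard normal, one common way to give the latent space the regular structure the abstract describes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualInpainter(nn.Module):
    """VAE-style inpainter: encode the masked image, sample a latent code,
    decode a completed image. Assumes 3-channel 64x64 inputs."""
    def __init__(self, latent=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU())  # 32 -> 16
        self.to_mu = nn.Linear(64 * 16 * 16, latent)
        self.to_logvar = nn.Linear(64 * 16 * 16, latent)
        self.dec = nn.Sequential(
            nn.Linear(latent, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, masked):                     # masked: (B, 3, 64, 64)
        h = self.enc(masked).flatten(1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def vae_loss(recon, target, mu, logvar):
    """Reconstruction term plus KL term toward a standard normal prior."""
    rec = F.mse_loss(recon, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + 1e-3 * kl                         # KL weight is a tunable assumption
```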
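Likewise, the generator and discriminator trained in "confrontation" for the haptic branch correspond to standard adversarial (GAN) training. The following single training step is a hedged sketch for 1-D haptic signals, not the thesis's design: the gen and disc architectures, the learning rates, and the L1 reconstruction weight are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical 1-D generator/discriminator pair for haptic signals: the
# generator maps a masked signal (B, 1, T) to a completed signal of the
# same shape; the discriminator scores a signal as real or inpainted.
gen = nn.Sequential(nn.Conv1d(1, 32, 7, padding=3), nn.ReLU(),
                    nn.Conv1d(32, 1, 7, padding=3))
disc = nn.Sequential(nn.Conv1d(1, 32, 7, padding=3), nn.ReLU(),
                     nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(gen.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4)

def train_step(masked, real):
    """One adversarial step: D separates real from inpainted signals;
    G tries to fool D while also matching the ground truth (L1 term)."""
    fake = gen(masked)

    # Discriminator update: real signals -> 1, inpainted signals -> 0.
    d_real = disc(real)
    d_fake = disc(fake.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: fool the discriminator and stay close to the target.
    g_adv = disc(fake)
    g_loss = (F.binary_cross_entropy_with_logits(g_adv, torch.ones_like(g_adv)) +
              100.0 * F.l1_loss(fake, real))       # L1 weight is an assumption
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

The discriminator learns to separate real haptic signals from inpainted ones, while the generator is pushed both to fool the discriminator, which gradually fits the signal distribution, and to stay close to the ground truth through the L1 term.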