With the continuous development of intelligent voice interaction applications, the scenarios and environments involved in interaction have become increasingly complex, and users have ever higher expectations for the interactive experience, which poses a significant challenge to intelligent interaction technology. Driven by the rapid development of deep learning, much research based on a single modality has achieved significant progress. Although intelligent interaction with machines can be established on a single modality, people prefer machines to serve them in a more natural way through multiple modalities, just as humans do. In real life, humans live in an environment where multiple modalities interact with each other, completing interaction and information exchange by fusing information from different modalities. Although the information carried by different modalities varies greatly, each modality contains specific information, and multimodal fusion can provide richer and more comprehensive information than any single modality. Reasonable multimodal information fusion can help people better understand the problems of interest and can help analyze and improve system performance. In addition, research based on multiple modalities is closer to real-world conditions and can improve the interaction experience. Among the modalities, audio and video, as the two most commonly used and important natural modalities, have received extensive attention. Against this background, in order to address the problem of multimodal audio-visual fusion and its application in complex environments, this thesis studies deep-learning-based audio-visual fusion methods and model pruning techniques, covering both fusion methodology and practical application.

Firstly, information in different modalities is correlated, and modeling the fused representations hidden across modalities can improve the performance of multimodal applications; however, how to perform reasonable information fusion remains an open issue. Since audio and video are the two most commonly used and important modalities in nature, selecting and designing reasonable and effective audio-visual fusion methods is a difficult problem. Motivated by this, this thesis explores deep-learning-based audio-visual fusion methods and introduces factored bilinear pooling (FBP) for fusion modeling, validated on the audio-visual emotion recognition task. Experiments and in-depth analysis show not only that FBP achieves better performance than simple fusion methods, but also, through sample analysis, that fusing the audio and video modalities exploits the complementarity of their information. Building on the emotion classification results of the audio-visual system, this thesis further improves the FBP fusion method: considering the different contributions of the audio and video modalities to the final decision, as well as the need for fine-grained fusion modeling, an adaptive, multi-level FBP fusion method is proposed. The proposed method is validated on two public audio-visual emotion recognition datasets, and experimental results show consistent performance improvements on both. In addition, this thesis analyzes how the adaptive modality weights relate to classification accuracy across emotion categories and visualizes the hidden-layer embeddings of the network; the results demonstrate the advantages of the proposed improvements.
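As a rough illustration of the fusion mechanism described above, the following is a minimal sketch of factored bilinear pooling in PyTorch. The layer sizes, variable names, and normalization details are illustrative assumptions rather than the exact configuration used in this thesis; the improved variant additionally learns adaptive modality weights and applies the pooling at multiple feature levels, which the sketch omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactorizedBilinearPooling(nn.Module):
    """Fuses an audio feature vector and a video feature vector with FBP."""

    def __init__(self, audio_dim=256, video_dim=256, factor_dim=512, out_dim=128):
        super().__init__()
        assert factor_dim % out_dim == 0
        self.pool_size = factor_dim // out_dim
        # Project each modality into a shared factor space.
        self.audio_proj = nn.Linear(audio_dim, factor_dim)
        self.video_proj = nn.Linear(video_dim, factor_dim)
        self.dropout = nn.Dropout(0.1)

    def forward(self, audio_feat, video_feat):
        # The element-wise product in the factor space approximates the full
        # (and much larger) bilinear interaction between the two modalities.
        fused = self.audio_proj(audio_feat) * self.video_proj(video_feat)
        fused = self.dropout(fused)
        # Sum-pool groups of factors to obtain a compact fused representation.
        fused = fused.view(fused.size(0), -1, self.pool_size).sum(dim=-1)
        # Power normalization followed by L2 normalization, as is common for
        # bilinear pooling features.
        fused = torch.sign(fused) * torch.sqrt(torch.abs(fused) + 1e-12)
        return F.normalize(fused, dim=-1)


# Hypothetical usage: fuse one audio and one video embedding per utterance.
audio_feat = torch.randn(8, 256)
video_feat = torch.randn(8, 256)
fusion = FactorizedBilinearPooling()
print(fusion(audio_feat, video_feat).shape)  # torch.Size([8, 128])
```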
Secondly, the rise of deep learning is closely tied to big data. Through experimental studies, Google researchers have pointed out that the success of deep learning in vision stems not only from advances in network architectures and hardware computing resources but also from large-scale labeled data; similarly, the recently popular ChatGPT relies on massive amounts of data for parameter training. For audio-visual multimodal tasks, because of complex recording conditions and the high costs of recording, alignment, and labeling, the amount of available audio-visual data is often very limited, whereas audio-only data is usually abundant and offers substantial single-modality data-driven potential. Motivated by this, this thesis proposes a cross-modal teacher-student learning information fusion method based on Kullback-Leibler (KL)-regularized cross-entropy, which combines the rich acoustic information in massive audio-only data with the robustness of the video modality to noise, further improving the performance of audio-visual fusion systems. Because the emotion recognition task itself has extremely limited single-modality data, this thesis builds on the factored bilinear pooling fusion network and validates the method on two other tasks, audio-visual voice activity detection and audio-visual wake word spotting; the results demonstrate the effectiveness of the proposed method.

Finally, this thesis considers issues in the practical application of the models. Although introducing the video modality improves system performance, it also greatly increases the number of model parameters. Inference with such complex models places very high demands on hardware, such as large memory consumption and heavy computational cost, which is extremely detrimental to real-world deployment, especially on resource-constrained devices such as mobile terminals. How to efficiently compress audio-visual systems is therefore a problem of great concern to both academia and industry. This thesis proposes a model pruning method based on iterative fine-tuning and the lottery ticket hypothesis for designing compact audio-visual systems, and applies pruning to audio-visual voice activity detection and audio-visual wake word spotting within the cross-modal teacher-student learning information fusion framework. Pruning is first applied to each single modality, i.e., the audio and video systems separately, and the results show the effectiveness of the proposed method. Model pruning is then performed on the proposed cross-modal teacher-student learning information fusion framework. Experimental results show that the proposed pruning method achieves substantial model compression while maintaining performance no lower than that of the original unpruned network. The thesis also validates the approach on different network architectures and data types, and further proposes a pruning approach for end-to-end audio-visual fusion systems. The results show that the proposed method can provide potential product-level solutions for deploying audio-visual models.
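To make the cross-modal teacher-student objective described above concrete, the sketch below shows one common form of a KL-regularized cross-entropy loss in PyTorch. The direction of distillation (an audio-only teacher trained on large-scale data guiding the audio-visual student), the interpolation weight, and the temperature are assumptions made for illustration, not necessarily the exact formulation used in this thesis.

```python
import torch
import torch.nn.functional as F


def kl_regularized_cross_entropy(student_logits, teacher_logits, labels,
                                 alpha=0.5, temperature=2.0):
    """Cross-entropy on hard labels plus a KL term towards a teacher's soft outputs.

    The weighting `alpha` and `temperature` are illustrative assumptions.
    """
    # Supervised term on the (limited) labeled audio-visual data.
    ce = F.cross_entropy(student_logits, labels)
    # KL term that transfers knowledge from the teacher, e.g. an audio-only
    # model trained on abundant single-modality data.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return (1.0 - alpha) * ce + alpha * kl


# Hypothetical usage with a batch of 4 examples and 2 classes (e.g. speech / non-speech).
student_logits = torch.randn(4, 2, requires_grad=True)
teacher_logits = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
loss = kl_regularized_cross_entropy(student_logits, teacher_logits, labels)
loss.backward()
```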
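The pruning recipe described above can likewise be sketched at a high level. The code below uses PyTorch's built-in magnitude-pruning utilities to alternate pruning and fine-tuning; the per-round sparsity, the number of rounds, and the `finetune_fn` callback are hypothetical placeholders, and the sketch omits the weight-rewinding step associated with the lottery ticket hypothesis.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune


def iterative_pruning_with_finetuning(model, finetune_fn, rounds=5, amount_per_round=0.2):
    """Iteratively removes low-magnitude weights and fine-tunes in between.

    `finetune_fn(model)` is a hypothetical callback that retrains the current
    sparse model; the schedule (5 rounds, 20% per round) is an assumption.
    """
    # Collect the weight tensors to prune (here: all linear and conv layers).
    params = [
        (m, "weight")
        for m in model.modules()
        if isinstance(m, (nn.Linear, nn.Conv1d, nn.Conv2d))
    ]
    for _ in range(rounds):
        # Remove the smallest-magnitude weights globally across the network.
        prune.global_unstructured(
            params, pruning_method=prune.L1Unstructured, amount=amount_per_round
        )
        # Fine-tune so the remaining weights can recover the lost accuracy.
        finetune_fn(model)
    # Fold the binary masks into the weights before deployment.
    for module, name in params:
        prune.remove(module, name)
    return model
```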