| People express their emotions and opinions in many ways, both in teaching and on social platforms, for example through pictures, videos, and audio. Multimodal information supports more accurate analysis, but it also brings challenges. Although multimodal data is rich in information, the emotions expressed by different modalities may be inconsistent, and the quality of each modality's representation learning directly affects how well the modalities can be fused. Moreover, the modal sequences are sampled at different rates, and exploiting the alignment information between modalities poses a further challenge for inter-modal fusion methods. It is therefore important to strengthen the representation learning of single-modal information and to develop efficient inter-modal fusion methods. This dissertation mainly uses deep learning for research on multimodal sentiment analysis and for applications in analyzing classroom learning status. To address the problems in multimodal sentiment analysis and fuse emotional information more effectively, we propose a multimodal model that integrates the intra-modal and inter-modal relationships of different modalities. First, for the different forms of input data, a multi-head attention mechanism learns the feature representation within each single modality and performs information fusion between modalities. Second, an Attention on Attention (AoA) module is added to enhance the conventional multi-head attention mechanism, improving the quality of single-modal feature extraction and strengthening the interaction between modalities. Finally, we evaluate the model on the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis. Experiments show that our model outperforms existing methods, achieving binary accuracies of 85.3%, 85.4%, 83.2%, and 82.6% on CMU-MOSEI and CMU-MOSI under the unaligned and aligned settings, respectively. The performance analysis experiments
on the MOSEI dataset show that jointly considering intra-modal representation learning and inter-modal information fusion yields better results. Furthermore, introducing the AoA module improves model accuracy by up to 0.5%. Smart education currently attracts much attention. This dissertation uses recent AI and deep learning techniques to analyze three key elements that subtly affect how well students absorb knowledge: student concentration, teaching quality, and the difficulty of the material. On this basis, it presents a preliminary implementation of a smart classroom system. In addition, to assess how vivid a teacher's delivery is, we analyze the teacher's movement amplitude with the OpenPose gesture recognition model and the teacher's emotional intensity with a bimodal emotion model. We also collect teacher videos to build a small-scale bimodal dataset containing both visual and audio data, which we use to verify the effectiveness of the bimodal model. Results from practical application show that the deep learning models preliminarily meet the project requirements. |
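
To make the Attention on Attention (AoA) idea mentioned in the abstract concrete, the following is a minimal single-head sketch in plain Python. It is not the dissertation's model. It assumes the commonly published AoA formulation, in which the query and the attended result are concatenated, mapped to an information vector and a sigmoid gate, and multiplied element-wise. The function names, the feature dimension, the sequence lengths, and the random weights are all illustrative assumptions. The query and key sequences have different lengths on purpose, to mimic cross-modal attention between unaligned sequences.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention. q: (n, d); k and v: (m, d)."""
    d = len(q[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d)
                  for k_row in k]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out


def linear(w, b, x):
    """Affine map w @ x + b, with w given as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(w_row, x)) + bi
            for w_row, bi in zip(w, b)]


def aoa(q, k, v, w_info, b_info, w_gate, b_gate):
    """AoA sketch: gate an information vector built from [query; attended]."""
    attended = attention(q, k, v)
    out = []
    for q_row, a_row in zip(q, attended):
        x = q_row + a_row                      # concatenation, length 2d
        info = linear(w_info, b_info, x)       # information vector
        gate = [1.0 / (1.0 + math.exp(-z))     # sigmoid attention gate
                for z in linear(w_gate, b_gate, x)]
        out.append([g * i for g, i in zip(gate, info)])
    return out


def rand_matrix(rows, cols):
    """Hypothetical stand-in for learned weights or extracted features."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]


random.seed(0)
d = 4                                  # illustrative feature dimension
text = rand_matrix(3, d)               # e.g. 3 text steps (assumed)
audio = rand_matrix(5, d)              # e.g. 5 audio frames (assumed)
w_info, w_gate = rand_matrix(d, 2 * d), rand_matrix(d, 2 * d)
b_info, b_gate = [0.0] * d, [0.0] * d

# Text queries attend over audio keys and values, then pass through AoA.
out = aoa(text, audio, audio, w_info, b_info, w_gate, b_gate)
print(len(out), len(out[0]))
```

Because the gate lies strictly between 0 and 1, each output element is a damped copy of the corresponding information value, which is how the module can suppress attended results that are not relevant to the query. A multi-head variant would apply the same gating to the concatenated head outputs.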