
Research On Cross-modal Audio Sentiment Classification Based On Deep Learning

Posted on: 2021-07-08    Degree: Master    Type: Thesis
Country: China    Candidate: K C Yang    Full Text: PDF
GTID: 2518306461470594    Subject: Computer technology
Abstract/Summary:
With the popularity of smartphones and the rapid rise of social media, sentiment classification, as one of the core technologies of human-computer interaction, has attracted increasing attention from researchers. Sentiment classification has been successfully applied in many scenarios, such as human-computer dialogue and autonomous driving. At present, most sentiment classification technology operates on a single modality, such as the audio or text modality. However, the information contained in a single modality is limited and easily affected by noise. This thesis focuses on audio sentiment classification and audio-text cross-modal sentiment classification. The main research contents are as follows:

(1) This thesis proposes an audio sentiment classification method based on the constant-Q chromagram. ResNet is used to extract spectral features from the constant-Q chromagram, and a Contextual Residual LSTM Attention Model is designed for the audio sentiment classification task. Most previous studies use audio feature extraction tools to extract statistical features from audio data, such as MFCCs and the zero-crossing rate, but these features ignore the temporal information that is essential to the audio modality. Therefore, this thesis uses ResNet to extract spectral features with temporal information from the constant-Q chromagram, uses a Bi-LSTM to learn the contextual information between utterances, and adopts self-attention to capture sentiment-salient information. Model comparison and feature comparison experiments on the public MOSI dataset show that the proposed method achieves the best results.

(2) This thesis proposes a heterogeneous feature fusion method for audio sentiment classification. A Residual Convolutional Model with Spatial Attention is proposed to extract context-independent spectral features from the Mel spectrogram, and a Contextual Heterogeneous Feature Fusion Model is designed to let the spectral and statistical features of the audio modality interact and to predict sentiment. In previous work, most researchers use only one kind of audio feature, either spectral or statistical. However, these features are heterogeneous and carry different levels of information. Therefore, this thesis designs a feature-collaboration attention that fuses the spectral and statistical features of the audio modality so as to capture richer emotional information. On the public MOSI and MOUD datasets, the proposed method outperforms the baseline models.

(3) This thesis proposes a cross-modal sentiment classification method for unaligned sequences. Based on the Transformer, a Self-Adjusting Fusion Representation Learning Model is proposed for unaligned cross-modal sequences. Previous work on multimodal sentiment analysis often requires aligned audio and text features; in the real world, however, the audio and text modalities are often unaligned. The proposed method learns a fusion representation directly from unaligned audio and text data and then adjusts the fusion representation using the unimodal audio and text representations respectively. On the public MOSI and MOSEI datasets, this method outperforms the benchmark models on all evaluation metrics.
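To make the idea in (3) concrete, the following is a minimal sketch of cross-modal attention over unaligned sequences, in which text queries attend to audio keys and values of a different length. The feature dimensions, layer sizes, and the module name CrossModalAttention are illustrative assumptions, not the thesis's exact Self-Adjusting Fusion Representation Learning Model.

```python
# Minimal sketch of cross-modal attention for unaligned audio-text sequences
# (dimensions and names are illustrative assumptions, not the thesis's model).
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text queries attend to audio keys/values of a different length."""
    def __init__(self, text_dim=300, audio_dim=74, model_dim=64, heads=4):
        super().__init__()
        self.q = nn.Linear(text_dim, model_dim)
        self.kv = nn.Linear(audio_dim, model_dim)
        self.attn = nn.MultiheadAttention(model_dim, heads, batch_first=True)

    def forward(self, text, audio):
        # text:  (batch, text_len, text_dim)   -- no alignment to audio needed
        # audio: (batch, audio_len, audio_dim) -- audio_len may differ
        q = self.q(text)
        kv = self.kv(audio)
        fused, _ = self.attn(q, kv, kv)   # text enriched with audio information
        return fused                      # (batch, text_len, model_dim)

# Unaligned example: 20 word vectors vs. 50 acoustic frames
fused = CrossModalAttention()(torch.randn(2, 20, 300), torch.randn(2, 50, 74))
print(fused.shape)  # torch.Size([2, 20, 64])
```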
(4) This thesis proposes a cross-modal sentiment classification method for aligned sequences. Based on the pre-trained BERT model, a Cross-Modal BERT model is proposed for aligned cross-modal sequences. Previous work fine-tunes the pre-trained BERT model using only the text modality. In this thesis, the audio modality is introduced to assist the text modality in fine-tuning the pre-trained BERT model. A masked multimodal attention is used to let the audio and text modalities interact fully, so as to dynamically adjust word weights and obtain better representations (a minimal sketch of this idea follows the concluding paragraph below). On the public MOSI and MOSEI datasets, this method outperforms the benchmark models on all evaluation metrics. In addition, this thesis visualizes the word weights and demonstrates the effectiveness of the proposed method by comparing how the weights change before and after the audio information is introduced.

Combining deep learning techniques to improve cross-modal audio sentiment classification is important for the development of artificial intelligence. The experimental results show that the methods proposed in this thesis achieve better performance on their respective tasks and have practical value. Finally, this thesis summarizes the problems encountered during the research and discusses directions for future work.
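Returning to contribution (4), the following is a minimal sketch of the masked-multimodal-attention idea, in which an audio-derived attention map re-weights the text attention over word positions. The word-aligned audio features, the element-wise combination of the two attention maps, and all dimensions and names (MaskedMultimodalAttention) are illustrative assumptions rather than the exact Cross-Modal BERT design.

```python
# Minimal sketch: audio attention re-weights text attention over word positions
# (sizes, names, and the element-wise fusion rule are illustrative assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMultimodalAttention(nn.Module):
    def __init__(self, text_dim=768, audio_dim=74):
        super().__init__()
        self.q_t = nn.Linear(text_dim, text_dim)
        self.k_t = nn.Linear(text_dim, text_dim)
        self.q_a = nn.Linear(audio_dim, text_dim)
        self.k_a = nn.Linear(audio_dim, text_dim)

    def forward(self, text, audio, pad_mask):
        # text:     (batch, seq, text_dim)  e.g. BERT token embeddings
        # audio:    (batch, seq, audio_dim) word-aligned acoustic features
        # pad_mask: (batch, seq) True at padding positions
        scale = text.size(-1) ** 0.5
        att_t = self.q_t(text) @ self.k_t(text).transpose(1, 2) / scale
        att_a = self.q_a(audio) @ self.k_a(audio).transpose(1, 2) / scale
        att_t = att_t.masked_fill(pad_mask.unsqueeze(1), float('-inf'))
        att_a = att_a.masked_fill(pad_mask.unsqueeze(1), float('-inf'))
        # combine the two attention maps so audio can raise or lower word weights
        weights = F.softmax(att_t, dim=-1) * F.softmax(att_a, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        return weights @ text   # audio-adjusted word representations

text = torch.randn(2, 16, 768)                 # toy BERT-sized token embeddings
audio = torch.randn(2, 16, 74)                 # toy word-aligned audio features
pad_mask = torch.zeros(2, 16, dtype=torch.bool)
out = MaskedMultimodalAttention()(text, audio, pad_mask)
print(out.shape)  # torch.Size([2, 16, 768])
```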
Keywords/Search Tags:Sentiment classification, Multimodal interaction, Contextual information, Attention mechanism, Pre-training model