Sentiment is one of the most fundamental characteristics distinguishing intelligent life from other life forms, and it is an integral part of daily conversation. In this thesis, sentiment analysis is applied to English teaching to help English learners read English aloud more expressively. Sentiment analysis models can be broadly divided into two categories: unimodal and multimodal. Unimodal research uses only raw audio signals or text, whereas multimodal research leverages both audio signals and lexical information, and in some cases visual information as well.

Speech emotion analysis is a difficult task owing to the complexity of emotions, and its performance depends heavily on the effectiveness of the emotional features extracted from speech. This thesis proposes dual attention-based bidirectional long short-term memory networks (DABLSTM), which exploit the strengths of raw audio signals by extracting log mel-spectrograms and MFCCs from the audio simultaneously. Experiments on the IEMOCAP database demonstrate the advantage of the proposed approach: the average recognition accuracy is 70.29% in unweighted accuracy (UA), an improvement of 1.06% over the best baseline methods, and 70.98% in weighted accuracy (WA), 2.88% higher than existing methods.

In multimodal sentiment analysis, existing models usually perform forced word alignment before neural network training to handle unaligned multimodal sequential data. This thesis instead designs the Cross-modal Attention Mechanism with Sentiment Prediction Auxiliary Task (CAM-SPAT) model, which requires no forced word alignment. The core of CAM-SPAT is a weighted cross-modal attention mechanism, which not only captures the temporal correlation and spatial dependence information of each modality but also dynamically adjusts the weights of the text modality and the other modalities to better recognize different emotional expressions. Our model sets a new state-of-the-art record on the CMU-MOSI dataset, with noticeable performance improvements on all metrics. On the CMU-MOSEI dataset, our model achieves the best results among all models on the 7-class classification and regression tasks, and falls below only the DISRFN model (which uses aligned data) in the accuracy and F1 score of binary classification, demonstrating the strong performance of the proposed method. In addition, the overall performance of the model is evaluated on a dataset of English learners' reading pronunciation collected from our school, with satisfactory results.
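The weighted cross-modal attention at the heart of CAM-SPAT can be illustrated with a minimal numpy sketch: text-modality queries attend over unaligned audio-frame keys and values, and a modality weight blends the attended audio context back into the text representation. The shapes, projection matrices, and the blending weight `alpha` below are illustrative assumptions, not the thesis's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text, audio, w_q, w_k, w_v, alpha=0.7):
    """Text queries attend over audio keys/values; no word alignment needed.

    alpha is a hypothetical modality weight blending the text residual
    against the attended audio context.
    """
    q = text @ w_q                              # (T_text, d) queries
    k = audio @ w_k                             # (T_audio, d) keys
    v = audio @ w_v                             # (T_audio, d) values
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (T_text, T_audio)
    context = softmax(scores) @ v               # audio summarized per text token
    return alpha * text + (1.0 - alpha) * context

rng = np.random.default_rng(0)
d = 8
text = rng.standard_normal((5, d))    # 5 text tokens
audio = rng.standard_normal((12, d))  # 12 audio frames (unaligned with text)
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
fused = cross_modal_attention(text, audio, w_q, w_k, w_v)
print(fused.shape)  # (5, 8): one fused vector per text token
```

Because the attention weights are computed from the data, each text token draws on whichever audio frames are most relevant to it, which is what lets the model skip the forced word-alignment step.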