
Research on Classroom Emotion Recognition Based on Audio and Text

Posted on: 2021-04-08    Degree: Master    Type: Thesis
Country: China    Candidate: G X Yi    Full Text: PDF
GTID: 2427330605958670    Subject: Computer application technology
Abstract/Summary:
Emotions play an important role in human decision-making, interaction, and cognition. There is a strong demand for technical means that can automatically and accurately recognize human emotions and thereby provide effective support for human decision-making. In recent years, with the successful application of deep learning in fields such as image, text, and audio processing, many researchers have applied these techniques to emotion recognition. The classroom is an important application scenario: researchers hope to use classroom data to let machines recognize the emotions of teachers and students automatically, use classroom emotions to reflect the state of learning, and assist teachers in teaching intervention, that is, to convert students' emotions into decision-making suggestions that help teachers teach more precisely. For teachers, this supports post-class reflection and can also serve as a basis for evaluating teaching quality. In addition, an accurate portrait of classroom emotions would promote objective evaluation of the classroom.

Research on classroom emotion recognition is still relatively scarce. First, some existing studies rely on visual or physiological signals, which are difficult and costly to collect. Second, the recognition methods used are mostly traditional machine learning methods based on statistical theory. Third, the data modalities employed are relatively limited, and because of the complexity of emotion, effective recognition from a single modality remains difficult. Since the interaction between teachers and students in classroom teaching is mainly verbal, this study aims to build an emotion recognition model with high recognition accuracy using the audio and text produced during teacher-student communication.

To carry out classroom emotion recognition based on audio and text, this thesis completes the following work:

(1) Recent domestic and international research on audio and text emotion recognition is surveyed, covering emotion theories, data set construction methods, emotion recognition methods, and multi-modal fusion methods, which provides the theoretical basis for the subsequent research.

(2) A classroom emotion recognition data set is designed for audio and text emotion recognition tasks. First, classroom teaching videos from the same region, grade, and subject are selected from the "One Teacher, One Excellent Class" public education platform. Second, the audio is separated from the videos and pre-processed in batches. Third, endpoint detection is applied to the audio, the audio is segmented into samples at the detected endpoints, and the Baidu speech recognition API is used to obtain the text content of each sample. Finally, multi-person text correction and emotion labeling are performed. The result is a dual-modal classroom emotion recognition data set containing more than 8,000 paired audio and text samples. A sketch of the endpoint-based segmentation step is given below.
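As a minimal sketch of the segmentation step, assuming the classroom audio has already been extracted to WAV files: the example below uses librosa's energy-based splitting purely as an illustrative stand-in for the endpoint detector used in the thesis, and the `top_db` threshold, sampling rate, and minimum segment length are assumed values, not the thesis's settings.

```python
# Segment a classroom recording into utterance-level samples by endpoint detection.
# Energy-based splitting is used here only as an illustrative stand-in for the
# endpoint detector described in the thesis; thresholds are assumptions.
import librosa
import soundfile as sf

def split_by_endpoints(wav_path, top_db=30, min_len_s=0.5, out_prefix="sample"):
    y, sr = librosa.load(wav_path, sr=16000, mono=True)   # 16 kHz mono, typical for ASR input
    intervals = librosa.effects.split(y, top_db=top_db)   # non-silent [start, end) sample ranges
    paths = []
    for i, (start, end) in enumerate(intervals):
        if (end - start) / sr < min_len_s:                 # drop fragments too short to label
            continue
        path = f"{out_prefix}_{i:04d}.wav"
        sf.write(path, y[start:end], sr)
        paths.append(path)
    return paths
```

Each resulting clip would then be transcribed (here, via the Baidu speech recognition API), manually corrected, and labeled with an emotion category to form one audio-text pair of the data set.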
(3) Audio emotion recognition models are designed around different audio features to carry out classroom emotion recognition from audio. A temporal model is built on MFCC and prosodic features and a spatiotemporal model on spectrogram features, and both are evaluated on the classroom emotion recognition data set. Experiments show that each model has its own strengths: the temporal model combining MFCC and prosodic features gives the best result on the neutral emotion, while the spatiotemporal model using spectrogram features gives the best result on the silent emotion.

(4) Classroom text emotion recognition is carried out based on the XLNet pre-trained model. First, the two key text preprocessing tasks of Chinese word segmentation and Chinese word representation are described. Second, the XLNet model, which achieves state-of-the-art results on multiple NLP tasks, is introduced and used to implement emotion recognition for classroom text. Third, a comparison of four text emotion recognition models shows that the XLNet-L12 model improves accuracy by about 7 percentage points over the plain recurrent network model. Finally, the audio and text results are compared: the text modality outperforms the audio modality in overall accuracy, but each modality has advantages on particular emotion categories, which motivates combining their strengths through modal fusion.

(5) Multi-modal emotion recognition under a feature-level fusion strategy is explored, and an improved attention-based fusion method is proposed. First, three multi-modal fusion strategies are compared and analyzed, and the feature-level strategy is selected for the audio-text fusion task of classroom emotion recognition. Then, a shallow fusion model and an attention-based fusion model are designed. Finally, the shortcomings of the attention-based fusion method are addressed and an improved attention fusion model is proposed. Experiments show that the shallow fusion method still lags behind other studies on the public data set, while the improved attention fusion method achieves the best performance. On the classroom emotion recognition data set, the recognition rates of the shallow fusion model, the attention-based fusion model, and the improved attention fusion model increase in that order, and all three fusion models outperform either single modality. The improved attention fusion model improves on the audio modality by about 11 percentage points and on the text modality by about 3 percentage points, showing that multi-modal fusion has clear advantages over a single modality for classroom emotion recognition. A sketch of feature-level attention fusion is given below.
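As a minimal PyTorch sketch of feature-level fusion with an attention weighting over the two modalities, assuming an audio encoder and a text encoder already produce fixed-length embeddings: the embedding dimensions, the single-layer attention design, and the number of emotion classes are illustrative assumptions, and the thesis's improved attention mechanism itself is not reproduced here.

```python
# Feature-level fusion of audio and text embeddings with a simple attention
# weighting over the two modalities. Dimensions, layer sizes, and class count
# are illustrative assumptions, not the configuration used in the thesis.
import torch
import torch.nn as nn

class AttentionFusionClassifier(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, hidden_dim=256, num_classes=4):
        super().__init__()
        # Project both modalities into a shared space before fusion.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Scalar attention score per modality, normalized with softmax.
        self.attn = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio_feat, text_feat):
        # audio_feat: (batch, audio_dim), text_feat: (batch, text_dim)
        modal = torch.stack(
            [torch.tanh(self.audio_proj(audio_feat)),
             torch.tanh(self.text_proj(text_feat))], dim=1)   # (batch, 2, hidden_dim)
        weights = torch.softmax(self.attn(modal), dim=1)       # (batch, 2, 1), sums to 1 over modalities
        fused = (weights * modal).sum(dim=1)                   # attention-weighted sum of modalities
        return self.classifier(fused)

# Usage with random stand-in features:
model = AttentionFusionClassifier()
logits = model(torch.randn(8, 128), torch.randn(8, 768))       # (8, num_classes)
```

In such a design, the attention weights let the classifier lean on whichever modality is more informative for a given sample, which is consistent with the abstract's observation that audio and text each recognize different emotion categories better.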
Keywords/Search Tags:deep learning, classroom emotion recognition, audio emotion recognition, text emotion recognition, multi-modal fusion