In Massive Open Online Courses (MOOCs), students' emotional states have a significant impact on their learning outcomes. Numerous studies have shown a significant correlation between students' emotional states and their learning performance: a positive emotional state facilitates positive cognitive, emotional, and behavioral responses, thereby improving learning efficiency, while a negative emotional state inhibits those responses, reducing learning efficiency. Understanding and managing students' emotional states is therefore crucial for improving learning outcomes in MOOCs. This paper conducts in-depth research on feature extraction and model construction using deep learning techniques, and achieves more accurate emotion recognition through multimodal fusion of eye movement signals with the audio-visual features of learning videos. The main research work of this paper is as follows:

(1) To address the problem of indistinct emotional expression in students during learning, we propose an adaptive window partitioning method and a fine-grained feature extraction method. The adaptive method uses a complete emotional fluctuation period as the window for sample partitioning, which makes the emotional fluctuation characteristics within each sample more distinct and thereby improves sample quality. Fine-grained feature extraction is then applied to the adaptive samples to extract feature curves carrying temporal information, which represent the samples better and increase the distance and discrimination between them. Finally, a Temporal Convolutional Network (TCN) model is constructed to fully exploit the temporal information in the feature sequences; compared with a traditional Long Short-Term Memory (LSTM) model, the results show that the TCN is better suited to this task.

(2) To address the insufficient fusion of eye movement features and visual features, this paper proposes a new feature called the Feature of Coordinate Difference of Eye Movement (FCDE). This feature combines the eye movement coordinate trajectory with the video optical flow trajectory, effectively characterizing the level of students' attention. In addition, the Pixel Change Rate Sequence (PCRS) is extracted from the video images to represent the image switching speed. To address the insufficient mining of deep features and of the complementary relationships between features, a feature fusion framework called Integration of Deep and Shallow Features (IDSF) is designed. This framework uses a Feature Extraction CNN (FECNN) to extract deep features while retaining the shallow features, and fully fuses the two. Finally, through a series of experiments, an effective and optimal multimodal emotion classification model is determined, leading to an important conclusion: in MOOC learning scenarios, not only physiological signals such as eye movement features but also learning scene features, namely the image and audio features of the teaching videos, should be considered. Combining physiological and scene features improves classification accuracy.
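The three hand-crafted signals above can be illustrated with a minimal NumPy sketch. The specific choices here are assumptions for illustration only, not the paper's exact definitions: the window boundary is taken to be a local minimum of an emotion-intensity curve, FCDE is computed as a per-frame Euclidean distance between the gaze and optical-flow trajectories, and PCRS is normalised by the 8-bit pixel range.

```python
import numpy as np

def adaptive_windows(intensity, min_len=3):
    """Cut a 1-D emotion-intensity curve at local minima so that each
    window spans one complete emotional fluctuation period."""
    cuts = [0]
    for i in range(1, len(intensity) - 1):
        is_local_min = intensity[i - 1] > intensity[i] <= intensity[i + 1]
        if is_local_min and i - cuts[-1] >= min_len:
            cuts.append(i)
    cuts.append(len(intensity))
    return list(zip(cuts, cuts[1:]))  # (start, end) index pairs

def fcde(gaze_xy, flow_xy):
    """FCDE sketch: per-frame Euclidean distance between the gaze
    trajectory and the optical-flow trajectory (both T x 2 arrays).
    A small distance suggests the student is tracking the content."""
    gaze = np.asarray(gaze_xy, dtype=float)
    flow = np.asarray(flow_xy, dtype=float)
    return np.linalg.norm(gaze - flow, axis=1)

def pcrs(frames):
    """PCRS sketch: mean absolute pixel change between consecutive
    grayscale frames (T x H x W), normalised by the 8-bit range."""
    diffs = np.abs(np.diff(np.asarray(frames, dtype=float), axis=0))
    return diffs.mean(axis=(1, 2)) / 255.0
```

For example, an intensity curve with two bumps separated by a dip yields two windows, one per fluctuation period, and a full-frame cut between two grayscale frames yields a pixel change rate of 1.0.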