| As spiritual and cultural life becomes richer, demand for film and television entertainment has grown steadily, and short online videos such as micro movies and sitcoms have surged in popularity. In this environment, predicting where viewers direct their gaze while watching these videos is valuable. For example, an advertising company can analyze the gaze points in a video to place advertised products more effectively and more reasonably. Depending on whether the audience is given relevant information about the film in advance, viewing can be divided into two modes: free viewing and visual question answering. Almost none of the existing human gaze datasets for dynamic scenes target film and television works, so they lack the special effects, long shots, and other visual factors characteristic of such works; most were also collected under free viewing, with no prior information given to the subjects. Our research finds that existing free-viewing saliency models predict poorly when salient objects appear or disappear, or when multiple salient objects appear at the same time. Visual question answering datasets, meanwhile, are usually difficult to collect and small, and existing models adapt poorly to small datasets. To address these problems, this work studies the prediction of human gaze points in film and television works. The specific research content is as follows. (1) Visual Free-viewing Eye-tracking Database for Videos (Video VFE) and Visual Question-directed Eye-tracking Database for Videos (Video VQE). We collected dynamic stimuli, designed the corresponding questions in advance, and conducted eye-movement experiments in both free-viewing and visual question answering modes. After data processing, we obtained a dataset covering 14 subjects. By visualizing the eye-movement data, we obtained peak maps and heat maps. We observed how objects orient the human visual system under free viewing and how the emotional content of a video influences viewers, and we revealed the relationship between subjects' answers and their attention in visual question answering. (2) Fixation Prediction Model for Movie in Free Viewing (Movie FV). The model combines top-down and bottom-up attention mechanisms and consists of three parts. The video temporal information module, built on a CNN-LSTM framework, extracts basic features from video frames and models their temporal characteristics through ConvLSTM. The context information extraction module uses the Inception architecture and dilated convolution to capture global and local context within a frame at multiple scales. The saliency map fusion module uses a binary classification network adapted from VGG-16 to weigh the importance of the two saliency maps. On our self-collected dataset, we compare our model with five existing models; the experimental results show that our model's predictions are more accurate. (3) Fixation Prediction Model for Movie in Visual Question Answering (Movie VQA). We added a neural attention module to the original VGG-16 feature extractor; by fine-tuning the neural attention module separately, general video features and movie-specific attention features can be combined. Experimental results show that the proposed model outperforms three existing models on our self-collected dataset. |
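
The heat maps mentioned in part (1) are commonly built by placing a Gaussian at each recorded fixation and summing the results. The abstract does not say how its maps were actually computed, so the following is only a minimal sketch of that common approach. The function name `fixation_heat_map`, the `sigma` parameter, and the small grid size are illustrative assumptions, not details taken from the work.

```python
import math


def fixation_heat_map(fixations, width, height, sigma=2.0):
    """Build a heat map from (x, y) fixation points on a width x height grid.

    Hypothetical helper: sums one isotropic Gaussian per fixation and then
    scales the result so that the highest value is 1.0.
    """
    heat = [[0.0] * width for _ in range(height)]
    for fx, fy in fixations:
        for y in range(height):
            for x in range(width):
                # Squared distance from this pixel to the fixation point.
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                heat[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    peak = max(max(row) for row in heat)
    if peak > 0:
        # Normalize so maps from different frames are comparable.
        heat = [[v / peak for v in row] for row in heat]
    return heat


# Two fixations at the same location outweigh a single fixation elsewhere.
heat = fixation_heat_map([(3, 2), (3, 2), (8, 5)], width=12, height=8)
```

Pooling the fixations of all subjects for one frame before calling the function would give a per-frame map of group attention. Plotting the same values as a surface rather than a color overlay is one plausible reading of the "peak map" named in the text, though that interpretation is also an assumption.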