The Application Of Barrage Comments In The Detection Of Highlight Clips In Variety Show Videos

Posted on: 2021-03-06
Degree: Master
Type: Thesis
Country: China
Candidate: L Zheng
GTID: 2518306302974229
Subject: Applied Statistics
Abstract/Summary:
The development of online audio-visual services has enriched people's entertainment and made barrage comments increasingly popular. From another perspective, barrage comments can be regarded as weak annotations of the video content by multiple annotators, because the comments users send are closely related to the content of the clips that interest them while watching. The large volume of accumulated barrage comments contains rich video semantic information, so barrage data can help with video semantic understanding. This research has two main goals: first, to explore the application of barrage data to highlight segment detection in variety shows; second, to study multi-modal detection of variety show highlight segments that combines barrage features with other kinds of features.

For the first goal, this article proposes a sliding window with a time delay: the word embeddings of the original barrage text within a certain length of time after a video frame's timestamp are extracted and averaged to give the barrage text embedding of that frame. In addition, 7-dimensional barrage structure features are generated: the average barrage length, the barrage density, the number of barrage comments in special colors, the number in special font sizes, the number of likes, the number of dislikes, and the number of replies. The barrage text features and structure features are used as input to a boundary sensitive network for detecting highlights in variety shows.

For the second goal, this article proposes a stacked boundary sensitive network. The boundary sensitive network is modified and its temporal evaluation module is stacked: separate convolutional models are trained on different features, each outputting action, start, and end probability sequences, which are combined by weighted averaging into a multi-modal probability sequence. Extreme values are then extracted from the probability sequence. By combining timestamps with a high start probability or end probability, a set of candidate proposals is generated, and the confidence score of each proposal is calculated from the probability sequence features of that proposal. Finally, post-processing such as non-maximum suppression is performed, and the top 200 candidate proposals by confidence score for each video are retained as the model's predictions.

In the empirical study, this article selected 93 long videos from 18 variety shows broadcast on the domestic integrated video platform iQiyi during 2015-2018 and used video image features and barrage features for multiple groups of comparisons. The highlights of these variety shows were professionally annotated by the Baidu Brain team in the Video Highlight (VH) dataset. The dataset also provides video image features extracted with the ResNet101 network, which were used for image feature modeling in this study.

In data preprocessing, the crawled barrage data were cleaned to remove abnormal comments. In view of the characteristics of barrage text, emoji in the text were replaced with synonymous words. A word segmentation algorithm segmented the barrage text, barrage word vectors were generated from pre-trained 200-dimensional Chinese word vectors, and these were averaged to obtain a barrage sentence vector. A sliding window with a delay length of 5 seconds was then used to average the barrage sentence vectors in the window and construct the barrage semantic features of each video frame. In addition, a sentiment dictionary was used to extract 7 emotional tendencies of the barrage, which were summarized as structural features.

Five sets of experiments were designed to make the comparison intuitive: uniform random (random guessing), an image boundary sensitive network, a barrage boundary sensitive network, a boundary sensitive network with concatenated features, and the stacked boundary sensitive network. These networks were trained by fine-tuning. The pre-trained model is derived from the VH data: a boundary sensitive network trained on the image features of 1,300 videos. The models were evaluated with five-fold cross-validation; that is, the five test-fold results of each model are concatenated to form its final output.

To evaluate the models more fully, two indicators were used. The first is mean average precision (mAP), widely used in object detection, which measures the combined recall and precision of the model's candidate proposals under different temporal overlap thresholds. The second is the area under the average recall (AR) curve, which assesses the recall of each model's highest-confidence proposals.

The evaluation results showed that, first, barrage features alone can detect video highlight segments with a certain accuracy, and second, the image feature model that incorporates barrage features outperforms the model using only image features. Given the same features, the stacked boundary sensitive network outperforms the boundary sensitive network with concatenated input. Finally, to further explore the differences between barrage features and image features, this article compared and analyzed the results of the barrage boundary sensitive network and the image boundary sensitive network. The 100 highest-confidence proposals in each video were compared using the two-independent-sample Mann-Whitney U rank-sum test. The test found that barrage features are more sensitive to shorter highlight segments, and that higher barrage density and video popularity can significantly improve the quality of the barrage; in these respects, barrage features can complement image features well.

To sum up, this research reached three conclusions. First, applying barrage comments to video highlight detection is feasible and effective: using the barrage to detect highlights in variety show videos achieves a certain accuracy. Second, because audiences differ across platforms and videos, barrage quality is uneven, so the barrage should be used as a supplement to the features of the video itself, serving as feedback information from the audience. Third, for fusing barrage features with other modalities, this article proposes a stacked network strategy, which achieves better results than simple concatenation of features. This strategy is not limited to the image features used in the experiments here; for other features, such as audio or hand-crafted video features, the conclusion should be the same.
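
The delayed sliding window described above (averaging the barrage sentence vectors that fall within a fixed delay after each frame timestamp) can be illustrated with a minimal Python sketch. This is not the thesis code: the function name `frame_barrage_features`, the half-open window `[t, t + window)`, and the zero vector returned for frames with no comments are all assumptions made for illustration; only the 5-second delay length comes from the abstract.

```python
from bisect import bisect_left


def frame_barrage_features(frame_times, comment_times, comment_vecs, window=5.0):
    """Average the sentence vectors of comments posted in [t, t + window)
    for each frame timestamp t (hypothetical helper, not from the thesis)."""
    # Sort comments by posting time so each window is a contiguous slice.
    order = sorted(range(len(comment_times)), key=lambda i: comment_times[i])
    times = [comment_times[i] for i in order]
    vecs = [comment_vecs[i] for i in order]
    dim = len(vecs[0]) if vecs else 0

    features = []
    for t in frame_times:
        lo = bisect_left(times, t)           # first comment at or after t
        hi = bisect_left(times, t + window)  # first comment at or after t + window
        in_window = vecs[lo:hi]
        if in_window:
            mean = [sum(v[k] for v in in_window) / len(in_window) for k in range(dim)]
        else:
            # Assumption: frames without comments get a zero vector.
            mean = [0.0] * dim
        features.append(mean)
    return features


# Toy example with 2-dimensional vectors instead of 200-dimensional ones.
features = frame_barrage_features(
    frame_times=[0, 10, 20],
    comment_times=[1, 3, 12],
    comment_vecs=[[1, 0], [3, 2], [5, 5]],
)
```

In the toy example, the frame at 0 s averages the comments posted at 1 s and 3 s, the frame at 10 s picks up only the comment at 12 s, and the frame at 20 s has no comments in its window. Looking forward from the frame timestamp reflects the idea that viewers react to a moment shortly after it appears on screen.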
Keywords/Search Tags: Video Highlights Detection, Stacked Boundary Sensitive Network, Multi-Modal Features Fusion