
Multi-modal Video Scene Segmentation Algorithm Based On Deep Network

Posted on: 2021-09-19    Degree: Master    Type: Thesis
Country: China    Candidate: X H Su    Full Text: PDF
GTID: 2518306467471794    Subject: Master of Engineering
Abstract/Summary:
As the basis of multimedia information classification and recognition, video scene segmentation is a key step in content-based video retrieval and plays an important role in understanding video data. It takes the shot as the basic unit and groups similar shots into the same scene according to shot features and temporal correlation. Traditional scene segmentation methods rely only on low-level video features and do not fully exploit the semantic information contained in the video content, which limits segmentation accuracy. Chasanis et al. applied spectral clustering over low-level color features, labeled shots according to the cluster of their representative key frames, and used the Needleman-Wunsch (NW) algorithm to detect scene boundaries from the alignment scores of the resulting symbol sequences; however, when two adjacent scenes are visually similar and follow the same editing pattern, this approach is prone to segmentation errors. Sidiropoulos et al. introduced a Scene Transition Graph (STG) approximation that exploits features from both the visual and auditory channels; however, shots from different scenes that happen to share similar visual and auditory features may be clustered together, which is not conducive to scene segmentation.

To address the problems of low-level color-feature clustering, STG-based shot-similarity approximation, and multi-feature fusion, and exploiting the temporally correlated co-occurrence among video modalities, this thesis presents a multi-modal video scene segmentation algorithm based on a deep network. First, the multi-modal features of the video are extracted: in addition to rich low-level visual, audio, and text features for each shot, semantic concept features covering both visual concepts and text concepts are incorporated. Second, a triplet deep network architecture is proposed to select triplets for the triplet loss and train a suitable convolutional neural network model. Then the concatenated feature vector of each shot is used as input to the triplet deep network to learn an embedding space, and semantic similarity is measured by the distance between the embedded feature vectors of two shots. Finally, shots are clustered by minimizing the sum of squared distances within each time segment, yielding scenes at the semantic level.

Experimental results show that the proposed algorithm segments video scenes effectively and achieves good segmentation accuracy. The comprehensive F-measure reaches 86.24%, which is 12.17% higher than the method of Chasanis et al. (low-level color features combined with the NW algorithm) and 8.96% higher than the method of Sidiropoulos et al. (visual and audio features combined in an STG). Recall and precision reach 85.83% and 86.81%, respectively.
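The abstract does not give implementation details of the triplet loss used to train the embedding network. As a minimal sketch, assuming the common triplet margin formulation with squared Euclidean distance (the function name, margin value, and toy embeddings below are illustrative, not taken from the thesis):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss: pull same-scene shot embeddings together and
    push different-scene embeddings apart by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)   # squared distance anchor -> positive
    d_neg = np.sum((anchor - negative) ** 2)   # squared distance anchor -> negative
    return max(d_pos - d_neg + margin, 0.0)

# Toy 2-D embeddings: the positive lies near the anchor, the negative far away,
# so the triplet is already well separated and the loss is zero.
a = np.array([1.0, 0.0])
p = np.array([1.1, 0.0])
n = np.array([-1.0, 0.0])
print(triplet_loss(a, p, n))  # -> 0.0
```

During training, triplets with zero loss contribute no gradient, which is why triplet-selection (mining) strategies such as the one the thesis proposes matter in practice.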
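The final step, minimizing the sum of squared distances within each time segment, can be read as an optimal contiguous segmentation of the temporally ordered shot embeddings. A minimal dynamic-programming sketch of that idea, assuming a known number of scenes `k` (the function name and toy data are illustrative; the thesis may use a different formulation):

```python
import numpy as np

def segment_shots(embeddings, k):
    """Split a temporally ordered shot sequence into k contiguous scenes,
    minimizing the total within-scene sum of squared distances to each
    scene's mean embedding (dynamic programming over boundaries)."""
    X = np.asarray(embeddings, dtype=float)
    n = len(X)

    # cost[i][j]: sum of squared distances of shots i..j to their mean
    cost = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            seg = X[i:j + 1]
            cost[i][j] = np.sum((seg - seg.mean(axis=0)) ** 2)

    # dp[m][j]: best cost of splitting shots 0..j into m+1 scenes
    dp = np.full((k, n), np.inf)
    back = np.zeros((k, n), dtype=int)
    dp[0] = cost[0]
    for m in range(1, k):
        for j in range(m, n):
            for s in range(m, j + 1):          # scene m starts at shot s
                c = dp[m - 1][s - 1] + cost[s][j]
                if c < dp[m][j]:
                    dp[m][j] = c
                    back[m][j] = s

    # Recover the scene boundaries by backtracking.
    bounds, j = [], n - 1
    for m in range(k - 1, 0, -1):
        s = back[m][j]
        bounds.append(s)
        j = s - 1
    return sorted(bounds)  # indices where a new scene begins

# Toy 1-D "embeddings": two visually distinct runs of shots.
emb = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
print(segment_shots(emb, 2))  # -> [3]: a scene boundary before shot index 3
```

The temporal constraint is what distinguishes this from ordinary k-means-style clustering: shots can only join a scene with their immediate neighbors, so scene boundaries fall exactly where the embedding distance jumps.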
Keywords/Search Tags:scene segmentation, multi-modal, deep network embedding, triple loss learning, time-constrained clustering