
Research On Video Semantic Parsing Based On Deep Learning

Posted on: 2024-07-31
Degree: Master
Type: Thesis
Country: China
Candidate: M X Zhou
GTID: 2568307157482224
Subject: Master of Electronic Information (Professional Degree)
Abstract/Summary:
As an advanced form of visual semantic parsing, video description aims to describe the semantic content of a video in a clear and unambiguous sentence. In recent years, with the rapid development of Internet technology, the spread of mobile phones and consumer camera equipment has made it easy to produce, store and upload video data quickly. However, this data is huge in volume and disorganized. How to use machines to analyze the semantic information in videos quickly and efficiently, to organize and manage this mass of video data effectively, and to provide a useful reference for tasks such as classification and retrieval has become an urgent problem in the field of computer vision. At present, descriptions generated for the video description task still suffer from incomplete and inaccurate semantic content. In addition, the lack of Chinese training samples makes Chinese description more difficult.

To address these problems, this thesis studies video semantic parsing based on deep learning. First, to solve the problem of incomplete and inaccurate semantic information in generated English descriptions, a video description method based on sentence semantics and length loss calculation is proposed. Then, building on the English video description method, a Chinese video description method based on cosine attention and semantic selection is proposed, which addresses semantic redundancy and semantic inaccuracy in Chinese descriptions. Finally, building on the work on Chinese video description, a Chinese video description method based on adaptive feature selection and fusion is proposed, which further improves the semantic quality of the generated Chinese descriptions. The content and contributions of this thesis are as follows:

1. To address the semantic incompleteness and inaccuracy of generated English descriptions, a video description method based on sentence semantics and length loss calculation is proposed. First, a new length loss function is designed that adaptively adjusts the error penalty by measuring the distance between the predicted sentence length and the reference sentence length, so that the model can learn the optimal distribution of description lengths for highly similar visual content, thereby improving the semantic completeness of the generated descriptions. Second, a description generation loss function based on sentence semantics is designed. By comparing the predicted and reference descriptions at the sentence level, the model iteratively converges on the optimal sentence-level semantic description, thereby improving the semantic accuracy of the generated descriptions. The method is evaluated on two datasets, MSVD and MSR-VTT. The performance metrics improve significantly and all exceed those of current advanced models. The BLEU@4 and METEOR metrics improve particularly strongly on both datasets, indicating that the method is very effective at improving the completeness and accuracy of the semantic information in the descriptions.

2. To address inaccurate semantics in generated Chinese descriptions, a Chinese video description method based on cosine attention and semantic selection is proposed. First, a scaled cosine attention network is designed that computes the cosine similarity between the query and key matrices and then scales it with learnable parameters, so that the model can adaptively focus on the correct visual semantic features and generate semantically correct descriptions. Second, in the decoding stage, a semantic selection network is designed to filter out the redundant information produced by fusing visual semantic features with sentence semantic features, reducing interference and improving the semantic accuracy of the model. Finally, the English video description dataset MSVD is extended into the Chinese dataset MSVD-C, and experiments are carried out on this dataset. The results show that both the metrics and the actual generated descriptions are significantly better than those of other advanced models, which shows that the method can not only focus accurately on visual semantic features but also filter redundant information.

3. To address inaccurate semantics in generated Chinese descriptions, a Chinese video description method based on adaptive feature selection and fusion is also proposed. First, a feature selection network is designed that uses an attention mechanism to focus on important features and then uses a gating mechanism to selectively retain or discard features, thereby improving the semantic accuracy of the model. Second, an adaptive dynamic fusion mechanism is designed that dynamically fuses the visual and motion feature vectors by computing weight coefficients for the motion features, reducing the interference of redundant information and further improving the semantic accuracy of the model. Finally, experiments were also carried out on the Chinese dataset MSVD-C. The results show that both the metrics and the actual generated descriptions are significantly better than those of other advanced models, with particularly significant improvements in BLEU@4 and CIDEr-D. This indicates that, when reducing the temporal dimension, the method not only avoids losing important information but also adaptively fuses visual and motion features to avoid redundant information.

In summary, this thesis solves the problem of semantic incompleteness and inaccuracy in the video description task and achieves remarkable results in both Chinese and English. The results provide strong support for technological development in the field of video semantic parsing and are also expected to promote the development of multilingual natural language processing.
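The length loss in the first contribution is described only as a penalty that adapts to the distance between the predicted and reference sentence lengths. As a rough illustration, the sketch below multiplies an average token-level loss by a factor that grows with the relative length gap. The multiplicative form, the function names and the `beta` weight are all assumptions made for this example; the abstract does not give the actual formula.

```python
def length_penalty(pred_len, ref_len, beta=1.0):
    """Hypothetical penalty factor: 1.0 when lengths match, larger as they diverge."""
    if ref_len <= 0:
        raise ValueError("ref_len must be positive")
    # Relative distance between predicted and reference sentence lengths.
    return 1.0 + beta * abs(pred_len - ref_len) / ref_len


def length_aware_loss(token_nll, pred_len, ref_len, beta=1.0):
    """Scale the mean per-token negative log-likelihood by the length penalty.

    token_nll: list of per-token loss values for one generated sentence.
    This is an assumed combination, not the loss defined in the thesis.
    """
    if not token_nll:
        raise ValueError("token_nll must not be empty")
    base = sum(token_nll) / len(token_nll)
    return base * length_penalty(pred_len, ref_len, beta)
```

With this form, a prediction whose length matches the reference is not penalized beyond its token-level loss, while one that is too short or too long is penalized in proportion to the gap.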
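The scaled cosine attention in the second contribution computes cosine similarity between queries and keys and then scales it before weighting the values. A minimal pure-Python sketch of that idea follows. It assumes a single attention head, a single scalar `scale` standing in for the learnable parameters, and a softmax over the scaled similarities; none of these details are confirmed by the abstract.

```python
import math


def l2_normalize(vec, eps=1e-8):
    """Return vec divided by its Euclidean norm (guarded against zero norm)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_cosine_attention(queries, keys, values, scale=10.0):
    """Attend over `values` using scaled cosine similarity of queries and keys.

    `scale` plays the role of the learnable scaling parameter; here it is a
    fixed float because this sketch has no training loop.
    """
    q_unit = [l2_normalize(q) for q in queries]
    k_unit = [l2_normalize(k) for k in keys]
    outputs, weights = [], []
    for q in q_unit:
        # Dot product of unit vectors equals cosine similarity, in [-1, 1].
        sims = [sum(a * b for a, b in zip(q, k)) for k in k_unit]
        w = softmax([scale * s for s in sims])
        out = [sum(wj * v[i] for wj, v in zip(w, values))
               for i in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights
```

Because queries and keys are normalized first, the attention weights depend only on the direction of the feature vectors, not on their magnitude, and the scale controls how sharply the weights concentrate on the most similar keys.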
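The adaptive dynamic fusion in the third contribution fuses visual and motion feature vectors using a weight coefficient computed from the motion features. One simple way to realize such a mechanism is a sigmoid gate, sketched below. The gate form, the convex combination and the parameter names `w` and `b` are assumptions for illustration, not the network described in the thesis.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1]."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_fuse(visual, motion, w, b=0.0):
    """Fuse two equal-length feature vectors with a motion-dependent weight.

    alpha is computed from the motion features only (hypothetical linear
    projection `w`, bias `b`), then used as a convex mixing coefficient.
    """
    if not (len(visual) == len(motion) == len(w)):
        raise ValueError("visual, motion and w must have the same length")
    alpha = sigmoid(sum(wi * mi for wi, mi in zip(w, motion)) + b)
    fused = [(1.0 - alpha) * v + alpha * m for v, m in zip(visual, motion)]
    return fused, alpha
```

In a trained model `w` and `b` would be learned, so that clips with informative motion push `alpha` toward 1 and largely static clips keep the fused vector close to the visual features.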
Keywords/Search Tags:video description, sentence semantics, sentence length loss computation, scaled cosine attention, semantic selection, redundant information, feature selection