Font Size: a A A

Vision-Based Feature Learning,Optimization And Semantic Analysis

Posted on:2022-02-22Degree:MasterType:Thesis
Country:ChinaCandidate:Y CaoFull Text:PDF
GTID:2518306554968589Subject:Master of Engineering
Abstract/Summary:PDF Full Text Request
In recent years,semantic parsing is a hot spot in the scope of computer vision.Learning the deep expression of visual information through convolutional neural network,the method has become mature,but the high-dimensional expression of visual information is different from human's understanding.Therefore,the efficiency of human-computer interaction is improved by semantic analysis of visual information,and the interpretability of robotics,visual retrieval systems are enhanced.The semantic analysis of visual information is inseparable from the features learning and optimization.To overcome the problem that hard samples are difficult to converge in feature learning,this paper proposes a hard samples learning strategy based on neural network.According to the issue of effective feature extraction in video,an effective video feature expression is designed through a custom feature evaluation function and feature optimization algorithm.In the domain of video caption,feature extraction and learning methods are used to analysis the semantic of visual information.The contributions of this paper are as follows:1.Feature learning is the basis for effective analysis of visual semantic information.To improve the learning efficiency of hard samples and reduce noise interference caused by superfluous hard samples in deep hash algorithm,a generic strategy called Loss to Gradient for hard sample learning is proposed.First,a non-uniform gradient normalization method is designed to elevate the learning capability of models for hard samples.Back propagation gradients are weighted by calculating the loss ratio between hard samples and all samples.Furthermore,a weighted random sampling method is designed for accuracy improvement with superfluous hard samples.According to the loss,training samples are weighted and under-sampled for noise filtering and a small numbers of hard samples are retained to avoid over-fitting.Based on open datasets,the average accuracy of hash feature retrieval is increased by 4.7% and 3.3%,respectively.Experimental results show that the improved method outperforms other benchmarking methods for the accuracy,proving that the feature representation of hard samples in the dataset is effectively learned.2.The relevance of visual feature and human semantic understanding is improved by feature optimization.To obtain video face recognition features that are not affected by angle and illumination blur,a feature optimization algorithm based on the minimum relational distance error is proposed.First,in view of the defect that the traditional average pooling method cannot effectively distinguish the significance of features during feature fusion,this paper calculates the neurons and connection weights in the static facial feature model to evaluate the effective information of the features.To distinguish the significance of different facial image features,the relationship distance error evaluation function is constructed by combining the effective information and the relationship between features.Then,an unsupervised feature optimization algorithm is proposed based on the evaluation function.The weight of features is automatically modified by the feedback of the evaluation function,and the weight of interference image features is reduced to extract a more effective representation of face in video.Compared with the average pooling method,the method proposed in this paper increases the top-1 rate on the YTF,IJB-A,and IQIYI datasets by 1.03%,3.48%,and 2.44%,respectively.Experiments have proved that the discriminative features of face in video are extracted by the proposed method and the accuracy of video face recognition is improved.3.Video caption algorithms usually input the entire video as temporal features,ignoring semantic differences between different time periods,which lead to temporal semantic conflicts.To solve this issue,a temporal semantic segmentation method is proposed.First,a convolutional neural network is used to extract the spatial features of each frame of the video.Then,feature clustering is performed according to the similarity of adjacent features in time series to ensure that subsets of similar features have the same subject characteristics.Next,filter the noise features according to the proportion of the feature subset and the video feature set.Finally,the most appropriate semantic subset is chose as the input features of the video caption through the feature evaluation function.In the MSVD and MSR-VTT databases,the METEOR score in this paper has increased by1.19% and 2.53% respectively compared with the reference method.Experiments have proved that the temporal features of video are effectively divided by proposed method and the accuracy of video caption is improved.
Keywords/Search Tags:computer vision, deep learning, deep hashing, face recognition, video understanding
PDF Full Text Request
Related items