Because human actions are diverse and environments are complex, a single representation-level feature is limited in its ability to describe the content of actions in videos. The extraction methods and the fusion of multiple representation-level features have therefore become important factors that restrict the discriminability of the feature combination. At the same time, in the classification stage of the representation-level feature combination, a classification network trained on the extreme learning machine principle has poor robustness: when changeable actions combined with complex backgrounds introduce a large amount of interference information into the video data, the recognition performance of the overall method degrades significantly. To address the insufficient content information represented by a single feature, the low discriminability of concatenated combinations of multiple features, and the sensitivity of the traditional extreme learning machine classification network to interference information, this paper starts from the extraction of feature distribution information. It then studies the weighted fusion of different representation-level features in action recognition and the poor robustness of the network weights of the extreme learning machine. The specific research work is as follows.

Firstly, to address the limited use of distribution information between features, two feature representation methods are proposed, one containing local position distribution information and the other containing global contour information in the image. The limitation of measuring the coding contribution coefficients of visual words by Euclidean distance alone is considered: constraints on the differences in length and angle between visual words and descriptor features are incorporated into the objective function of the coding method, and a double-constrained coding method is derived. Based on the two information representation
methods, layered coefficient weighting and double-constrained coding are combined to generate word group features, which encode the feature distribution information in the video space.

Secondly, to address the problem that column-wise expansion destroys the regional structure information in the pyramid grid, the pyramid visual histogram is re-divided on the basis of the word group features to construct a three-level spatial distribution model of a single visual word. The three levels are trained independently: the distribution map of each level is combined with the auto-encoder principle to train local and global receptive fields, and auto-encoding convolutional features are extracted. The auto-encoding receptive field network is built by concatenating the word group features and the convolutional features, which supplements the inter-region structural information of the pyramid model that is missing from the word group features.

Thirdly, to address the problem that the auto-encoding receptive field network cannot weight the word group features and the convolutional features, which limits the overall discriminability of the feature combination, a three-hidden-layer weighted learning network is constructed using overall and recursive training, in which an addition operation on the direct connections between the hidden layers achieves approximate weighting of the features. As a simplification of this network, a three-hidden-layer fusion extreme learning machine is constructed using independent training, in which a dot product operation on the direct connections between the hidden layers is used for feature weighting. Through experimental comparison, the feature fusion network model with the best overall performance is obtained, which provides a basis for improving the accuracy of the fusion classification of word group features and auto-encoding convolutional features.

Finally, aiming
at the problem that the weights of a feature fusion network trained on the traditional extreme learning machine principle are easily affected by interference information, a correntropy constraint is introduced into the objective function of weight training. The Lagrange multiplier method is used to derive an iterative update formula that satisfies the minimization constraint. To overcome the poor network applicability and long training time of this iterative scheme, constraints on the differences in length and angle between the hidden-layer feature distributions are constructed to modify the objective function, and a non-iterative weight update formula under the double constraints is derived. By training and updating the weights of each layer in the feature fusion network model, the robustness of the overall network is enhanced, which further improves the recognition rate of the overall method for changeable human actions in complex backgrounds.
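
To make the robustness issue in the last contribution concrete, the following minimal NumPy sketch contrasts the classic closed-form extreme learning machine solution for the output weights with a generic correntropy-style variant that is solved by iterative reweighting. It is an illustration under stated assumptions, not the update formulas derived in this work: the function names, the sigmoid activation, the Gaussian kernel width `sigma`, the ridge parameter `reg`, and the half-quadratic reweighting scheme are all assumptions chosen for the example.

```python
import numpy as np

def elm_hidden(X, W, b):
    """Hidden-layer output of a basic ELM with fixed random weights (sigmoid assumed)."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def train_elm(H, T, reg=1e-2):
    """Classic ELM output weights: closed-form ridge least squares (not robust)."""
    n_hidden = H.shape[1]
    return np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ T)

def train_elm_correntropy(H, T, reg=1e-2, sigma=1.0, n_iter=20):
    """Output weights under a Gaussian-kernel correntropy loss (illustrative).

    Generic iteratively reweighted scheme: each pass solves a weighted
    ridge problem in which samples with large residuals get small weights,
    so interference samples influence the solution less.
    """
    n_hidden = H.shape[1]
    beta = train_elm(H, T, reg)                   # start from the plain solution
    for _ in range(n_iter):
        residual = T - H @ beta
        err2 = np.sum(residual ** 2, axis=1)      # squared error per sample
        w = np.exp(-err2 / (2.0 * sigma ** 2))    # Gaussian kernel weights
        Hw = H * w[:, None]                       # rows of H scaled by their weights
        beta = np.linalg.solve(H.T @ Hw + reg * np.eye(n_hidden), Hw.T @ T)
    return beta
```

Because the reweighted variant needs several linear solves instead of one, it also illustrates why an iterative correntropy update costs more training time than a non-iterative formula.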