Feature Learning Based Human Body Detection And Analysis

Posted on: 2018-12-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Y Sheng
GTID: 1368330545468903
Subject: Control theory and control engineering
Abstract/Summary:
As one of the most active research topics in computer vision and pattern recognition, human body detection and analysis has broad application prospects and large market demand in video surveillance, intelligent driving safety, and intelligent robotics. Body detection and analysis addresses three questions: "where is the person?", "what is the person doing?", and "how many persons are there?". Analyzing the location, action, and number of persons in an image or video corresponds, respectively, to pedestrian detection, action recognition, and crowd counting in computer vision. Performance depends largely on how well the video content is understood and described, so obtaining a discriminative representation (namely, feature learning) is one of the key issues. Traditional methods mainly extract low-level descriptors (e.g., edge, color, and shape features) and combine them with bag-of-words or sparse coding models to generate the final representation vectors. Although these approaches are efficient, they still struggle to produce effective features under multi-object occlusion and complex, noisy backgrounds. Recently, many institutions and universities at home and abroad have carried out intensive research and achieved some improvements; however, some problems remain unsolved. Targeting the body detection and analysis task, this dissertation studies how to learn representative and discriminative feature descriptions by combining sparse coding, multiple kernel learning, and deep learning. The main work and contributions are summarized as follows:

(1) We propose filtered shallow-deep features for pedestrian detection. Traditional ACF detectors apply HOG+LUV feature channels combined with an AdaBoost classifier. However, using only edge and color information while ignoring high-level semantic and context features may reduce the descriptive capacity of the features. We therefore fuse the semantic segmentation feature map into the shallow feature channels of ACF, so that the shallow-deep features carry both low-level appearance and high-level semantic information. In addition, the original ACF detector applies sum pooling, which loses much useful information on the feature map. We manually design checkerboard-like filters in various directions for the convolution operation; the filtered channel responses capture more abstract, higher-level features and generate more discriminative representations. The experimental results show that the proposed detector using filtered shallow-deep features improves the accuracy of pedestrian detection.

(2) We propose the RG-MKL learning method to fuse multi-region, multi-layer deep features for action recognition. The feature description of the human body provides useful information for action recognition, but how to reasonably combine the core features of the human body with the context features of the whole image is still an open problem. We fuse the features of the two regions with a novel multiple kernel learning method that not only fuses the capacity of pre-learned classifiers but also incorporates prior knowledge about the discriminative capacity of the two-region features. In addition, we apply multi-layer deep features that combine the traditional fully-connected features of the two-stream model with high-response features of the convolutional layers, so that the features are more discriminative than those of the fully-connected or soft-max layers alone. The experimental results show that the proposed RG-MKL fusion method and the multi-region, multi-layer deep features improve the representation and classification of video actions and the action recognition accuracy.

(3) We propose direction-dependent feature pairs and a non-negative low-rank sparse coding model for action recognition. In methods based on local spatial-temporal interest points, a descriptor for a single point cannot describe the temporal and spatial relationship between two points, and traditional sparse coding may lead to inconsistent codes and information loss. We therefore take the spatial-temporal relationship between interest points into consideration. Specifically, we concatenate the descriptors of each interest point and its neighboring point, assign direction labels according to their directional relationship, and construct the direction-dependent feature pairs. The new features are more discriminative because they describe both the relative relationship in 3D space and the appearance characteristics of the context. In addition, we propose a non-negative low-rank sparse coding model to encode the new features. The low-rank term makes similar feature pairs have similar codes, and the non-negative term avoids negative codes, which have no physical meaning. The experimental results show that direction-dependent feature pairs combined with the non-negative low-rank sparse coding model improve the action recognition rate compared with traditional methods.

(4) We propose local features from the high-level semantic attribute feature map for crowd counting. Traditional regression-based methods mostly apply global foreground segmentation features to describe each video frame and ignore attribute features with high-level semantic ability. However, traditional high-level attribute methods cannot distinguish variations in the number of people, and little related research exists in the literature. Considering that the semantic segmentation feature map reflects the probability of attributes for each pixel and describes context features better, we use it to describe the image. In order to make use of the local information, we further extract locality-aware features and propose the W-VLAD coding method, which considers the differences across cluster centers. The experimental results show that using W-VLAD to encode the locality-aware features from the semantic segmentation feature map improves the discriminative power of the image representations and the accuracy of crowd counting.
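
As a rough illustration of the VLAD-style encoding mentioned in contribution (4), the sketch below implements standard VLAD in NumPy: each local descriptor is assigned to its nearest cluster center, and the residuals are summed per cluster and normalized. The abstract does not specify how W-VLAD weights the clusters, so the optional `weights` argument is a hypothetical stand-in and not the dissertation's formulation. The function and parameter names are likewise assumptions made for this example.

```python
import numpy as np

def vlad_encode(descriptors, centers, weights=None):
    """Encode local descriptors (n, d) against cluster centers (k, d) as a VLAD vector.

    `weights` is a hypothetical per-cluster weighting (length k); it is an
    assumption for illustration only. With weights=None this is plain VLAD.
    """
    k, d = centers.shape
    # Hard-assign each descriptor to its nearest center (squared Euclidean distance).
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)
    if weights is None:
        weights = np.ones(k)
    vlad = np.zeros((k, d))
    for j in range(k):
        members = descriptors[assign == j]
        if len(members):
            # Sum of residuals between the cluster's descriptors and its center.
            vlad[j] = weights[j] * (members - centers[j]).sum(axis=0)
    vlad = vlad.ravel()
    # Signed square-root normalization followed by L2 normalization.
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

# Toy usage: three 2-D descriptors, two cluster centers.
descriptors = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
code = vlad_encode(descriptors, centers)
```

The resulting vector has length k × d (here 4) and could be fed to a regressor for counting. In practice the centers would typically come from k-means over training descriptors.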
Keywords/Search Tags: body detection and analysis, semantic segmentation features, sparse coding, multiple kernel learning