Research On Mix-Supervised Learning For Human Visual Understanding

Posted on:2022-01-18

Degree:Doctor

Type:Dissertation

Country:China

Candidate:L Yang

Full Text:PDF

GTID:1488306326979929

Subject:Control Science and Engineering

Abstract/Summary:

Human visual understanding is an important part of computer vision. Because people are often the main subjects of multimedia such as images and videos, it is necessary to analyze and understand the humans in these media. Human visual understanding integrates a series of human-related tasks built on computer vision technology; analyzing multiple kinds of human information together improves the understanding of human-related content in images and videos. Existing solutions are mainly based on either single-dataset multi-task learning or a combination of multiple single-task models. However, building a single dataset with multi-dimensional annotations is difficult and costly, while combining multiple single-task models is inefficient and ignores the correlation between tasks. Therefore, to solve human visual understanding efficiently and accurately, multiple datasets are needed to complete a variety of tasks together.

Based on the idea of multi-dataset multi-task learning, this dissertation proposes a method for human visual understanding: Mix-Supervised Learning (MSL). MSL is a multi-task learning architecture with a shared backbone. It builds on a region-based convolutional network, in which different tasks share the same backbone network and the task-specific branches are processed in parallel. In this dissertation, MSL is used to train a human visual understanding model on five different datasets; the model simultaneously performs six human visual understanding subtasks: human detection, human instance segmentation, human parsing, pose estimation, DensePose estimation, and instance-level human parts detection.

To address domain adaptability and gradient competition, instance-level transfer learning and gradient equalization are proposed. To address the insufficient human detail and missing global feature representation of multi-task learning, a network suited both to constructing human geometric context information and to enhancing global semantic information is proposed. To address the target diversity of multiple datasets, a module that inspires receptive fields using a spatial attention mechanism is proposed. To verify the scalability of MSL, this dissertation also proposes a new human visual understanding subtask and systematically establishes a large-scale, accurately annotated dataset. Finally, in terms of the efficiency and accuracy of human visual understanding, MSL outperforms the current non-end-to-end multi-dataset multi-task learning methods and even leads the multiple single-task combinations method.

The main innovations of this dissertation are as follows.

(1) Aiming at the efficiency and accuracy of human visual understanding, Mix-Supervised Learning is proposed based on the idea of multi-dataset multi-task learning. MSL is a multi-task learning architecture with a shared backbone that learns human visual information from multiple datasets with a single model. MSL simultaneously completes six subtasks of human visual understanding: human detection, human instance segmentation, human parsing, pose estimation, DensePose estimation, and instance-level human parts detection. In addition, MSL introduces a Gradient Equalization strategy based on loss weighting and an Instance-level Transfer Learning strategy based on task relationships, which together overcome the problem of insufficient accuracy and efficiency in human visual understanding.

(2) Aiming at the multi-task adaptability of MSL for human visual understanding, Parsing R-CNN is proposed, which is suited to constructing human geometric and contextual information and to enhancing global semantic information. It addresses insufficient human modeling ability and missing global semantic information, and it significantly improves the accuracy of human parsing and DensePose estimation.

(3) Aiming at the multi-dataset robustness of MSL for human visual understanding, the Attention Inspiring Receptive-fields (AIR) module is proposed, based on the relationship between the spatial attention mechanism and the network receptive field. The AIR module enhances the translation invariance and scale invariance of the network, thus improving the robustness of MSL to multiple datasets.

(4) Aiming at the scalability of MSL for human visual understanding, a new human visual understanding subtask, instance-level human parts detection, is proposed, and a large-scale, accurately annotated dataset is systematically constructed for it. Instance-level human parts detection is added to MSL as a subtask, which effectively verifies its scalability.

The Mix-Supervised Learning proposed in this dissertation is ahead of the existing solutions in both efficiency and accuracy for human visual understanding. Compared with the two non-end-to-end multi-dataset multi-task learning methods, the average accuracy over the six human visual understanding subtasks is about 15% higher, and the training iterations are reduced by about 67% and 50%, respectively. Compared with the multiple single-task combinations method, the average accuracy over the six subtasks is about 10% higher, and inference is about 3.8 times faster.

In general, this dissertation systematically studies the characteristics and challenges of human visual understanding, proposes Mix-Supervised Learning based on the idea of multi-dataset multi-task learning, and makes innovative contributions on the key issues of a general model, task adaptability, data robustness, and scalability. Extensive experiments verify the efficiency and accuracy advantages of MSL for human visual understanding.
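The idea of training one model on several datasets that each annotate only some of the six subtasks can be illustrated with a small sketch. This is not the dissertation's implementation: the function name `mixed_supervised_loss`, the task keys, and the fixed per-task weights are assumptions made for illustration, and the fixed weights are only a stand-in for the Gradient Equalization strategy, whose details the abstract does not give.

```python
from typing import Dict, Optional

# The six subtasks named in the abstract (key names are illustrative).
TASKS = (
    "human_detection",
    "instance_segmentation",
    "human_parsing",
    "pose_estimation",
    "densepose_estimation",
    "parts_detection",
)


def mixed_supervised_loss(
    task_losses: Dict[str, Optional[float]],
    task_weights: Dict[str, float],
) -> float:
    """Sum the weighted losses of the tasks that are annotated for this batch.

    A value of None (or a missing key) means the batch's source dataset has
    no labels for that task, so that branch contributes nothing to the total.
    """
    total = 0.0
    for task in TASKS:
        loss = task_losses.get(task)
        if loss is None:
            continue  # no supervision for this task in this batch
        total += task_weights.get(task, 1.0) * loss
    return total


# Hypothetical batch from a dataset that labels only boxes and keypoints.
batch_losses = {"human_detection": 0.5, "pose_estimation": 2.0, "human_parsing": None}
weights = {"pose_estimation": 0.25}  # illustrative weight, not from the dissertation
print(mixed_supervised_loss(batch_losses, weights))
```

In a real region-based network the per-task values would be tensors produced by the task branches on top of the shared backbone; plain floats are used here only to keep the sketch self-contained.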

Keywords/Search Tags:

Human Visual Understanding, Multi-dataset Multi-task Learning, Mix-Supervised Learning

Related items

1	Research On Key Techniques Of Visual Semantic Understanding
2	Research On Person Re-identification Algorithms Based On Multi-task Learning
3	Multimodal Fine-grained Interaction Modeling For Textual Video Grounding
4	Semi-Supervised Multi-Task Learning Based On DFS
5	Research On Some Problems Of Visual Semantic Understanding
6	Human Detection Under Arbitrary Poses Based On Multiple Instance Learning
7	Multi-Task Joint Optimization For Visual Sentiment Prediction
8	Research On Semi-Supervised Multi-Task Learning Based On Regularization
9	A Study Of Multi-view Learning
10	Research On Person Re-identification Based On Multi-task Joint Supervised Learning