Recently, deep convolutional neural networks (DCNNs) have demonstrated excellent performance on a wide range of computer vision tasks and have largely replaced traditional machine learning methods such as SVMs. However, current DCNNs depend heavily on large-scale human-annotated datasets, which are time-consuming and expensive to collect. As a result, existing datasets often contain a limited number of images and fall far short of the capacity of modern CNN models. Self-supervised learning, a subset of unsupervised learning, has attracted increasing attention because it can train models without human annotations, and it has already shown its potential by surpassing its supervised counterparts on many computer vision tasks. In this work, we propose two novel and effective self-supervised learning paradigms that address two common challenges in contrastive learning. First, we find that existing contrastive learning methods provide only spatial consistency, not temporal consistency, which can cause instability during training and eventually lead to catastrophic forgetting. To this end, we propose temporal teacher consistency (TTC), which introduces temporal consistency into contrastive models through temporal teachers and uses a knowledge transformer to adaptively learn the importance of the different teachers. We further propose a temporal loss that maximizes the mutual information between the student and the temporal teachers. Second, we propose a self-supervised algorithm based on residual augmentations (ResAug) to alleviate the over-reliance of contrastive learning on data augmentations. Residual augmentations comprise two feature-space augmentations, self residual and cross residual, which improve the intensity and the diversity of data augmentation, respectively. This algorithm can generate numerous and diverse augmented samples with little computational overhead. Finally, we provide extensive experiments and analyses to verify the effectiveness of TTC and ResAug. Both algorithms achieve state-of-the-art performance under the linear evaluation protocol on ImageNet, and transfer learning on object detection and instance segmentation on MS COCO and PASCAL VOC demonstrates the generalizability and practicality of our methods.
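The abstract does not specify how the temporal teachers are constructed; a common way to obtain teachers that lag the student by different "temporal distances" is to keep several exponential-moving-average (EMA) copies of the student's weights with different momenta. The sketch below is an illustrative assumption in that spirit, not the authors' implementation (parameters are flattened to one vector for simplicity):

```python
import numpy as np

class TemporalTeacher:
    """EMA snapshot of the student's parameters. A larger momentum makes the
    teacher track the student more slowly, i.e. it represents an older state."""

    def __init__(self, params: np.ndarray, momentum: float):
        self.momentum = momentum
        self.params = params.copy()  # teachers receive no gradient updates

    def update(self, student_params: np.ndarray) -> None:
        # teacher <- m * teacher + (1 - m) * student
        self.params = self.momentum * self.params \
            + (1.0 - self.momentum) * student_params

# Maintain several teachers at different temporal distances from the student.
student_params = np.zeros(16)
teachers = [TemporalTeacher(student_params, m) for m in (0.9, 0.99, 0.999)]
for step in range(100):
    student_params = student_params + 0.01  # stand-in for an optimizer step
    for t in teachers:
        t.update(student_params)
```

After training steps, the three teachers hold progressively staler versions of the student, which is the property TTC's consistency objective relies on.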
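The temporal loss is described only as maximizing the mutual information between student and teacher representations. A standard lower bound on mutual information used for this purpose in contrastive learning is InfoNCE; the sketch below shows that bound for one student/teacher pair (an illustrative stand-in, not the paper's exact loss):

```python
import numpy as np

def info_nce(student: np.ndarray, teacher: np.ndarray, tau: float = 0.1) -> float:
    """InfoNCE loss between L2-normalized embeddings.

    student, teacher: (N, D) arrays where row i of each encodes sample i, so
    the positives lie on the diagonal of the similarity matrix. Minimizing
    this loss maximizes a lower bound on the mutual information between the
    two representations.
    """
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / tau                        # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))     # cross-entropy on positives
```

With multiple temporal teachers, one such term per teacher could be combined with weights produced by the knowledge transformer.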