As a fundamental task in computer vision, video semantic segmentation aims to assign each pixel in video frames a semantic class, yielding pixel-level scene parsing results. It is widely and urgently required in applications such as intelligent transportation and autonomous driving. Unlike image data, video data contains rich temporal information and object motion patterns, which provide an important prior for semantic analysis. However, video data often involves complex content, and complete annotations are hard to obtain, which leads to difficulties in model training and high computational cost in deployment. Video semantic segmentation research should therefore take advantage of video data while also addressing these challenges.

In recent years, deep learning has achieved great success in video semantic segmentation, yet current deep learning based methods still fall short in practical application scenarios. According to the target application scenario, existing video semantic segmentation methods can be broadly divided into two categories. For efficiency-first scenarios, existing methods use optical flow to model pixel-level correspondences between a key frame and the current frame, and reuse key-frame features through feature propagation, which avoids feature extraction on the current frame and improves computational efficiency. For accuracy-first scenarios, existing methods also use feature propagation to temporally align adjacent frames, and then improve segmentation accuracy by fusing multiple features. Evidently, feature propagation is the core technique underlying existing methods.

However, in practice, optical flow estimation is error-prone in common situations such as poorly textured and occluded areas. Existing video semantic segmentation methods therefore face the following challenges under erroneous optical flow: 1) in efficiency-first scenarios, key-frame features are distorted during feature propagation, directly leading to inaccurate segmentation results; 2) in accuracy-first scenarios, erroneous optical flow misaligns features from adjacent frames and introduces noise into feature fusion, reducing the quality of the fused feature. Beyond these two typical scenarios, label-scarce scenarios are also common in practice due to labeling cost. Since existing methods usually rely on a large number of labeled video samples for training, unlabeled video data cannot be utilized effectively and models also suffer from overfitting, which degrades performance. This dissertation focuses on developing effective deep learning based algorithms to solve the above problems. The main contributions are summarized as follows:

(1) To tackle the feature distortion issue in efficiency-first scenarios, a distortion-aware feature correction method is proposed. The method exploits the distortion patterns shared between the image and feature domains to accurately locate distorted areas of propagated features. It then extracts the necessary information from the current frame with a lightweight model and uses that information to correct the located distorted areas. Experiments demonstrate that, compared with existing methods, the method significantly improves segmentation accuracy at low computational cost, especially when the feature propagation distance is large.
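To make the propagation-and-correction pipeline concrete, the following is a minimal, illustrative PyTorch sketch. The flow-guided warping via `torch.nn.functional.grid_sample` is the standard feature propagation primitive in this line of work; the correction step, and all names here (`warp`, `propagate_and_correct`, `distortion_map`, `corr_feat`), are hypothetical illustrations of the idea described above, not the dissertation's actual implementation.

```python
# Illustrative sketch of flow-guided feature propagation with a
# distortion-aware correction step. Names and interfaces are hypothetical.
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Warp key-frame features to the current frame with a backward flow field.

    feat: (B, C, H, W) key-frame feature map
    flow: (B, 2, H, W) optical flow from current frame to key frame, in pixels
    """
    b, _, h, w = feat.shape
    # Build a base sampling grid of pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device),
        torch.arange(w, device=feat.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).float()              # (2, H, W)
    coords = grid.unsqueeze(0) + flow                        # displaced coordinates
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(feat, sample_grid, align_corners=True)

def propagate_and_correct(key_feat, flow, distortion_map, corr_feat):
    """Reuse key-frame features, then repair the located distorted regions.

    distortion_map: (B, 1, H, W) in [0, 1], high where propagation is
        unreliable (e.g., predicted by a small network from image-domain
        warping error -- an assumption for illustration).
    corr_feat: (B, C, H, W) correction features from a lightweight model
        run on the current frame.
    """
    warped = warp(key_feat, flow)
    # Keep propagated features where reliable; correct where distorted.
    return (1 - distortion_map) * warped + distortion_map * corr_feat
```

The efficiency argument rests on the correction model being much smaller than the full backbone, so locating and repairing distorted regions stays cheaper than re-extracting current-frame features.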
(2) To tackle the feature misalignment issue in accuracy-first scenarios, a feature enhancement method based on spatio-temporal fusion and memory-augmented refinement is proposed. The method considers two perspectives: improving multi-frame feature fusion and exploring single-frame feature enhancement. First, a transformer-based spatio-temporal fusion module is proposed, which adaptively fuses pixel features at different spatio-temporal positions and avoids error-prone optical flow estimation. Second, a memory-augmented refinement module is proposed to store typical features (boundary features and class prototypes) from training samples. During inference, the module adjusts error-prone features by moving them towards the corresponding class prototypes, which improves their discriminability. Experiments demonstrate that the method effectively improves the segmentation accuracy of different baseline segmentation models.

(3) To tackle the low utilization of unlabeled data and the model overfitting issue in label-scarce scenarios, a semi-supervised learning method based on inter-frame feature reconstruction is proposed. The method uses unlabeled-frame features to reconstruct labeled-frame features, and then uses the semantic labels to supervise the learning of the reconstructed features, which provides indirect semantic supervision on unlabeled video data (see the sketch at the end of this section). By exploiting the internal relevance of video data, the method effectively utilizes unlabeled video data to assist model training and alleviates the overfitting issue. Experiments demonstrate that, compared with existing methods, the method achieves a significant improvement in segmentation accuracy, especially in label-scarce scenarios.

In summary, this dissertation dives deeply into the video semantic segmentation task. Considering the characteristics and practical challenges of typical real-world scenarios, we propose several effective deep learning based methods. Extensive experiments demonstrate that, compared with existing works, the proposed methods significantly boost segmentation accuracy across different application scenarios and alleviate the dependence on large amounts of labeled data. In general, the proposed methods make video semantic segmentation algorithms more applicable in real-world scenarios, which is of great value for practical applications.
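To illustrate the inter-frame feature reconstruction idea in contribution (3), here is a minimal sketch. It assumes an attention-based reconstruction with cosine similarity and a softmax temperature; the exact similarity measure, normalization, and loss weighting in the dissertation may differ, and all names (`reconstruct_labeled_feature`, `seg_head`, `lambda_recon`) are hypothetical.

```python
# Illustrative sketch of inter-frame feature reconstruction for
# semi-supervised training. Formulation details are assumptions.
import torch
import torch.nn.functional as F

def reconstruct_labeled_feature(unlabeled_feat, labeled_feat, temperature=0.1):
    """Reconstruct labeled-frame features from unlabeled-frame features.

    Each labeled-frame pixel feature is rebuilt as an attention-weighted sum
    of unlabeled-frame pixel features, so gradients of the supervised loss
    flow back into the unlabeled branch.

    unlabeled_feat, labeled_feat: (B, C, H, W)
    returns: (B, C, H, W) reconstructed labeled-frame features
    """
    b, c, h, w = labeled_feat.shape
    q = F.normalize(labeled_feat.flatten(2), dim=1)    # (B, C, HW) queries
    k = F.normalize(unlabeled_feat.flatten(2), dim=1)  # (B, C, HW) keys
    v = unlabeled_feat.flatten(2)                      # (B, C, HW) values
    # Cosine-similarity attention from labeled pixels to unlabeled pixels.
    attn = torch.softmax(q.transpose(1, 2) @ k / temperature, dim=-1)  # (B, HW, HW)
    recon = (v @ attn.transpose(1, 2)).view(b, c, h, w)
    return recon

# Training step (sketch): segment the reconstructed feature and supervise it
# with the labeled frame's ground truth, giving the unlabeled frame an
# indirect semantic training signal. seg_head and lambda_recon are hypothetical.
# recon = reconstruct_labeled_feature(feat_unlabeled, feat_labeled)
# loss_recon = F.cross_entropy(seg_head(recon), labels)
# loss = loss_sup + lambda_recon * loss_recon
```

The design point this sketch captures is that the labeled frame's annotation supervises features that are built entirely from unlabeled-frame content, which is one way to realize the indirect supervision described above.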