
Research On Instance Segmentation For Video Scene

Posted on: 2022-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z T Huang
Full Text: PDF
GTID: 2518306569497444
Subject: Computer technology
Abstract/Summary:
With the development of the Internet and artificial intelligence, the processing and analysis of video data are becoming increasingly important. Studying instance segmentation methods for video sequences is therefore of great significance, as they support high-level tasks such as video understanding. Video instance segmentation integrates object detection and semantic segmentation in the video domain: it aims to extract the location and mask of every object in a video, and it offers both practical value and room for research. Compared with general image datasets, video contains many more frames, which frequently exhibit motion blur, object deformation and mutual occlusion. However, most current video instance segmentation models rely on image instance segmentation techniques and process each frame independently. They fail to exploit the spatial-temporal information in video, which weakens the object features and lowers the precision of the instance masks. In addition, video instance segmentation involves learning multiple tasks, such as detection, segmentation and association, so designing an end-to-end multi-task network structure is itself a difficult problem.

To address the weak object features and insufficient segmentation precision in video scenes, this thesis designs a feature extraction network based on spatial-temporal feature fusion. The spatial-temporal feature fusion module consists of multi-scale feature fusion and detected-object feature fusion. The multi-scale feature fusion processes the multi-scale outputs of the Feature Pyramid Network: through rescaling, element-wise addition of semantically similar channel features and residual connections, it globally fuses semantic features of different spatial scales. The detected-object feature fusion applies the scaled dot-product attention mechanism to compute the correlation matrix between detected object features in other frames and the current frame's features, then fuses the detected object features by weighted addition, adding temporal information to the current frame. This module strengthens the feature response of objects and suppresses background interference in the video scene.

On top of the spatial-temporal feature fusion network, a two-stage video instance segmentation algorithm is proposed to jointly learn detection, segmentation and association. First, region proposals are extracted by the Region Proposal Network. Then PointRend is introduced to improve segmentation quality, the bounding boxes of the proposals are refined in the detection head, and the instance matching score is computed with dot-product similarity and a softmax classifier in the association head. The model is trained with a multi-task loss function to improve precision.

In addition, because the two-stage algorithm is relatively slow and cannot meet the requirements of real-world video prediction, a fast single-shot video instance segmentation version is also designed. Deformable convolution is introduced, and a single-shot network based on YOLACT replaces the time-consuming Region Proposal Network, removing the proposal generation stage. Assembling each instance mask as a simple linear combination of image-level mask prototypes and mask coefficients further speeds up model training and prediction.
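The following is a minimal sketch of the detected-object feature fusion described above: scaled dot-product attention between current-frame object features and object features gathered from other frames, followed by weighted addition. The function name, tensor shapes and the residual form are illustrative assumptions, not the thesis' actual implementation.

```python
import torch
import torch.nn.functional as F

def fuse_object_features(current_feats, reference_feats):
    """Fuse reference-frame object features into the current frame.

    current_feats:   (N_cur, C) detected-object features from the current frame
    reference_feats: (N_ref, C) detected-object features from other frames
    returns:         (N_cur, C) temporally enhanced current-frame features
    """
    d = current_feats.size(-1)
    # Scaled dot-product correlation matrix between current and reference objects.
    attn = current_feats @ reference_feats.t() / d ** 0.5      # (N_cur, N_ref)
    weights = F.softmax(attn, dim=-1)
    # Weighted addition of reference features onto the current-frame features.
    return current_feats + weights @ reference_feats

# Example usage with random features (assumed 256-dimensional).
cur = torch.randn(5, 256)    # 5 detected objects in the current frame
ref = torch.randn(12, 256)   # 12 detected objects pooled from other frames
fused = fuse_object_features(cur, ref)   # -> shape (5, 256)
```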
Finally, a series of experiments on the YouTube-VIS and KITTI MOTS video datasets verifies the effectiveness of the proposed algorithms in comparison with other methods.
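As a complement, here is a minimal sketch of the YOLACT-style mask assembly used by the single-shot version: each instance mask is a linear combination of shared image-level mask prototypes weighted by per-instance coefficients. The shapes and names are assumptions for illustration only.

```python
import torch

def assemble_masks(prototypes, coefficients):
    """prototypes:   (K, H, W) image-level mask prototypes
       coefficients: (N, K)    per-instance mask coefficients
       returns:      (N, H, W) instance masks with values in [0, 1]"""
    k, h, w = prototypes.shape
    # Linear combination of prototypes, one mask per detected instance.
    masks = coefficients @ prototypes.view(k, h * w)   # (N, H*W)
    return torch.sigmoid(masks.view(-1, h, w))

# Example usage with assumed sizes.
protos = torch.randn(32, 138, 138)     # 32 prototypes at 138x138 resolution
coeffs = torch.randn(10, 32)           # coefficients for 10 detected instances
masks = assemble_masks(protos, coeffs) # -> shape (10, 138, 138)
```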
Keywords/Search Tags: computer vision, object detection, semantic segmentation, instance segmentation, video instance segmentation