
Multiple Target Tracking And Video Sequence Temporal Alignment For Social Scene Understanding

Posted on: 2018-08-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Wang
Full Text: PDF
GTID: 1368330563995796
Subject: Computer Science and Technology
Abstract/Summary:
Over the past few decades, with the rapidly increasing power of microprocessors and artificial intelligence techniques, robots have dramatically increased their potential as flexible automation tools. The primary goal is to enable a robot to learn and reason about how to behave in response to complex goals in a complex world, such as a human social scene. In a social scene, humans frequently interact with one another while sending visible social signals such as facial expressions, gestures, and gaze movements. Such scenes are very common in daily life, and artificial agents are increasingly entering them. Classic scene understanding has focused on the structural context that characterizes geometric relationships in a scene. For artificial agents to cohabit these scenes with humans as collaborating team members, they must also understand social context, such as what people cognitively attend to and with whom they are interacting.

Social context in a scene is often time-varying and emerges in the form of motion. Individual motions are primary social signals that arise spontaneously during social interaction. Each individual's motion may affect the motion of others and vice versa; within this complex interaction, the group motion reflects their agreement. Individual and group motions are therefore highly correlated with social context, and motion estimation is a key component of understanding it. In this thesis, we establish a computational basis for understanding the relationship between individual motion and social saliency.

Accurate, robust tracking of multiple targets is vital for any further study of individual motion, yet simultaneously keeping track of multiple targets in a crowded environment remains a challenging task in computer vision, owing to frequent occlusions and a varying number of targets. Moreover, 3D reconstruction of a dynamic scene from features observed by more than one camera usually requires synchronization and correspondence among the cameras; finding correspondences is challenging, especially between views separated by a wide baseline, and camera motion further increases the complexity of the problem. With social scene understanding as the goal, this dissertation studies both multiple target tracking and the synchronization of video sequences from freely moving cameras, and makes the following contributions:

(1) A novel multiple target tracking algorithm based on sparse coding is proposed, whose input is a set of candidate regions in each frame obtained from a state-of-the-art object detector. The method adapts to the changing appearance of objects caused by occlusion, illumination changes, and large pose variations by incorporating an ℓ1-minimization-based appearance model into maximum a posteriori (MAP) inference (see the appearance-model sketch after the abstract). The robust tracking performance of the approach is validated in a comprehensive evaluation involving several challenging sequences and comparisons with state-of-the-art multiple object trackers.
(2) A figure-aware multiple target tracking method is proposed, which incorporates figure/ground repulsive forces into a simultaneous detectlet classification and clustering framework. Without depth or disparity, fine-grained trajectlets tend to cause under-segmentation of similarly moving objects or over-segmentation of articulated objects into rigid parts. Pose estimation, though not always accurate, is often sufficient to segment the human torso from its background and to induce figure/ground repulsions, which reduce the risk of both under-segmentation and over-segmentation. Figure-aware mediation encodes this repulsive segmentation information in trajectory affinities and provides more reliable model-aware information for detectlet classification. Experimental results show that the approach can track objects through sparse, inaccurate detections, persistent partial occlusions, deformations, and background clutter.

(3) Assuming that features are tracked throughout the entire sequences and that feature correspondences across sequences are known, a joint spatio-temporal video alignment algorithm is proposed. The sequences are recorded by uncalibrated cameras, either stationary or freely moving, observing the same dynamic scene. We generate pulse images by tracking moving objects and examining their trajectories for changes in speed, and we integrate a rank-based constraint with the pulse-based matching to derive a robust approximation of the spatio-temporal alignment quality for all pairs of frames. By folding both spatial and temporal cues into a single alignment framework, the nonlinear temporal image-to-image mapping is established using a graph-based approach. Experimental results show that the proposed approach outperforms existing techniques.

(4) A novel method is proposed for synchronizing an arbitrary number of videos of the same dynamic scene captured by freely moving cameras; it can use both short and long feature trajectories. Point correspondences across sequences are not required, and different points may even be tracked in different sequences, provided that every 3D point tracked in the second sequence can be described as a linear combination of a subset of the 3D points tracked in the first sequence. Assuming the 3D pose of each camera is known for every frame, we first reconstruct the 3D trajectory of each moving point with a trajectory-basis method, computing the trajectory coefficients for each sequence separately. We then propose a robust rank constraint on the coefficient matrices to measure the spatio-temporal alignment quality for every feasible pair of video fragments, and finally recover the optimal temporal mapping with a graph-based approach (illustrative sketches of such a rank-based cost and of the graph search follow the abstract). We verify the robustness and performance of the proposed approach on synthetic data as well as on challenging real video sequences.

Building on this study of multiple target tracking and the synchronization of freely moving cameras, we integrate the above algorithms into a social scene understanding system and build two prototypes: camera-centered 2.5D map estimation and 3D reconstruction of social saliency. These prototype applications demonstrate the feasibility and benefits of the integrated solutions.
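To make the ℓ1-minimization-based appearance model of contribution (1) concrete, here is a minimal sketch assuming each tracked target keeps a dictionary of appearance templates. The choice of scikit-learn's Lasso as the ℓ1 solver, the non-negativity constraint, and all names (l1_appearance_score, candidate, templates, alpha) are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_appearance_score(candidate, templates, alpha=0.01):
    """Score a candidate detection against one target's template dictionary.

    candidate: flattened, intensity-normalised image patch, shape (d,).
    templates: columns are the target's past appearance templates, shape (d, m).

    The candidate is reconstructed as a sparse, non-negative combination of
    the templates via an l1-regularised least-squares fit; a small
    reconstruction error means the candidate resembles this target, so the
    negative error can serve as an appearance term inside a MAP
    data-association objective.
    """
    lasso = Lasso(alpha=alpha, positive=True, fit_intercept=False, max_iter=5000)
    # scikit-learn minimises (1/(2*d)) * ||candidate - templates @ w||^2 + alpha * ||w||_1
    lasso.fit(templates, candidate)
    reconstruction = templates @ lasso.coef_
    error = float(np.linalg.norm(candidate - reconstruction) ** 2)
    return -error
```

In a full tracker, one such score per target/candidate pair would be combined with motion and detection-confidence terms in the MAP objective, and the template dictionary would be updated online as the target's appearance changes.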
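The rank constraint used for measuring alignment quality in contributions (3) and (4) can be illustrated with a small numerical sketch. It assumes that each sequence yields a matrix of trajectory-basis coefficients over a candidate fragment and uses the energy of the trailing singular values of the concatenated matrices as a cost; the function name, the shapes, and this particular cost are assumptions for illustration, not the dissertation's exact formulation.

```python
import numpy as np

def rank_deficiency_cost(C1, C2, rank_tol=1e-3):
    """Alignment cost for one candidate pairing of two video fragments.

    C1: (k x p1) trajectory coefficients of the points tracked in sequence 1.
    C2: (k x p2) trajectory coefficients of the points tracked in sequence 2,
    computed independently over the candidate fragment (shapes hypothetical).

    If the fragments are truly aligned and every point of sequence 2 is a
    linear combination of points of sequence 1, the columns of C2 lie in the
    column space of C1, so concatenating the matrices should not raise the
    rank. The relative energy of the trailing singular values is therefore
    used as the cost: lower means better temporal alignment.
    """
    stacked = np.hstack([C1, C2])
    s = np.linalg.svd(stacked, compute_uv=False)
    r = np.linalg.matrix_rank(C1, tol=rank_tol * s[0])
    tail_energy = np.sum(s[r:] ** 2)
    return float(tail_energy / (np.sum(s ** 2) + 1e-12))
```

Evaluating such a cost for every feasible fragment pair yields a cost matrix over which the graph-based search for the temporal mapping can operate.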
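The graph-based search for the optimal temporal mapping can likewise be sketched as a shortest-path, dynamic-programming problem over the pairwise costs, enforcing a monotone mapping. The step set and the DTW-style formulation below are illustrative assumptions; the dissertation's actual graph construction may differ.

```python
import numpy as np

def best_temporal_mapping(cost):
    """Recover a monotone frame-to-frame mapping of minimum accumulated cost.

    cost: (M x N) matrix with cost[i, j] the alignment cost of frame (or
    fragment) i of sequence 1 against frame j of sequence 2, e.g. produced
    by rank_deficiency_cost. Dynamic programming with steps (i-1, j-1),
    (i-1, j) and (i, j-1) enforces a monotone, roughly diagonal mapping,
    similar in spirit to dynamic time warping.
    """
    M, N = cost.shape
    acc = np.full((M, N), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(M):
        for j in range(N):
            if i == 0 and j == 0:
                continue
            prev = min(
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
            )
            acc[i, j] = cost[i, j] + prev
    # Backtrack from the final cell to read off the temporal mapping.
    i, j = M - 1, N - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((c for c in candidates if c[0] >= 0 and c[1] >= 0),
                   key=lambda c: acc[c])
        path.append((i, j))
    return path[::-1]
```

The monotonicity assumption reflects the fact that both cameras observe the same dynamic scene running forward in time; allowing only these three steps keeps the mapping nonlinear yet order-preserving.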
Keywords/Search Tags:Social scene understanding, First-person camera, Multiple target tracking, Video alignment, Camera synchronization