Egocentric video analysis, the analysis of videos captured by first-person cameras, is an emerging field in computer vision. In recent years, with the availability of wearable cameras such as GoPro, Google Glass, and Microsoft SenseCam, and their growing use in recording daily life, assisted living, and smart homes, the number of egocentric videos has increased rapidly. First-person video contains a large number of object-manipulation actions, and its main challenges are irrelevant regions of interest and cluttered backgrounds. State-of-the-art egocentric video analysis methods therefore mainly use deep network frameworks that separate the hand region from the action area to perform activity recognition. In practical applications such as smart homes and assisted living, however, activity prediction offers even better prospects. Existing methods rely on synchronous sequence network frameworks, which have two limitations: first, they cannot model long-term temporal dependencies, so they cannot predict future events; second, they cannot distinguish which frames are more important to attend to, so redundant and noisy frames strongly degrade their results.

Motivated by these limitations, this thesis proposes a two-stream LSTM with asynchronous and synchronous branches, using a point process as the mathematical model. It focuses on the effect of gaze points on the long-term asynchronous event sequence and on how to handle redundant and noisy frames. Eye movements usually reflect a person's thinking process, and a person's movements follow the trajectory of eye movement to a certain extent. Accordingly, in this thesis an asynchronous event is defined as the gaze point moving into or out of the manipulated object in a video frame, which is closely related to the start or end of an activity. To counter the influence of redundant frames and noise on the results, this thesis proposes an attention-score model framework: each frame in a video sequence is assigned a score, and the decisive factor is the asynchronous event sequence. This event-modulated attention model improves experimental accuracy and enhances model robustness.

This thesis makes the following contributions: (1) a deep recurrent network model combining gaze-driven synchronous and asynchronous modules, built on the point-process conditional intensity function; (2) gaze information applied not only as a visual-saliency feature in the synchronous model but also as an asynchronous event driver, modeling the interaction between actions over long time sequences; (3) a score assigned to each frame of the input sequence to reduce the influence of redundant and noisy frames. We conducted comprehensive evaluations on two public egocentric datasets, GTEA Gaze and GTEA Gaze+. The results show that the proposed egocentric video analysis method outperforms state-of-the-art algorithms.
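The event-modulated attention idea described above can be illustrated with a minimal sketch. The code below assumes a Hawkes-style conditional intensity (base rate plus exponentially decaying influence of past gaze events); the function names, the placeholder frame features, and the specific parameter values (`mu`, `alpha`, `beta`) are illustrative assumptions, not the thesis's actual model. Each frame is scored by the event intensity at its timestamp, the scores are normalized with a softmax, and the frame features are pooled by the resulting attention weights, so frames near a gaze event dominate while redundant frames far from any event are down-weighted.

```python
import numpy as np

def intensity(t, events, mu=0.1, alpha=1.0, beta=2.0):
    """Hawkes-style conditional intensity: a base rate mu plus an
    exponentially decaying contribution from each past gaze event."""
    return mu + alpha * sum(np.exp(-beta * (t - e)) for e in events if e < t)

def event_modulated_attention(frame_times, frame_feats, events):
    """Score each frame by the event intensity at its timestamp,
    normalize the scores with a softmax, and pool the frame
    features by the resulting attention weights."""
    scores = np.array([intensity(t, events) for t in frame_times])
    weights = np.exp(scores - scores.max())   # stable softmax
    weights /= weights.sum()
    pooled = weights @ frame_feats            # attention-weighted feature
    return weights, pooled

# Toy usage: 5 frames; gaze enters/leaves the object at t=1.0 and t=3.0.
times = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
feats = np.eye(5)                             # placeholder per-frame features
w, pooled = event_modulated_attention(times, feats, events=[1.0, 3.0])
```

Frames sampled just after a gaze event (here, at t=1.5 and t=3.5) receive higher weights than frames far from any event, which is the mechanism by which the asynchronous event sequence decides which frames the model attends to.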