Video is a type of multimedia that carries a massive amount of information, so enabling computers to understand video content quickly and accurately is a challenging and meaningful problem. This work focuses on the video question answering task, which requires choosing the most accurate answer given a video and a question. Performance on this task is easy to verify, which makes it a good setting for exploring approaches to understanding video content. Most existing methods are based on static image features and use simple models. However, these methods cannot avoid two problems. First, by feeding static features into the model sequentially, they may not learn the continuity of video frames well. Second, when the input sequence is long, simple recurrent neural networks may lose important information during learning. To tackle these two problems, this work uses dynamic video features learned from several consecutive video frames and designs a multiple-level attention neural network. This design can focus on multiple granularities of the question during learning and capture more complete information from the video in order to select the best answer. With this method, we obtain the best performance among all known methods on two reliable datasets. Furthermore, we verify its practicability by examining the detailed parameters of our neural network.