With the rapid development of video acquisition, storage, and processing technologies, the volume of video data has grown significantly. Video data contains a large number of text targets, and text, as a carrier of human information, conveys rich semantic information. Detecting and tracking scene text in video has become a key step in many applications, such as video retrieval, video content understanding, and autonomous driving. The video text tracking task therefore has broad research prospects and application scenarios. However, existing algorithms still face many challenges due to the diversity of video scenes, strong illumination changes, and motion blur. This dissertation takes the Intersection over Union (IoU) distance and the feature vector distance of text targets as its starting point, fully exploits the potential correlation between video frames, and proposes a high-performance text tracking algorithm based on inter-frame data association. The main contributions and innovations of this dissertation are summarized as follows:

1) A text tracking algorithm based on inter-frame spatio-temporal complementary localization is proposed. Motion blur and illumination changes in video data make text targets harder to locate and cause text trajectories to break. To address this problem, this dissertation fully mines the correlation of video data in the temporal dimension and proposes a Siamese Complementary Network. The network uses the position of the text target in the previous frame to locate the target in the current frame, and fuses this result with the position probability map predicted by the text detector to obtain the bounding box of the text target in the current frame. Compared with the baseline algorithm, MOTA improves by 1.9% and 14.82% on the Minetto and ICDAR 2015 Video datasets, respectively, which effectively reduces the breaking of text trajectories.

2) A text tracking algorithm based on feature metric learning between frames is
proposed. The visual similarity of text targets introduces ambiguity into the matching process, resulting in track ID switches. To address this problem, this dissertation designs a Text Similarity Learning Network that encodes the unique semantic information of the text and adopts metric learning to constrain the feature distance of text targets between frames, so that it outputs discriminative text target features. Compared with the baseline algorithm, IDF1 improves by 17.9% and 27.18% on the Minetto and ICDAR 2015 Video datasets, respectively, which effectively mitigates the track ID switch problem.

Combining these two complementary improvements, this dissertation proposes a text tracking algorithm based on inter-frame data association, which achieves the best performance among existing detection-based text tracking algorithms on both the Minetto and ICDAR 2015 Video datasets.
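
As a rough illustration of the inter-frame data association idea summarized above, the following sketch combines an IoU distance and a feature vector distance into a single matching cost and assigns detections in the current frame to existing tracks. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the function names, the weight `alpha`, the threshold `max_cost`, the use of cosine distance for the feature term, and the greedy matching strategy are all hypothetical choices made for the example.

```python
import math


def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def associate(tracks, detections, alpha=0.5, max_cost=0.7):
    """Greedily match tracks from the previous frame to current detections.

    tracks:     dict mapping integer track id -> (box, feature)
    detections: list of (box, feature) for the current frame
    alpha:      assumed weight of the IoU distance versus the feature distance
    max_cost:   assumed gate; pairs with a higher cost are never matched

    Returns (matches, unmatched), where matches maps track id -> detection
    index and unmatched lists detection indices that start new tracks.
    """
    candidates = []
    for track_id, (t_box, t_feat) in tracks.items():
        for j, (d_box, d_feat) in enumerate(detections):
            iou_dist = 1.0 - iou(t_box, d_box)
            feat_dist = cosine_distance(t_feat, d_feat)
            cost = alpha * iou_dist + (1.0 - alpha) * feat_dist
            if cost <= max_cost:
                candidates.append((cost, track_id, j))

    # Cheapest pairs first; each track and each detection is used at most once.
    candidates.sort()
    matches, used = {}, set()
    for _, track_id, j in candidates:
        if track_id in matches or j in used:
            continue
        matches[track_id] = j
        used.add(j)

    unmatched = [j for j in range(len(detections)) if j not in used]
    return matches, unmatched
```

Greedy matching keeps the example dependency-free; an optimal assignment (for example, the Hungarian algorithm) could be substituted for the sorting loop without changing the cost definition.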