| Over the past decade, micro-videos have developed rapidly as a new form of media. Tags, the keywords that describe video content, are widely used to support video recommendation and search. Because users often cannot or do not add sufficient and accurate tags when uploading micro-videos, automatic tag generation is of significant importance. Micro-videos are usually created and uploaded by users, and are characterized by strong sociality and fast iteration. The former leads to a large number of imitation videos that existing methods find difficult to model, while the latter causes new tags to emerge constantly, making it difficult for traditional methods to construct topic relationships among them in a timely manner. These two characteristics bring about three challenges: 1) Tag relationship construction. Most methods obtain tag relationships from open knowledge bases, which cover only a portion of older tags. Some methods attempt to build them from rules, but lack accuracy. 2) Behavior propagation modeling. Imitation behavior usually spreads along social networks, which existing tag generation models cannot model. 3) Visual-language knowledge aggregation. Models learn tag meanings from two sources of knowledge: visual knowledge from related videos and language knowledge from tag relationships. Existing knowledge aggregation models adopt direct aggregation methods, such as vector concatenation or attention mechanisms, which ignore redundancies in the knowledge common to both sources and hence yield unsatisfactory results. To jointly model social influence and tag relationships, this paper formulates micro-video tag generation as a video-tag link prediction problem in a video-user-tag heterogeneous network. Specifically, this paper first proposes a semi-supervised learning-based method for tag relationship construction, which uses statistical information among all tags and a directed acyclic graph prior to construct tag relationships with minimal labeling. Then, this paper integrates tag relationships, video-tag data, and user social networks into the video-user-tag network. Afterwards, to obtain better video and tag vector representations, this paper proposes a heterogeneous graph neural network consisting of gate graph transformers and adversarial aggregation networks, which model behavior propagation and visual-language knowledge aggregation, respectively. Finally, the model calculates the vector similarity between each micro-video and all candidate tags in the video-tag network. Extensive experiments on real-world datasets in three categories (fashion, food, and beauty) verify the superiority of the model over baseline methods. |
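
The final scoring step described above, ranking candidate tags by their vector similarity to a micro-video, can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure, which the abstract does not specify, and uses hand-written toy vectors in place of the embeddings the heterogeneous graph neural network would learn. The function names `cosine` and `rank_tags` and the tag names are hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def rank_tags(video_vec, tag_vecs, top_k=3):
    """Score every candidate tag against one video and keep the top_k."""
    scored = [(tag, cosine(video_vec, vec)) for tag, vec in tag_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings invented for this example; a trained model would supply these.
video = [0.9, 0.1, 0.0]
tags = {
    "streetwear": [1.0, 0.0, 0.0],
    "dessert": [0.0, 1.0, 0.0],
    "makeup": [0.0, 0.0, 1.0],
}

top = rank_tags(video, tags, top_k=2)
print([tag for tag, _ in top])
```

Under this framing, predicting a video-tag link amounts to keeping the highest-scoring tags for each video; the representation learning that produces the vectors is where the proposed network does its work and is not reproduced here.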