Research On User Text Clustering Based On Contrastive Learning

Posted on: 2022-12-07
Degree: Master
Type: Thesis
Country: China
Candidate: X D Tang
Full Text: PDF
GTID: 2518306776492654
Subject: Internet Technology
Abstract/Summary:
Large-scale textual data enriches people's daily lives, but it also raises the challenge of how to manage that data and mine its value. As an unsupervised learning method, cluster analysis offers a solution that does not rely on label information; it uses only the characteristics of the data itself to identify distribution patterns. Text clustering is an important branch of natural language processing and already has successful applications. For example, it can automatically group the large volume of texts that users post on question-and-answer platforms and social media into topics, reducing the burden on platforms; it can also group papers with ambiguous author names into publication sets belonging to different authors. In this thesis, different data augmentation methods are combined with contrastive learning to alleviate data sparseness in text clustering, and a separate clustering framework is designed around the additional information available in each of two scenarios. Specifically:

For text on User Generated Content (UGC) platforms, the author (ID) of a text is usually public. Starting from statistical characteristics, this thesis verifies that each author pays attention to only a limited number of topic categories and publishes texts under related topics, which shows that it is reasonable to consider the author when clustering user texts. Based on this observation, a user text clustering framework, CAT, is proposed. CAT accounts for the author's influence in both the text representation and the clustering objective function. The objective function takes the form of contrastive learning combined with deep representation learning, and the augmented data comes from a fusion of a cluster-level attention representation and the author's representation. Previous work augments data by replacing words in the text, which can lose the words central to the text; CAT addresses this problem in user-generated text clustering. On datasets with authors (IDs), CAT significantly outperforms other text clustering models, both those that use author information and those that do not, which illustrates the benefit of considering author roles in user-generated text clustering and the effectiveness of the proposed method.

For unsupervised name disambiguation of papers on publication management platforms, this thesis proposes, for the first time, an end-to-end heterogeneous-network-based contrastive clustering framework, HINCC. It takes the co-author, co-organization, and citation relationships between papers into account to build a paper information heterogeneous network. HINCC applies node self-masking and edge-based masking as two views for data augmentation, and applies contrastive learning with a heterogeneous graph neural network encoder, so that information from the different types of neighbors around a node is fused interactively into the node representation and nodes with similar neighbor information have closer representations. Pseudo-labels generated by clustering take part in the dynamic sampling of negative samples in contrastive learning, which reduces the performance degradation caused by negative samples from the same category. The effectiveness of the proposed method is demonstrated on three datasets from different domains.

In summary, starting from the additional clustering clues available in different text clustering scenarios, this thesis constructs targeted data augmentation methods and explores how contrastive learning can turn the generated positive and negative representations into gains for clustering. The effectiveness of the proposed methods is verified on multiple public datasets.
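
The idea of letting clustering pseudo-labels guide negative sampling, as described for HINCC, can be illustrated with a small sketch. The abstract does not give the loss function, so the following assumes an InfoNCE-style contrastive loss over two augmented views and takes the simplest reading of "dynamic sampling": other samples that share the anchor's pseudo-label are dropped from the negatives. The function name `masked_info_nce`, its parameters, and the temperature value are hypothetical and are not taken from the thesis.

```python
import numpy as np

def masked_info_nce(view_a, view_b, pseudo_labels, temperature=0.5):
    """InfoNCE-style loss that skips negatives sharing the anchor's pseudo-label.

    Illustrative assumption only; this is not the loss defined in the thesis.
    view_a, view_b: (n, d) arrays, two augmented views of the same n samples.
    pseudo_labels:  (n,) integer cluster assignments from a clustering step.
    """
    # Cosine similarity between every anchor in view_a and every sample in view_b.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    # The positive for sample i is its own second view (the diagonal).
    positives = np.eye(len(a), dtype=bool)
    # Samples with the same pseudo-label are likely false negatives, so only
    # the positive and the differently labelled samples stay in the denominator.
    same_cluster = pseudo_labels[:, None] == pseudo_labels[None, :]
    keep = positives | ~same_cluster
    masked = np.where(keep, logits, -np.inf)

    # Numerically stable log-sum-exp over the kept entries of each row.
    row_max = masked.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(masked - row_max).sum(axis=1))
    return float(np.mean(log_denominator - np.diag(logits)))

# Toy usage: six samples, two noisy views, three pseudo-clusters.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 4))
view_a = base + 0.1 * rng.normal(size=(6, 4))
view_b = base + 0.1 * rng.normal(size=(6, 4))
labels = np.array([0, 0, 1, 1, 2, 2])
print(masked_info_nce(view_a, view_b, labels))
```

When every sample has a distinct pseudo-label, the sketch reduces to an ordinary InfoNCE loss. Masking can only remove terms from the denominator, so the masked loss is never larger than the unmasked one for the same inputs.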
Keywords/Search Tags: Unsupervised Learning, Contrastive Learning, Text Clustering, User Text Mining, Name Disambiguation