Research Of Microblogs Clustering Analysis Based On Text Presentation

Posted on: 2021-02-14  Degree: Master  Type: Thesis
Country: China  Candidate: P Xu  Full Text: PDF
GTID: 2428330611451421  Subject: Software engineering
Abstract/Summary:
Text clustering is an important research problem in the text-processing branch of data mining. Unsupervised clustering can identify latent topic categories in social media, explore unknown but valuable areas, and keep text processing efficient on massive data. It is widely applied to practical problems such as event extraction, user personas, and community detection, and is popular among both scholars and engineers. Text representation is crucial to text clustering. The Vector Space Model (VSM) is the most commonly used representation model for this task. However, VSM suffers from semantic isolation and sparse features, which make it difficult to measure the correlation between texts accurately. In recent years, some scholars have also measured text similarity using representation learning, but its accuracy in unsupervised clustering tasks remains insufficient. To address these problems for the microblog clustering task, this thesis proposes two improved methods. First, we use the TF-IDF algorithm and an external sentiment dictionary to produce a vector space representation and identify sentiment, and then improve the correlation measure between texts with a word embedding model to ease the isolation and sparsity of features during clustering. Second, we propose CIRN, a sentence representation model for text clustering that builds on an advanced self-supervised representation learning model to learn the semantic similarity between texts. By learning a more general distributed representation, it measures the correlation between texts more precisely. The two proposed methods draw on word embedding representations and the Input-Response model, respectively. To evaluate the improved text representations on the clustering task, experiments are carried out on human-annotated Sina Weibo and Twitter datasets. The results show that the improved text representation vectors perform better on the microblog clustering task and achieve good scores in both purity and normalized mutual information.
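The two evaluation metrics named in the abstract, purity and normalized mutual information (NMI), can be sketched in a few lines of standard-library Python. This is only an illustration of the usual textbook definitions, not the thesis's own evaluation code: the function names, the toy labels, and the choice to normalize mutual information by the arithmetic mean of the two entropies are assumptions (other normalizations exist, and the thesis may use a different one).

```python
import math
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of items that belong to the majority gold class of their cluster."""
    clusters = {}
    for gold, cluster in zip(labels_true, labels_pred):
        clusters.setdefault(cluster, Counter())[gold] += 1
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels_true)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two entropies (assumed normalization)."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    mi = sum(
        (c / n) * math.log(c * n / (count_true[g] * count_pred[k]))
        for (g, k), c in joint.items()
    )
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    # If both assignments are a single group, treat them as identical.
    return mi / denom if denom > 0 else 1.0


# Hypothetical toy data: six posts, two annotated topics, two predicted clusters.
gold = ["sports", "sports", "sports", "tech", "tech", "tech"]
pred = [0, 0, 1, 1, 1, 1]
print(purity(gold, pred))
print(nmi(gold, pred))
```

Purity rewards clusters dominated by a single annotated class but can be inflated by producing many small clusters, which is one reason it is commonly reported together with NMI.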
Keywords/Search Tags: Text Clustering, Representation Learning, Social Network