Research On Short Text Clustering Based On CSUAP And TextRank Algorithm

Posted on: 2019-05-12
Degree: Master
Type: Thesis
Country: China
Candidate: J W Zhu
GTID: 2348330542972651
Subject: Computer Science and Technology
Abstract/Summary:
With the spread and development of the Internet, people acquire and produce information on a wide range of network platforms, and a large volume of short web text has accumulated there. These short texts contain abundant information, so mining them has significant research value. Text clustering is an automated data mining technique that assigns similar texts to the same cluster; extracting information from each cluster can quickly show readers the topics and domain information contained in the texts. Unlike the long texts handled by traditional clustering, short texts have fewer words and more scattered content. Based on these characteristics, we propose a method for clustering short texts and a method for extracting information from the resulting clusters. The specific research contents are as follows:
(1) We propose a feature-word weighting method, called CO-TF-IDF in this paper. CO-TF-IDF adds an association semantic weight based on word co-occurrence relationships, which strengthens the associative semantic information between feature words and improves the quality of short text clustering.
(2) We use latent semantic analysis to reduce the dimensionality of the text features, filter redundant information, and overcome the weaknesses of the vector space model in handling synonymy and polysemy.
(3) Real short text collections contain many noisy texts (texts that belong to no topic), and the number of clusters is difficult to determine in advance. To overcome these problems, we propose an improved rough set clustering algorithm, the CSUAP algorithm, for short text clustering. Relative to the original CSUA algorithm, CSUAP adds a step that filters noisy text data and an iterative merging process for the upper approximation sets.
(4) For the short text clusters obtained after clustering, we propose a cluster information extraction method that combines representative texts with keyword tags. First, we extract the representative texts in each cluster from the ranking produced by the TextRank algorithm; then we extract the keywords with the largest comprehensive weight and use them as the labels of the cluster. Readers can quickly grasp the theme of a cluster from the keyword tags and obtain more semantic information from the representative texts.
(5) We designed and implemented a visual short text clustering analysis system based on the short text clustering and cluster information extraction methods proposed in this paper. The system clusters collected short text data sets and extracts the representative texts and word labels of each cluster.
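As a rough illustration of the TextRank-based ranking mentioned in item (4), the sketch below scores the texts of one cluster with weighted PageRank over a text-similarity graph and returns the highest-ranked ones as representatives. It is a minimal sketch, not the thesis's implementation: the Jaccard word-overlap similarity, whitespace tokenization, damping factor of 0.85, fixed iteration count, and the function names textrank_scores and representative_texts are all assumptions made for this example.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity between two token sets (assumed choice)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def textrank_scores(texts: List[str], damping: float = 0.85,
                    iterations: int = 50) -> List[float]:
    """Score each text by weighted PageRank over a similarity graph."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokens = [set(t.lower().split()) for t in texts]
    n = len(texts)
    # Edge weight between every pair of distinct texts; no self-loops.
    weights = [[jaccard(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of edge (j, i); isolated texts pass on nothing.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    return scores


def representative_texts(texts: List[str], top_k: int = 1) -> List[str]:
    """Return the top_k texts with the highest TextRank scores."""
    scores = textrank_scores(texts)
    order = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    return [texts[i] for i in order[:top_k]]
```

Calling representative_texts(cluster, top_k=2) on the list of texts in one cluster would return the two texts most similar to the rest of that cluster. A text that shares no words with any other text keeps only the baseline score of 1 - damping, so it is ranked last.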
Keywords/Search Tags: Short Text Clustering, CO-TF-IDF, CSUAP, Cluster Information Extraction, TextRank