Research On Short Text Clustering Based On CSUAP And TextRank Algorithm

Posted on:2019-05-12

Degree:Master

Type:Thesis

Country:China

Candidate:J W Zhu

Full Text:PDF

GTID:2348330542972651

Subject:Computer Science and Technology

Abstract/Summary:

With the popularization and development of the Internet, people acquire and produce information on a wide range of network platforms, and a large volume of short web text has accumulated there. These short texts contain abundant information, so mining them has important research significance. Text clustering is an automated data mining technique that assigns similar texts to the same cluster; extracting information from each cluster can quickly show people the topics and domain information the texts contain. Unlike the long texts handled by traditional clustering, short texts have fewer words and more scattered content. Based on these characteristics, we proposed a method for short text clustering and a method for extracting information from the resulting clusters. The specific research contents are as follows:

(1) We proposed a feature word weighting method, called CO-TF-IDF in this paper. CO-TF-IDF adds an association semantic weight based on word co-occurrence relationships, which strengthens the association semantic information between feature words and improves the quality of short text clustering.

(2) We used latent semantic analysis to reduce the dimensionality of the text features, filter redundant information, and overcome the shortcomings of the vector space model in handling synonymy and polysemy.

(3) Real short text collections contain many noisy texts (texts with no subject attribution), and it is difficult to determine the number of clusters in advance. To overcome these problems, we proposed an improved rough set clustering algorithm, CSUAP, for short text clustering. Building on the original CSUA algorithm, CSUAP adds a filtering step for noisy texts and an iterative merging process for the upper approximation sets.

(4) For the clusters obtained, we proposed a cluster information extraction method that combines representative texts with keyword tags. We first extracted representative texts from each cluster based on the ranking produced by the TextRank algorithm, then extracted the keywords with the largest comprehensive weight and used them as the labels of the cluster. The keyword tags let people quickly grasp the theme of a cluster, and the representative texts provide more semantic information.

(5) We designed and implemented a visual short text clustering analysis system based on the clustering and cluster information extraction methods proposed in this paper. The system can cluster collected short text data sets and extract the representative texts and word labels of each cluster.
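
The representative-text step in (4) can be illustrated with a minimal sketch. The abstract does not say how the thesis builds the TextRank graph, so everything here is an assumption made for illustration: texts are tokenized on whitespace, edges are weighted by cosine similarity of word counts, and the names `cosine`, `textrank`, and `representative_texts` are hypothetical rather than taken from the thesis.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def textrank(texts, damping=0.85, iterations=50):
    """Score each text with a PageRank-style iteration over a similarity graph.

    Assumption: edge weight = cosine similarity of whitespace-tokenized texts.
    """
    n = len(texts)
    if n == 0:
        return []
    bags = [Counter(t.lower().split()) for t in texts]
    # Pairwise similarity matrix without self-loops.
    w = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out = [sum(row) for row in w]  # total edge weight leaving each node
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each text receives a share of its neighbours' scores,
        # proportional to the similarity between them.
        scores = [
            (1 - damping) / n
            + damping * sum(w[j][i] / out[j] * scores[j]
                            for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return scores


def representative_texts(texts, k=1):
    """Return the k highest-scoring texts of a cluster."""
    scores = textrank(texts)
    ranked = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    return [texts[i] for i in ranked[:k]]
```

Calling `representative_texts(cluster, k=3)` on the texts of one cluster would return its three highest-scoring members. A text that shares words with many other texts in the cluster ranks high, while a text that shares no words with the rest keeps only the baseline score.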

Keywords/Search Tags:

Short Text Clustering, CO-TF-IDF, CSUAP, Cluster Information Extraction, TextRank

Related items

1	Research On The Optimization Of TextRank Keyword Extraction Algorithm And SOM Text Clustering Model
2	Research On Short Text Automatic Summarization Algorithm Based On TextRank And Word2Vec
3	Research On Short Text Clustering And Cluster Description Method
4	Short Text Clustering Method Based On BTM
5	Research On Chinese Text Summarization Method Based On Improved TextRank
6	Research On The Topic Clustering Of Network Short Text
7	Social Media Short Text Clustering And Its Applications
8	Research And Implementation Of Campus Search Engine With Entity Analysis And Short Text Clustering
9	Research On Conversation Extraction And Analysis Of Short Text Message Stream
10	Design And Implementation Of Algorithms And Applications For Cluster Analysis To Short Text Data