
Research On Short Text Clustering And Cluster Description Method

Posted on: 2015-03-18
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Shao
GTID: 2298330467485829
Subject: Information management and e-government
Abstract/Summary:
With the rapid development of Web 2.0 and mobile internet technology, many new internet applications have emerged, such as user-interactive question answering systems, WeChat, and micro blogs. These applications produce large amounts of short text and are changing the way information is presented. Short text differs greatly from traditional Web pages: it is brief, its wording is not standardized, it makes extensive use of buzzwords, and its volume grows rapidly. As a result, traditional text clustering algorithms are not well suited to short text, and research on short text clustering is valuable for extracting the rich information that short texts contain.

Based on an analysis of the characteristics of short text, this paper discusses the key technologies for short text clustering. The main research contents are as follows.

First, a two-stage short text clustering method is proposed. To cope with the dynamic growth and huge quantity of short texts, we adopt a divide-and-conquer strategy. In the first stage, we set a reasonable window size and slide the window over the short texts, applying a traditional hierarchical clustering method to the short texts within each window to obtain micro clusters. In the second stage, we use a method based on information entropy to merge the clusters obtained from different windows, and do two things during merging. First, if a cluster has not changed over several merges and contains only a few short texts, those short texts are treated as isolated points and deleted. Second, we compute the stability degree of the remaining clusters; if it reaches a certain level, we consider the cluster to be in a steady state and save it into the final results.

Second, this paper proposes a method for describing short text clusters.
The method describes short text clusters from two perspectives. First, we use a PageRank-based algorithm to rank the short texts in each cluster and select the top-k short texts as representatives of the cluster. Second, we calculate the weights of the words in the top-k short texts and select several words as the cluster label. On the one hand, the top-k representative short texts make the clusters easier to read; on the other hand, the labels serve as cluster identifiers and play a role similar to the title of an article.

Finally, we built a prototype micro blog system and applied the proposed short text clustering algorithm and cluster description method to it, in order to test the algorithms in practice. With the micro blog clustering module and the cluster description module added, users can obtain the hot topics of micro blogs.

Through the study of short text clustering, this work explores forms of information organization in the context of the mobile internet and Web 2.0. It is hoped that these methods will play a positive role in promoting topic discovery and tracking, Internet information supervision, public opinion guidance, and related tasks.
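The cluster description step (PageRank-based ranking of short texts, then word weighting over the top-k texts) can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the damping factor, or the word-weighting scheme, so the choices below are assumptions made only for illustration: Jaccard similarity over whitespace tokens as edge weights, a damping factor of 0.85, and word weights equal to the summed rank scores of the top-k texts containing the word. The function names `rank_texts` and `describe_cluster` are hypothetical and do not come from the thesis.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; the thesis may segment differently)."""
    return text.lower().split()


def jaccard(a, b):
    """Jaccard similarity between two token lists, used here as an assumed edge weight."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_texts(texts, damping=0.85, iterations=50):
    """Score each short text with weighted PageRank over a text-similarity graph."""
    n = len(texts)
    if n == 0:
        return []
    tokens = [tokenize(t) for t in texts]
    # weights[i][j] is the similarity between text i and text j (no self-loops).
    weights = [
        [jaccard(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Texts with no similar neighbours pass on no score in this sketch.
            incoming = sum(
                scores[i] * weights[i][j] / out_sum[i]
                for i in range(n)
                if out_sum[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def describe_cluster(texts, k=2, num_labels=3, stopwords=frozenset()):
    """Return (top-k representative texts, label words) for one cluster."""
    scores = rank_texts(texts)
    order = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    top = order[:k]
    # Assumed weighting: a word's weight is the summed score of the top-k texts containing it.
    word_weight = Counter()
    for i in top:
        for word in set(tokenize(texts[i])):
            if word not in stopwords:
                word_weight[word] += scores[i]
    ranked_words = sorted(word_weight.items(), key=lambda kv: (-kv[1], kv[0]))
    labels = [word for word, _ in ranked_words[:num_labels]]
    return [texts[i] for i in top], labels
```

Calling `describe_cluster(cluster_texts, k=3)` on the texts of one cluster returns the representative texts together with candidate label words. In practice a stopword list and a language-appropriate tokenizer would matter a great deal for micro blog text, and both are left as parameters or assumptions here.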
Keywords/Search Tags: Short-text Clustering, Two-stage Clustering, Cluster Description, Short-Text Ranking