Short Text Clustering Method Based On BTM

Posted on: 2015-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: Q L Tang
GTID: 2268330428968666
Subject: Computer technology
Abstract/Summary:
With the development of the Internet and network technology, many communication platforms have come into wide use, such as mobile phone short messages, micro-blogs, email, forums, chat software, and news comments. These platforms generate large volumes of short texts used for personal communication. Such short texts touch on every area of daily life and have become a widely accepted channel for exchanging information, gradually changing the way people communicate and live. Mining these huge collections of short texts not only makes them easier to manage but also supports information discovery and analysis. Given the sheer volume, however, it is difficult to extract useful information quickly by manual means, so applying computer technology to the mining and analysis of short texts is of considerable significance.

Text clustering is one of the most basic technologies in natural language processing. Analyzing and organizing short texts with clustering techniques can uncover the internal links among texts, making it easier for people to understand and manage the information as a whole. Short texts, however, differ from long texts: they contain few words, use concise expressions, lack rich contextual information, and carry little information. As a result, short text features are sparse, and it is difficult to extract valid document features accurately. In addition, traditional clustering methods perform poorly on short texts. These factors make short text clustering a challenging research problem and have kept the development of short text clustering techniques relatively slow.

The main work of this paper is as follows:

1) This paper reviews the research status of short text clustering, its research difficulties, and the commonly used methods. It then briefly introduces the key technologies of short text clustering: word segmentation and stop-word removal during preprocessing, several important text models, text clustering methods, document similarity calculation, evaluation indexes for clustering results, and cluster description.

2) This paper introduces BTM (the biterm topic model) in detail, analyzes and compares the similarities and differences among BTM, LDA, and the mixture of unigrams model, and describes the BTM semantic space and the process of parameter inference by Gibbs sampling. Experiments illustrate the BTM-based document features and semantic representation of documents, and the advantages of BTM in handling the sparsity of short texts are summarized.

3) BTM is applied to short text clustering. The document-topic and topic-word probability distribution matrices obtained from BTM training are combined with the traditional TF-IDF feature word space, so that topic features are added to the word features to improve the quality of short text clustering.

4) This paper proposes a BTM-based cluster description method that uses the topics of the documents in each cluster of the clustering result, together with the topic-word feature space obtained from BTM training. This makes the clustering results easy to describe and understand.

K-means clustering was performed on a corpus crawled from Baidu Knows, a popular Chinese Q&A website. Analysis and comparison of the experimental results show that the proposed method outperforms traditional methods such as VSM and LDA, and that the cluster descriptions it produces are more accurate. This confirms the validity of BTM-based short text clustering.
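As a rough illustration of the idea in contribution 3, the sketch below appends a weighted document-topic vector to a TF-IDF vector and compares documents by cosine similarity. The abstract does not say how the two feature spaces are combined, so the concatenation scheme, the `topic_weight` parameter, the function names, and the toy `theta` values (standing in for BTM output) are all assumptions made for illustration, not the thesis's actual method.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF keeps weights positive even for very common words.
        vec = [
            (tf[w] / len(doc)) * (math.log((1 + n_docs) / (1 + df[w])) + 1.0)
            for w in vocab
        ]
        vectors.append(vec)
    return vocab, vectors


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_features(tfidf_vec, topic_vec, topic_weight=0.5):
    """Concatenate word features with weighted topic features (assumed scheme)."""
    words = [(1.0 - topic_weight) * x for x in l2_normalize(tfidf_vec)]
    topics = [topic_weight * x for x in l2_normalize(topic_vec)]
    return words + topics


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


# Toy tokenized short texts (hypothetical data).
docs = [
    ["phone", "battery", "drains"],
    ["mobile", "charge", "slow"],
    ["pasta", "recipe", "sauce"],
]
# Hypothetical document-topic distributions, standing in for BTM output.
theta = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
]

vocab, tfidf = tfidf_vectors(docs)
combined = [combine_features(v, t) for v, t in zip(tfidf, theta)]
```

In this toy data the first two documents share no words, so their TF-IDF similarity is zero, while the added topic features let them score as more similar to each other than to the third document. Vectors like `combined` could then be passed to K-means, the clustering algorithm used in the experiments.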
Keywords/Search Tags: short text, text clustering, BTM, topic model, cluster description