
Research On Short-message Text Clustering

Posted on: 2012-09-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wu
Full Text: PDF
GTID: 2248330395485138
Subject: Computer Science and Technology
Abstract/Summary:
In the information era, instant messaging has become widely used as communication technology has developed, and huge volumes of short-message data have accumulated. This data contains a large number of valuable information resources, so collecting, storing, analyzing, and mining it is significant for information management and information retrieval.

Short messages are typically dynamic, interleaved, informal, and large in scale, and these characteristics pose challenges for data mining. Against the background of short-message mining, this thesis focuses on short-message clustering technology, which involves short-message preprocessing, conversation detection, similarity measurement, and clustering algorithms. The research concentrates on the latter two, aiming to improve the accuracy and scalability of clustering and to make the clustering output useful for applications.

The main research of this thesis can be summarized as follows.

First, a system for short-message clustering is developed. The system consists of data collection, data storage, clustering nodes, and output nodes. This thesis describes the structure of the system and analyzes the function of each part. The work here centers on the problems of data collection, including how to collect messages, how to differentiate messages according to time, and how to extract a conversation. This system is the foundation for the rest of the thesis.

Second, a semantics-based similarity measure for short texts is put forward. The measure is based on HowNet: it computes the semantic distance between words using HowNet, derives word similarity from that distance, and then computes text similarity using term weights. This approach addresses the similarity drift caused by sparse keywords in short texts.

For short-message text clustering, this thesis proposes a hybrid clustering algorithm called SMHC, which combines frequent term sets with the Ant-Tree algorithm.
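As an illustration of the term-weighted semantic similarity described above, the following is a minimal sketch, not the thesis's actual formula. It assumes that word similarity is obtained from semantic distance as alpha / (distance + alpha), a conversion commonly used with HowNet, and that text similarity is a symmetric, weight-averaged best match between terms. The names `word_similarity`, `text_similarity`, `toy_distance`, and the constant `ALPHA` are hypothetical, and the toy distance table stands in for a real HowNet lookup.

```python
from typing import Callable, Dict

ALPHA = 1.6  # assumed smoothing constant; not taken from the thesis

Distance = Callable[[str, str], float]


def word_similarity(distance: float, alpha: float = ALPHA) -> float:
    """Map a non-negative semantic distance to a similarity in (0, 1]."""
    return alpha / (distance + alpha)


def directed_similarity(a: Dict[str, float], b: Dict[str, float], dist: Distance) -> float:
    """Weight-averaged best match of each term in `a` against the terms in `b`."""
    total_weight = sum(a.values())
    if total_weight == 0 or not b:
        return 0.0
    score = 0.0
    for term, weight in a.items():
        # Best semantic match for this term among the terms of the other text.
        best = max(word_similarity(dist(term, other)) for other in b)
        score += weight * best
    return score / total_weight


def text_similarity(a: Dict[str, float], b: Dict[str, float], dist: Distance) -> float:
    """Symmetric text similarity; `a` and `b` map terms to term weights."""
    return 0.5 * (directed_similarity(a, b, dist) + directed_similarity(b, a, dist))


# Hypothetical stand-in for a HowNet distance lookup (values are made up).
TOY_DISTANCES = {frozenset(("movie", "film")): 0.4}


def toy_distance(x: str, y: str) -> float:
    if x == y:
        return 0.0
    return TOY_DISTANCES.get(frozenset((x, y)), 20.0)
```

Because each term is matched to its semantically closest counterpart rather than to an identical string, two short texts that share no keywords can still receive a high score, which is the effect the abstract attributes to the semantic measure.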
The text clustering algorithm based on frequent term sets is highly efficient because it avoids operations on high-dimensional vectors. The Ant-Tree algorithm produces clusters that are closer to the real classification of the data, and because it is based on a tree structure it processes data with high performance. The hybrid algorithm exploits the efficiency of frequent-term-set clustering on text data to produce the initial clusters, then eliminates the overlapping documents among the initial clusters by calculating the silhouette coefficient, and further refines the clusters with Ant-Tree. This yields high-quality clustering results, and because the results retain the tree structure they provide richer information for applications.

Finally, a short-message text mining system for chat software is designed. This thesis introduces the overall structure of the system and expounds the functional structure and the design and implementation of each module.
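The overlap-elimination step of the hybrid algorithm can be sketched as follows. The abstract only states that the silhouette coefficient is used, so the rule shown here is an assumption: a document that appears in several initial clusters is kept only in the cluster where its silhouette value is highest. The function names and the pairwise distance callback `dist` are hypothetical.

```python
from typing import Callable, Dict, List, Set

Distance = Callable[[int, int], float]


def mean_distance(doc: int, members: Set[int], dist: Distance) -> float:
    """Average distance from `doc` to the other members of a cluster."""
    others = [m for m in members if m != doc]
    if not others:
        return 0.0
    return sum(dist(doc, m) for m in others) / len(others)


def silhouette(doc: int, home: int, clusters: List[Set[int]], dist: Distance) -> float:
    """Silhouette of `doc` when it is treated as a member of clusters[home]."""
    if all(m == doc for m in clusters[home]):
        return 0.0  # usual convention for a cluster containing only this document
    a = mean_distance(doc, clusters[home], dist)
    rivals = [
        mean_distance(doc, c, dist)
        for i, c in enumerate(clusters)
        if i != home and any(m != doc for m in c)
    ]
    if not rivals:
        return 0.0
    b = min(rivals)  # mean distance to the nearest other cluster
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


def remove_overlap(clusters: List[Set[int]], dist: Distance) -> List[Set[int]]:
    """Keep each overlapping document only in its best-silhouette cluster.

    Silhouettes are computed against the initial (overlapping) clusters.
    """
    result = [set(c) for c in clusters]
    membership: Dict[int, List[int]] = {}
    for i, c in enumerate(clusters):
        for doc in c:
            membership.setdefault(doc, []).append(i)
    for doc, homes in membership.items():
        if len(homes) < 2:
            continue
        best = max(homes, key=lambda h: silhouette(doc, h, clusters, dist))
        for h in homes:
            if h != best:
                result[h].discard(doc)
    return result
```

In this sketch the output is a set of disjoint clusters, which could then be passed to a refinement stage such as Ant-Tree; that refinement is not shown here.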
Keywords/Search Tags: short message, semanteme, short text similarity, text clustering, frequent term sets, Ant-Tree