Research On Short-message Text Clustering

Posted on:2012-09-15

Degree:Master

Type:Thesis

Country:China

Candidate:Y Wu

Full Text:PDF

GTID:2248330395485138

Subject:Computer Science and Technology

Abstract/Summary:

In the information era, instant messaging has become widely used as communication technology has developed, and huge amounts of short-message data have accumulated. This data contains a large number of valuable information resources, so its collection, storage, analysis, and mining are significant for information management and information retrieval.

Short messages are typically dynamic, interleaved, informal, and large in scale, and these characteristics pose challenges for data mining. Against the background of short-message mining, this thesis focuses mainly on short-message clustering technology, which involves short-message preprocessing, conversation detection, similarity measurement, and clustering algorithms. The research concentrates on the latter two, aiming to improve the accuracy and scalability of clustering and to make the clustering output useful to applications. The main work of this thesis can be summarized as follows.

First, a system for short-message clustering is developed. The system consists of data collection, data storage, clustering nodes, and output nodes. This thesis describes the structure of the system and analyzes the function of each part. The study concentrates on the problems of data collection, including how to collect messages, how to differentiate messages according to time, and how to extract a conversation. This system is the foundation of the whole thesis.

Then, a semantics-based similarity measure for short texts is put forward. The measure is based on HowNet: it calculates the semantic distance between words using HowNet, derives word similarity from that distance, and computes text similarity using term weights. This approach addresses the similarity drift caused by sparse keywords in short texts.

For short-message text clustering, this thesis proposes a hybrid clustering algorithm called SMHC, which combines frequent term sets with the Ant-Tree algorithm. Clustering based on frequent term sets is highly efficient because it avoids operations on high-dimensional vectors. The Ant-Tree algorithm produces clusters that are closer to the true classification of the data, and it processes data efficiently because it is based on a tree structure. The hybrid algorithm takes advantage of the efficiency of frequent-term-set clustering to produce the initial clusters, then eliminates overlapping documents among the initial clusters by calculating the silhouette coefficient, and finally refines the clusters with Ant-Tree, yielding high-quality clustering results. Because the results retain the tree structure, they provide richer information to applications.

Finally, a short-message text mining system for chat software is designed. This thesis introduces the overall structure of the system and describes the functional structure and the design and implementation of each module.
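As a rough illustration of two steps named in the abstract, the sketch below computes a term-weighted text similarity from word-level similarities and then uses the silhouette coefficient to decide which cluster keeps a document that was assigned to more than one initial cluster. It is a minimal sketch under stated assumptions, not the thesis's implementation: the abstract does not give the formulas, so the best-match averaging in `text_similarity`, the keep-the-highest-silhouette rule in `resolve_overlaps`, the toy word-similarity table (a stand-in for HowNet), and all names are hypothetical. Frequent-term-set mining and the Ant-Tree refinement are not shown.

```python
from typing import Callable, Dict, List, Set

Weights = Dict[str, float]  # term -> weight (e.g. TF-IDF); hypothetical representation


def text_similarity(a: Weights, b: Weights,
                    word_sim: Callable[[str, str], float]) -> float:
    """Weighted best-match similarity between two short texts (assumed formula)."""
    def directed(src: Weights, dst: Weights) -> float:
        total = sum(src.values())
        if total == 0 or not dst:
            return 0.0
        # Each source term contributes its weight times its best match in dst.
        best = sum(w * max(word_sim(t, u) for u in dst) for t, w in src.items())
        return best / total
    return 0.5 * (directed(a, b) + directed(b, a))


def silhouette(doc: int, own: Set[int], others: List[Set[int]],
               dist: Callable[[int, int], float]) -> float:
    """Silhouette of `doc` when it is treated as a member of `own`."""
    peers = own - {doc}
    rivals = [c - {doc} for c in others]
    rivals = [c for c in rivals if c]
    if not peers or not rivals:
        return 0.0
    a = sum(dist(doc, p) for p in peers) / len(peers)               # cohesion
    b = min(sum(dist(doc, p) for p in c) / len(c) for c in rivals)  # separation
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


def resolve_overlaps(clusters: List[Set[int]],
                     dist: Callable[[int, int], float]) -> List[Set[int]]:
    """Keep each multiply-assigned document only where its silhouette is highest."""
    result = [set(c) for c in clusters]
    for d in sorted(set().union(*clusters)):
        homes = [i for i, c in enumerate(result) if d in c]
        if len(homes) < 2:
            continue
        best = max(homes, key=lambda i: silhouette(
            d, result[i], result[:i] + result[i + 1:], dist))
        for i in homes:
            if i != best:
                result[i].discard(d)
    return result


# Toy word similarities standing in for HowNet-derived values (made up).
TOY_SIM = {frozenset(("movie", "film")): 0.9,
           frozenset(("dinner", "lunch")): 0.8}


def toy_word_sim(x: str, y: str) -> float:
    return 1.0 if x == y else TOY_SIM.get(frozenset((x, y)), 0.0)


docs = [
    {"movie": 1.0, "tonight": 0.5},   # 0
    {"film": 1.0, "ticket": 0.5},     # 1
    {"dinner": 1.0, "tonight": 0.5},  # 2
    {"lunch": 1.0, "menu": 0.5},      # 3
]


def dist(i: int, j: int) -> float:
    return 1.0 - text_similarity(docs[i], docs[j], toy_word_sim)


initial = [{0, 1, 2}, {2, 3}]  # document 2 appears in both initial clusters
final = resolve_overlaps(initial, dist)
print(final)
```

In this toy data, document 2 shares the literal term "tonight" with document 0 but is semantically closer to document 3 through the "dinner"/"lunch" pair, so it is kept only in the second cluster. A real word-similarity function would replace `toy_word_sim`.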

Keywords/Search Tags:

short message, semanteme, short text similarity, text clustering, frequent term sets, Ant-Tree

Related items

1	Social Media Short Text Clustering And Its Applications
2	Related Technologies Research On Short Message Clustering
3	The Research And Implementation Of Massive Short Message Mining Technology
4	Clustering Algorithm Research Of Short Text Based On Semantic Similarity
5	Research On The Method Of Semantic Similarity Calculation Of Short Texts Based On HowNet
6	Research On The Key Technology Of Short Message Text
7	A Short Texts Matching Method Using Multi-level Features
8	Short Texts Feature Extraction And Classification Techniques For Supporting Multi-level Semanteme
9	Research On The Method And Exploitation Of Traditional Chinese Micro-blogging Short Topic
10	Message Text Clustering Based On Frequent Patterns