Related Technologies Research On Short Message Clustering

Posted on:2009-01-31

Degree:Doctor

Type:Dissertation

Country:China

Candidate:L Wang

Full Text:PDF

GTID:1118360278456609

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

With the development of Internet and communication technology, IM (Instant Messaging) and IRC (Internet Relay Chat) are becoming populated. As a result, large volumes of short text messages are accumulated. These short text messages carry public data as well as abundant information about people and become valuable information resources. Mining on short text messages is useful for information management and system optimization. It is also beneficial for focused information extraction and information understanding.Short text message is a kind of short text for interactive communication. It has the characteristics of imperfection, interweavingness, irregularity and confusing similarity. Furthermore, high-speed message stream results in the massive database of archived messages. All these characteristics bring challenges for short text message mining. To build an efficient, effective and elastic system for short text messages, this thesis focuses on the key technologies related to short message clustering, including the preprocessing for short text messages, similarity measurement between short text segments and parallel clustering algorithms.The main contributions of this paper can be summarized as follows:1. A model for short text message clustering is developed. This model amalgamates the messages both in the on-line and archival scenarios. It consists of conversation extraction methods for dynamic stream of short text messages, short text representation and similarity measurement, as well as parallel algorithms for short text message clustering. This model is also the fundamental work of the whole thesis.2. To extract conversations from short text message stream, a conversation extracting algorithm, called DWExter is proposed through analysis to the characteristics of content, syntax and time of short text messages. A double time mechanism is designed based on the time characteristic of messages. This mechanism makes DWExter more efficient. The experiment on real dataset shows that DWExter improves the F-measure about 85% and 18% according to the two baseline algorithms.3. To address the problem of sparse key-words and similarity drift in short text segments, an algorithm called CrtNRG is designed, which defines word similarity based on HowNet and corpus. Based on CrtNRG, we also proposed an algorithm called SiM to represent and measure the similarity between short text segments. Experiments for analyzing short text message stream and clustering on large scale short text message database are performed in order to evaluate our algorithms. The experimental results show that compared with the TF-IDF method, the corpus-based similarity measuring method can improve the cluster quality about 36% and the SiM based method can double the cluster quality.4. The characteristics of the k-means algorithm and the algorithms based on frequent term sets are summarized and a hybrid clustering algorithm called SHDC is proposed by combining the good features of k-means and frequent term sets. SHDC employs SiM to calculate the similarity between short message conversations and produces high quality clusters with descriptive information. In order to reduce the coupling degree among subset of dataset for parallel clustering, a vertical data partitioning policy is designed. Based on this data partitioning policy, the parallel k-means algorithm is optimized and the concept of rough clustering is introduced in this paper. A parallel algorithm called parROC for rough clustering is also implemented. Finally, a parallel hybrid algorithm parSHDC is constructed. Compared with two parallel clustering algorithm, PDDP K-means and parallel k-means, parSHDC improve the cluster quality 12% and 18% respectively. The speed-up ratio of parSHDC is boosted about 38% and 50%.5. Finally, a prototype system for short text message-clustering, called StarSTMiner+, is implemented based on StarTPMonitor, which is a general massive data processing platform and used for data mining. The structure and primary components of StarSTMiner+ are explained in this paper.

Keywords/Search Tags:

short text message, conversation detection, similarity between short text segments, text clustering algorithm, parallel algorithm

PDF Full Text Request

Related items

1	Research On Conversation Extraction And Analysis Of Short Text Message Stream
2	Research On The Key Technology Of Short Message Text
3	Social Media Short Text Clustering And Its Applications
4	Research On Short-message Text Clustering
5	Automatic Summarization Alorgithm For Chiness Short Text
6	Research On Short Text Conversation In Combination Of Retrieval And Generation
7	A Short Texts Matching Methodusing Multi-level Features
8	The Study Of Measures And Applications Of Short Text Semantic Similarity
9	Research On The Generation Method Of Short Text Dialogues With Fusion Of Positions
10	The Research And Implementation Of Massive Short Message Mining Technology