Font Size: a A A

Related Technologies Research On Short Message Clustering

Posted on:2009-01-31Degree:DoctorType:Dissertation
Country:ChinaCandidate:L WangFull Text:PDF
GTID:1118360278456609Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the development of Internet and communication technology, IM (Instant Messaging) and IRC (Internet Relay Chat) are becoming populated. As a result, large volumes of short text messages are accumulated. These short text messages carry public data as well as abundant information about people and become valuable information resources. Mining on short text messages is useful for information management and system optimization. It is also beneficial for focused information extraction and information understanding.Short text message is a kind of short text for interactive communication. It has the characteristics of imperfection, interweavingness, irregularity and confusing similarity. Furthermore, high-speed message stream results in the massive database of archived messages. All these characteristics bring challenges for short text message mining. To build an efficient, effective and elastic system for short text messages, this thesis focuses on the key technologies related to short message clustering, including the preprocessing for short text messages, similarity measurement between short text segments and parallel clustering algorithms.The main contributions of this paper can be summarized as follows:1. A model for short text message clustering is developed. This model amalgamates the messages both in the on-line and archival scenarios. It consists of conversation extraction methods for dynamic stream of short text messages, short text representation and similarity measurement, as well as parallel algorithms for short text message clustering. This model is also the fundamental work of the whole thesis.2. To extract conversations from short text message stream, a conversation extracting algorithm, called DWExter is proposed through analysis to the characteristics of content, syntax and time of short text messages. A double time mechanism is designed based on the time characteristic of messages. This mechanism makes DWExter more efficient. The experiment on real dataset shows that DWExter improves the F-measure about 85% and 18% according to the two baseline algorithms.3. To address the problem of sparse key-words and similarity drift in short text segments, an algorithm called CrtNRG is designed, which defines word similarity based on HowNet and corpus. Based on CrtNRG, we also proposed an algorithm called SiM to represent and measure the similarity between short text segments. Experiments for analyzing short text message stream and clustering on large scale short text message database are performed in order to evaluate our algorithms. The experimental results show that compared with the TF-IDF method, the corpus-based similarity measuring method can improve the cluster quality about 36% and the SiM based method can double the cluster quality.4. The characteristics of the k-means algorithm and the algorithms based on frequent term sets are summarized and a hybrid clustering algorithm called SHDC is proposed by combining the good features of k-means and frequent term sets. SHDC employs SiM to calculate the similarity between short message conversations and produces high quality clusters with descriptive information. In order to reduce the coupling degree among subset of dataset for parallel clustering, a vertical data partitioning policy is designed. Based on this data partitioning policy, the parallel k-means algorithm is optimized and the concept of rough clustering is introduced in this paper. A parallel algorithm called parROC for rough clustering is also implemented. Finally, a parallel hybrid algorithm parSHDC is constructed. Compared with two parallel clustering algorithm, PDDP K-means and parallel k-means, parSHDC improve the cluster quality 12% and 18% respectively. The speed-up ratio of parSHDC is boosted about 38% and 50%.5. Finally, a prototype system for short text message-clustering, called StarSTMiner+, is implemented based on StarTPMonitor, which is a general massive data processing platform and used for data mining. The structure and primary components of StarSTMiner+ are explained in this paper.
Keywords/Search Tags:short text message, conversation detection, similarity between short text segments, text clustering algorithm, parallel algorithm
PDF Full Text Request
Related items