Message Text Clustering Based On Frequent Patterns

Posted on:2007-11-24

Degree:Master

Type:Thesis

Country:China

Candidate:J X Hu

GTID:2178360185954172

Subject:Computer software and theory

Abstract/Summary:

The growth of the network has drastically changed daily life. Most people now choose e-mail, BBS, chat rooms, IM (Instant Messaging) and SMS (Short Message Service) as their primary means of communication, rather than traditional letters or the telephone. By virtue of these new means, people can share and exchange information quickly, even in near real time. Message text arising from human interaction has become predominant in Internet information flow. Such data not only conveys public messages but also carries a large amount of personal information, and has thus become an important information source. This thesis aims at content-based message text mining through clustering analysis, for novel applications such as dynamic topic identification and on-line community discovery.

Message text differs radically from plain text and static web pages because of its dynamic and informal nature. The large amount of NIL (Network Informal Language) in message text makes the extraction of textual features (so-called terms) very difficult. To deal with this problem, we discover words and phrases that occur frequently (referred to as frequent patterns in text) so as to identify terms in message text effectively. We summarize several frequent pattern discovery algorithms, implement them, and evaluate their performance on real data sets. Experimental results show that our implementation meets the needs of practical applications.

Frequent patterns that have stable structures, complete semantics and adequate circulation are called significant frequent patterns (SFPs). Compared with single words, SFPs keep more useful semantic information, such as word order and adjacency, express more specific meanings, and are thus more appropriate as textual features. We propose a novel term extraction method based on frequent patterns, which can extract meaningful terms from text. The proposed method is language-independent and can be applied to Chinese text without word segmentation. Furthermore, we propose an unsupervised feature selection method based on SFPs, which remarkably reduces the dimension without hurting the performance of classification and clustering. Experimental results demonstrate the approach's effectiveness: in classification it achieves performance comparable to or even better than state-of-the-art supervised feature selection algorithms such as information gain (IG) and the chi-square statistic (CHI), and it performs well on dimension reduction for clustering.

Frequent pattern based text clustering algorithms have several advantages over their traditional counterparts, namely lower dimension, better clustering quality and understandable cluster labels. We verify the effectiveness of SFPs for clustering on message corpora; the experimental results show that SFPs improve the quality of traditional text clustering algorithms and reduce the dimension remarkably. Moreover, the frequent patterns aid cluster interpretation.
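
The idea of discovering frequent patterns directly from unsegmented text can be illustrated with a minimal sketch. It is not the thesis's algorithm, whose details the abstract does not give. The sketch assumes that a pattern is a character n-gram, that support is the number of messages containing it, and that a pattern is dropped when a longer pattern has the same support. The function names, the parameters `min_support` and `max_len`, and the sample messages are all hypothetical.

```python
from collections import Counter


def frequent_patterns(messages, min_support=2, max_len=4):
    """Return character n-grams found in at least `min_support` messages.

    Assumption: support is counted per message (document frequency),
    and n-grams containing whitespace are ignored.
    """
    doc_freq = Counter()
    for text in messages:
        seen = set()  # count each n-gram at most once per message
        for n in range(1, max_len + 1):
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                if not any(ch.isspace() for ch in gram):
                    seen.add(gram)
        doc_freq.update(seen)
    return {p: c for p, c in doc_freq.items() if c >= min_support}


def closed_patterns(patterns):
    """Drop a pattern if a longer pattern containing it has the same support.

    Assumption: this is only a crude stand-in for the "stable structure"
    and "complete semantics" criteria named in the abstract.
    """
    kept = {}
    for p, c in patterns.items():
        absorbed = any(
            p != q and p in q and patterns[q] == c for q in patterns
        )
        if not absorbed:
            kept[p] = c
    return kept


if __name__ == "__main__":
    # Hypothetical sample messages; no word segmentation is applied.
    sample = ["今天天气好", "今天天气差", "明天见"]
    print(closed_patterns(frequent_patterns(sample, min_support=2)))
```

Because the sketch works on raw characters, it needs no word segmenter, which mirrors the language-independence claim above. The surviving patterns could then serve as candidate terms for a vector-space representation.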

Keywords/Search Tags:

Message Text, Text Clustering, Frequent Pattern, Term Extraction, Feature Selection

Related items

1	Text Classification Method Based On Maximum Frequent Sequence Pattern
2	Research And Implementation Of Bad Message Text Detection Method Based On Frequent Pattern Mining
3	Research On Short-message Text Clustering
4	The Research Of Text Representation And Feature Selection In Text Categorization
5	The Research And Implementation Of Massive Short Message Mining Technology
6	Text Classification Method Based On The Longest Closed Frequent Sequential Patterns
7	Research On Clustering Approach For Text Messages
8	The Method Of Text Categorization Scheme Selection And Development Of A Prototype System
9	Research On High Performance Chinese Text Classification Based On Machine Learning
10	Research Of Feature Vector Value Weighted Based On Semantic Analysis In Chinese Text Clustering