
Message Text Clustering Based On Frequent Patterns

Posted on: 2007-11-24
Degree: Master
Type: Thesis
Country: China
Candidate: J X Hu
GTID: 2178360185954172
Subject: Computer software and theory
Abstract/Summary:
The growth of networks has changed daily life drastically. Most people now choose e-mail, BBS, chat rooms, IM (Instant Messaging) and SMS (Short Message Service) as their primary means of communication, rather than traditional letters or the telephone. By virtue of these new means, people can share and exchange information quickly, even in a near real-time environment. The message text arising from human interaction has become predominant in Internet information flow. Such data not only conveys public messages but also carries massive amounts of personal information, and has thus become an important information source. This thesis aims at content-based message text mining through clustering analysis, for novel applications such as dynamic topic identification and on-line community discovery.

Message text is radically distinct from plain text and static web pages due to its dynamic and informal nature. The large amount of NIL (Network Informal Language) in message text makes textual feature (so-called term) extraction very difficult. To deal with this problem, we discover words and phrases that occur frequently (referred to as frequent patterns in text) so as to effectively identify terms in message text. We summarize several frequent pattern discovery algorithms, implement them, and evaluate their performance on real data sets. Experimental results show that our implementation meets the needs of practical applications.

Frequent patterns that have stable structure, complete semantics and adequate circulation are called significant frequent patterns (SFPs). Compared with single words, SFPs keep more useful semantic information, such as word order and adjacency, express more specific meanings, and are thus more appropriate to serve as textual features. We propose a novel term extraction method based on frequent patterns, which can extract meaningful terms from text. The proposed method is language-independent and can be applied to Chinese text without word segmentation.
Furthermore, we propose an unsupervised feature selection method based on SFPs, which can remarkably reduce the dimension without hurting the performance of classification and clustering. Experimental results demonstrate the approach's effectiveness: in classification it achieves performance comparable to, or even better than, state-of-the-art supervised feature selection algorithms such as IG (information gain) and CHI (chi-square), and it performs well on dimension reduction for clustering.

Frequent pattern based text clustering algorithms have several advantages over their traditional counterparts, namely lower dimension, better clustering quality and understandable cluster labels. We verify the effectiveness of SFPs for clustering on message corpora; the experimental results show that SFPs can improve the quality of traditional text clustering algorithms and reduce the dimension remarkably. Moreover, the frequent patterns can aid cluster interpretation.
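
As a rough illustration of mining frequent patterns directly from unsegmented text, the following minimal Python sketch counts contiguous character n-grams by the number of messages they occur in, then keeps only patterns that are not contained in a longer pattern with the same support. This is not the thesis's algorithm: the function names, the thresholds, the sample messages and the "same support" filter (used here as a crude stand-in for the "complete semantics" criterion) are all assumptions made for illustration.

```python
from collections import Counter


def frequent_ngrams(messages, min_support=2, max_len=4):
    """Return contiguous character n-grams (n = 2..max_len) that occur in
    at least `min_support` messages. Hypothetical helper, not from the thesis."""
    support = Counter()
    for text in messages:
        seen = set()  # count each pattern at most once per message
        for n in range(2, max_len + 1):
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                # Skip n-grams that span whitespace.
                if not any(ch.isspace() for ch in gram):
                    seen.add(gram)
        support.update(seen)
    return {g: c for g, c in support.items() if c >= min_support}


def closed_patterns(patterns):
    """Drop a pattern when a longer pattern contains it with the same
    support (assumed proxy for keeping only 'complete' patterns)."""
    return {
        g: c
        for g, c in patterns.items()
        if not any(g != h and g in h and c == patterns[h] for h in patterns)
    }


# Made-up sample messages; no word segmentation is applied.
messages = ["今晚看电影吗", "看电影去不去", "明天看电影"]
frequent = frequent_ngrams(messages, min_support=3, max_len=3)
print(closed_patterns(frequent))  # prints {'看电影': 3}
```

In this toy run the fragments "看电" and "电影" are both frequent, but each is absorbed by the longer pattern "看电影" ("watch a movie"), which has the same support. The surviving pattern could then serve as a single feature for clustering in place of its characters.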
Keywords/Search Tags:Message Text, Text Clustering, Frequent Pattern, Term Extraction, Feature Selection