
Research And Application On Short Message Text Clustering

Posted on: 2012-05-21
Degree: Master
Type: Thesis
Country: China
Candidate: C Q Fan
GTID: 2218330368477483
Subject: Computer application technology
Abstract/Summary:
With the development of Internet-based communication technology and the accelerating pace of modern life, instant interactive communication tools such as mobile phones, forums, and Twitter have become widely used and consequently produce massive amounts of short message text. Analyzing these short texts is significant for extracting hot-topic information, tracking public opinion, understanding information, and recommending products. In conventional text clustering research, the objects being clustered are ordinary texts of moderate length, most of which are written in standard language and in which vocabulary is likely to recur. Words in texts belonging to the same cluster intersect or overlap to some extent, and the more content two texts share, the more likely they are to belong to the same cluster. However, owing to the distinctive language features of short text, its processing techniques differ from those used for natural language in traditional text. Because a single short text is very brief and offers few sample features, it is hard to extract effective language features. Because short message text is produced in real time and the volume of data keeps growing, processing it demands much higher efficiency than processing ordinary text. The compact expression, inaccurate spelling, nonstandard diction, and heavier noise of short text place higher demands on the preprocessing and feature extraction of short message text.
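The word-overlap intuition described above, and the reason it weakens for short messages, can be illustrated with a small sketch. The Jaccard measure and the example sentences are assumptions chosen for illustration; the thesis does not state which overlap measure it uses.

```python
def jaccard(a: str, b: str) -> float:
    """Share of distinct words that two texts have in common (0.0 to 1.0)."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# Two moderately long texts on the same topic share many words.
long_a = "the team won the match after a late goal in the second half"
long_b = "a late goal in the second half decided the match for the team"

# Two short messages on the same topic may share no words at all.
short_a = "gr8 goal"
short_b = "what a strike"

long_overlap = jaccard(long_a, long_b)
short_overlap = jaccard(short_a, short_b)
```

Here the two longer texts overlap strongly while the two short messages do not overlap at all, even though both pairs discuss the same event. That is the sparsity problem the abstract attributes to short text.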
All of these factors pose significant challenges, and open research questions, for the clustering of short message text. Against the background of short message text extraction and existing research on related clustering techniques, this thesis carries out a series of studies covering the collection, preprocessing, feature extraction, and similarity measurement of short message text, as well as clustering algorithms for it. Because short message text is dynamic, interactive, nonstandard, and massive, its clustering must meet requirements in three respects: the validity of the clustering, the time complexity of the clustering algorithm, and the intelligibility of the clustering results. Aiming at these requirements, and in particular at improving clustering validity and the time complexity of the clustering algorithm, this thesis conducts related investigations and, based on the experimental results, implements a prototype system for short message text. The primary content of this thesis covers the following aspects: This thesis studies the theories and techniques of text clustering broadly and in depth; in particular, it expounds and compares three aspects, namely text representation models, text clustering algorithms, and clustering evaluation metrics, and discusses at length the current state of research, the theoretical foundations, and the techniques. The data sources and characteristics of short message text are summarized. Preprocessing techniques for short message text, including Chinese word segmentation and feature extraction and selection, are also studied and described to some extent. Following the processing flow of text clustering based on the vector space model (VSM), the vector space model is adopted to represent short message texts as vectors.
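A vector space representation of this kind can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the smoothed TF-IDF weighting, the cosine similarity measure, and the sample messages are all assumptions, and the input is assumed to be already tokenized (for Chinese, by a word segmentation step upstream).

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse {term: weight} vectors."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF keeps weights positive for very common terms.
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical pre-tokenized short messages.
docs = [
    ["flight", "delayed", "again"],
    ["flight", "delayed", "two", "hours"],
    ["great", "pizza", "tonight"],
]
vecs = tfidf_vectors(docs)
same_topic = cosine(vecs[0], vecs[1])
different_topic = cosine(vecs[0], vecs[2])
```

The first two messages share terms and so receive a positive similarity, while the third shares none with the first and scores zero. A clustering algorithm would then operate on these vectors or on their pairwise similarities.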
The widely used K-Means algorithm was adopted to cluster the short message data, and the results were analyzed and evaluated. Suffix tree clustering (STC), which has achieved good results in English text clustering, was applied to Chinese text clustering and improved to suit the characteristics of short message text, taking into account issues such as feature representation and extraction and the clustering algorithm in Chinese text clustering. Through comparative experiments on the same short message data, this thesis reaches the following conclusion: in short message text clustering, STC outperformed the K-Means algorithm in both clustering validity and time complexity. Based on the experimental results and the project requirements, a prototype clustering system for short message text has been designed and implemented. The system can crawl short message text from the web, cluster the short message data, and discover hot topics.
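The core idea behind STC, grouping documents that share phrases and then merging groups whose members largely coincide, can be sketched as below. This is a simplified illustration and not the improved algorithm of the thesis: it enumerates word n-grams directly instead of building a suffix tree, omits the scoring of base clusters, and the names `base_clusters` and `merge_clusters`, the 0.5 overlap threshold, and the sample messages are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def base_clusters(docs, max_len=3, min_docs=2):
    """Map each shared phrase (word n-gram) to the ids of documents containing it."""
    phrase_docs = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                phrase_docs[tuple(tokens[i:i + n])].add(doc_id)
    # Keep only phrases that occur in at least min_docs documents.
    return {p: ids for p, ids in phrase_docs.items() if len(ids) >= min_docs}


def merge_clusters(base, threshold=0.5):
    """Merge base clusters whose document sets mostly overlap (union-find)."""
    phrases = list(base)
    parent = list(range(len(phrases)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(phrases)), 2):
        a, b = base[phrases[i]], base[phrases[j]]
        shared = len(a & b)
        # Link two base clusters only if the overlap is large on both sides.
        if shared / len(a) > threshold and shared / len(b) > threshold:
            parent[find(i)] = find(j)

    merged = defaultdict(set)
    for i, phrase in enumerate(phrases):
        merged[find(i)] |= base[phrase]
    return sorted(merged.values(), key=lambda ids: (-len(ids), min(ids)))


# Hypothetical pre-tokenized short messages.
docs = [
    ["train", "delayed", "again"],
    ["train", "delayed", "by", "snow"],
    ["new", "phone", "released"],
    ["new", "phone", "looks", "great"],
]
clusters = merge_clusters(base_clusters(docs))
```

With these four messages the sketch yields two clusters, one for the train delay and one for the new phone. Unlike K-Means, the number of clusters is not fixed in advance, and the shared phrases themselves can serve as readable cluster labels.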
Keywords/Search Tags: Short Message Text, Text Clustering, VSM, STC, K-Means