Research On Short Text Clustering Techniques And The Applications On Emails

Posted on:2012-09-18

Degree:Master

Type:Thesis

Country:China

Candidate:J B Dai

GTID:2248330395958257

Subject:Computer software and theory

Abstract/Summary:

The volume of data is growing exponentially, which poses great challenges for data processing. Statistics show that about 70% of the information on the network is in the form of text. Such information is often disorganized, and clustering techniques can help, to a certain extent, to reclassify and sort it. Text clustering is an important area of data mining. It rests on the assumption that texts in the same cluster are similar, whereas texts in different clusters are generally dissimilar. Moreover, e-mails and microblog posts are information-rich and updated quickly, and most of their content takes the form of short text. How to cluster short texts precisely and efficiently has therefore become a major challenge.

Short texts generally contain less information than ordinary texts. Most of the words in a short text do not represent its characteristics, so traditional natural language processing techniques handle short texts poorly. As a result, how to capture the features of short texts and use them for clustering has received increasing attention recently.

This thesis studies the short text clustering problem and proposes clustering methods for short texts. First, it proposes a short text standardization method that constructs a feature word set to reduce the high dimensionality of short texts. Second, it proposes the EJaccard similarity to measure the internal cohesion of short texts. Third, it improves the k-means clustering algorithm: a simple hierarchical clustering algorithm is run first to remove the dependence of k-means on its initial information, and the hierarchical clustering is itself improved so that a given threshold automatically controls the number of clusters. Finally, the thesis proposes a local clustering algorithm for emails, which effectively addresses the problem of partitioning the different concepts within a short text. It also proposes a global clustering algorithm, which allows an email involving two or more concepts to be assigned to different groups.

The thesis uses emails as a representative data set. Experiments on this real data set demonstrate the effectiveness of the proposed local and global clustering methods. The results show that the proposed methods effectively increase the local and global diversity of the clustering results, thereby improving the quality of short text clustering.
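The similarity measure and the threshold-controlled hierarchical step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define EJaccard, so the code below uses the plain Jaccard similarity as a stand-in. The average-linkage merging rule, the function names `jaccard` and `threshold_clusters`, and the sample data are all assumptions made for illustration, not the thesis's actual method.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Plain Jaccard similarity |A & B| / |A | B| (stand-in for EJaccard)."""
    union = a | b
    if not union:
        return 0.0  # two empty feature sets: treat as dissimilar
    return len(a & b) / len(union)


def threshold_clusters(docs: list[set[str]], threshold: float) -> list[list[int]]:
    """Agglomerative clustering that stops once no pair of clusters is
    at least `threshold` similar, so the threshold (not a preset k)
    determines the number of clusters. Average linkage is an assumption."""
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1: list[int], c2: list[int]) -> float:
        # Mean pairwise Jaccard similarity between two clusters.
        total = sum(jaccard(docs[i], docs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = linkage(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical feature word sets extracted from four short emails.
emails = [
    {"meeting", "agenda", "monday"},
    {"meeting", "agenda", "tuesday"},
    {"invoice", "payment", "due"},
    {"invoice", "payment", "overdue"},
]
print(threshold_clusters(emails, threshold=0.4))  # [[0, 1], [2, 3]]
```

In a pipeline like the one the abstract describes, the groups found this way could supply the number of clusters and the starting centers for a subsequent k-means pass, which is one way to avoid choosing them by hand.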

Keywords/Search Tags:

short text clustering, data mining, similarity, k-means clustering, emails

Related items

1	Design And Implementation Of Distributed Text Clustering System Based On K-means
2	Social Media Short Text Clustering And Its Applications
3	Text Mining Based On Clustering Algorithm
4	Research And Its Application Of Web Short Text Clustering Method Based On K-Means Algorithm
5	Semi-supervised K-means Clustering Algorithm In Data Mining
6	Research On Fuzzy Clustering Analysis In Data Mining
7	Text Clustering Based On K-means Algorithm And Realization
8	Study On Similarity-based Text Clustering Algorithm And Its Application
9	The Research And Application Of Text Clustering Based On Improved K-means Algorithm
10	Clustering Algorithm Research Of Short Text Based On Semantic Similarity