The volume of data is growing exponentially, which poses great challenges for data processing. Statistics show that about 70% of online information is presented as text. Such information is often messy, and clustering techniques can help reclassify and organize it. Text clustering is an important area of data mining. It rests on the assumption that texts in the same cluster are similar, whereas texts in different clusters are generally dissimilar. In addition, emails and microblog posts are information-rich and updated quickly, and most of their content takes the form of short text. Clustering short texts precisely and efficiently has therefore become a major challenge.

Short texts generally contain less information than ordinary texts. Most of the words in a short text do not represent its characteristics, so traditional natural language processing techniques handle short texts poorly. As a result, how to capture the features of short texts and use them for clustering has received increasing attention.

This thesis studies the short text clustering problem and proposes clustering methods for short texts. First, it proposes a short text standardization method that constructs a feature word set to reduce the high dimensionality of short texts. Second, it proposes the EJaccard similarity to measure the internal cohesion of short texts. Third, it improves the k-means clustering algorithm: a simple hierarchical clustering algorithm is run first to address k-means' dependence on initial information, and the hierarchical clustering is itself improved by using a given threshold to control the number of clusters automatically. Finally, the thesis proposes local clustering algorithms for emails, which effectively address the problem of partitioning the different concepts that occur in short texts.
Also, the thesis proposes a global clustering algorithm that allows an email involving two or more concepts to be assigned to different groups.

The thesis uses emails as a representative data set. Experiments on this real data set demonstrate the effectiveness of the proposed local and global clustering methods. The results show that the proposed methods effectively increase the local and global diversity of the clustering results, thereby improving the quality of short text clustering.
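
The idea of a threshold controlling the number of clusters can be illustrated with a minimal sketch. This is not the thesis's algorithm: the EJaccard similarity is not defined in this abstract, so the standard Jaccard coefficient on word sets stands in for it, and single-link merging is an assumed linkage rule. The function names `jaccard` and `threshold_clustering`, the sample word sets, and the threshold value are all hypothetical.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Standard Jaccard similarity |a & b| / |a | b| (a stand-in, not EJaccard)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def threshold_clustering(docs: List[Set[str]], threshold: float) -> List[List[int]]:
    """Single-link agglomerative clustering over word sets.

    Merging stops once no pair of clusters reaches `threshold`, so the
    number of clusters follows from the threshold rather than being
    fixed in advance.
    """
    clusters = [[i] for i in range(len(docs))]

    def link(c1: List[int], c2: List[int]) -> float:
        # Single link: similarity of the closest pair across two clusters.
        return max(jaccard(docs[i], docs[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = link(clusters[x], clusters[y])
                if sim > best_sim:
                    best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break  # no remaining pair is similar enough to merge
        x, y = best_pair
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters


# Hypothetical feature word sets for four short emails.
emails = [
    {"meeting", "schedule", "monday"},
    {"meeting", "schedule", "friday"},
    {"invoice", "payment", "due"},
    {"invoice", "payment", "overdue"},
]
print(threshold_clustering(emails, threshold=0.4))
```

With this sample data, a threshold of 0.4 groups the two meeting emails together and the two invoice emails together, while a stricter threshold such as 0.6 leaves every email in its own cluster. The resulting groups could then serve as starting clusters for k-means, which is one way to read the dependence on initial information mentioned above.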