Research on Several Technologies of Text Clustering Methods and Their Applications

Posted on: 2014-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: S M Shen
Full Text: PDF
GTID: 2298330431489400
Subject: Software engineering
Abstract/Summary:
Text clustering is one of the most important research fields in data mining and information retrieval. With the rapid growth of the Internet and the explosion of text available on it, people urgently need to obtain more useful information from the Internet and to obtain it more efficiently. To address the problem of how to acquire useful information from a large-scale corpus and how to use that information efficiently, this study focuses on several key technologies and applications of text clustering: the selection of initial centroids for k-means, text clustering for topic detection, and the implementation of the proposed algorithm in MapReduce. First, an original algorithm based on the small-world network is proposed to optimize the selection of initial centroids for the k-means algorithm. It addresses the problem that the clustering result of the traditional k-means algorithm is unstable because the initial cluster centers are generated randomly. The experiments show that the proposed method obtains stable results efficiently. Second, to address the shortcomings of using text clustering to detect popular topics, a novel text clustering method for topic detection is proposed. This method can identify noise texts that may negatively affect the detection of popular topics, which helps improve the performance of text clustering for popular-topic detection. Despite its weakness in terms of F-measure, the proposed algorithm outperforms bisecting k-means in terms of overall similarity and average cluster similarity. Finally, because massive datasets challenge both the feasibility and the efficiency of existing text clustering algorithms, a parallel network decomposition text clustering algorithm based on MapReduce is proposed, which can work effectively on big data.
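
The instability that motivates the first contribution can be illustrated with a minimal sketch. It is not the thesis's small-world-network method; it is plain k-means (Lloyd's algorithm) run on a hypothetical four-point dataset, with the function names `squared_distance` and `kmeans` chosen here for illustration only. The same data converge to two different clusterings, with very different within-cluster error, depending solely on the initial centroids.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm starting from the given initial centroids.

    Returns the final centroids and the within-cluster sum of squared errors.
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sse = sum(squared_distance(p, centroids[i])
              for i, cluster in enumerate(clusters) for p in cluster)
    return centroids, sse


# Hypothetical toy data: the corners of a 4 x 1 rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Initialization A splits the data into left and right clusters.
centroids_a, sse_a = kmeans(points, [(0, 0.5), (4, 0.5)])

# Initialization B gets stuck in a bottom/top split, a poor local optimum.
centroids_b, sse_b = kmeans(points, [(2, 0), (2, 1)])

print(sse_a, sse_b)
```

Both runs converge immediately, yet the second one ends with a much larger error, which is why a principled choice of initial centroids matters more than additional iterations.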
Keywords/Search Tags:Clustering Algorithm, Text Data, K-Means, Topic Detection, MapReduce, Big Data