Research On Distributed Text Clustering Based On Frequent Item Set

Posted on: 2016-03-27 | Degree: Master | Type: Thesis
Country: China | Candidate: G J Lin | Full Text: PDF
GTID: 2298330467992885 | Subject: Computer Science and Technology
Abstract/Summary:
Text clustering, a significant field in natural language processing, is a key technology for processing and organizing massive text data. In the era of big data, however, the sheer volume of data poses great challenges to both the running time and the accuracy of text clustering. This thesis focuses on the speed and precision of text clustering, combining genetic algorithms, feedback and distributed computing. The main work of this thesis is as follows.

First, the thesis introduces the key technologies and algorithms in text clustering. In particular, it describes text preprocessing, including word segmentation, feature selection and feature representation, as well as traditional text clustering algorithms, and it emphasizes the genetic algorithm in machine learning, including its application to text clustering.

Second, we propose a distributed text clustering model based on frequent item sets and correlation analysis in a cloud computing environment. The model improves the k-means clustering algorithm by introducing parallel evolution based on frequent item sets, which increases the accuracy of feature selection. It further strengthens the clustering by using correlation analysis to confirm the clusters of the training sample data. Because the text data is massive and the algorithm is parallel by nature, we also recast the model in the MapReduce paradigm.

Finally, we implement the text clustering system described above on Hadoop, the open-source cloud computing framework. Experiments show that the proposed model performs well.
Keywords/Search Tags:Text clustering, Frequent item set, Correlation analysis, Hadoop
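
The abstract does not give the algorithm's details, so the following is only a minimal single-machine sketch of the general idea of mining frequent term sets in a map/reduce style. Everything in it is an assumption made for illustration: the function names `map_phase`, `reduce_phase` and `frequent_itemsets`, the `min_support` and `max_size` parameters, and the toy documents are hypothetical. It enumerates candidate term sets by brute force rather than with Apriori-style pruning, and it does not reproduce the thesis's parallel evolution, correlation analysis or Hadoop implementation.

```python
from collections import Counter
from itertools import combinations


def map_phase(doc_tokens, max_size=2):
    """Map step: emit (itemset, 1) for every term set of up to max_size
    distinct terms found in one document."""
    terms = sorted(set(doc_tokens))  # de-duplicate; sort for stable itemset keys
    for size in range(1, max_size + 1):
        for itemset in combinations(terms, size):
            yield itemset, 1


def reduce_phase(pairs):
    """Reduce step: sum the emitted counts per itemset (document frequency)."""
    counts = Counter()
    for itemset, n in pairs:
        counts[itemset] += n
    return counts


def frequent_itemsets(docs, min_support=2, max_size=2):
    """Keep itemsets that occur in at least min_support documents."""
    pairs = (pair for doc in docs for pair in map_phase(doc, max_size))
    counts = reduce_phase(pairs)
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical, already-tokenized documents.
docs = [
    ["hadoop", "cluster", "text"],
    ["text", "cluster", "kmeans"],
    ["hadoop", "mapreduce"],
]
result = frequent_itemsets(docs, min_support=2)
```

With these toy documents, `result` keeps the single terms `cluster`, `hadoop` and `text` and the pair `("cluster", "text")`, each appearing in two documents. In a real MapReduce job the map step would run per input split and the reduce step per key; the terms in the surviving item sets could then serve as candidate features for a clustering step such as k-means.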