Research On Distributed Text Clustering Based On Frequent Item Set

Posted on:2016-03-27

Degree:Master

Type:Thesis

Country:China

Candidate:G J Lin

Full Text:PDF

GTID:2298330467992885

Subject:Computer Science and Technology

Abstract/Summary:

Text clustering, a significant field in natural language processing, is a key technology for processing and organizing massive text data. In the era of big data, however, the sheer volume of data poses great challenges to both the speed and the accuracy of text clustering. This thesis focuses on the speed and precision of text clustering, combining a genetic algorithm, feedback, and distributed computing. The main work of this thesis is as follows:

First, this thesis introduces the key technologies and algorithms in text clustering. In particular, it describes text preprocessing, including word segmentation, feature selection, and feature representation, as well as traditional text clustering algorithms, and it emphasizes the genetic algorithm in machine learning, including its application to text clustering.

Second, we propose a distributed text clustering model based on frequent item sets and correlation analysis in a cloud computing environment. The model improves the k-means clustering algorithm by proposing parallel evolution based on frequent item sets, which increases the accuracy of feature selection. Furthermore, it strengthens clustering by applying correlation analysis to confirm the clusters of the training sample data. Because the text data is massive and the algorithm is inherently parallel, we also recast the model in the MapReduce paradigm.

Finally, we use the open source cloud computing framework Hadoop to implement the text clustering system described above. Experiments show that the proposed model performs well.
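The abstract does not give the algorithm's details, so the following is only a minimal single-machine sketch of the general idea, not the thesis's actual method. It assumes that frequent terms (terms appearing in at least `min_support` documents) serve as the selected features, and that one k-means iteration is split into a map step (assign each document to its nearest centroid) and a reduce step (average the vectors in each cluster). All function names, the binary vector representation, and the support threshold are hypothetical; the genetic algorithm, correlation analysis, and Hadoop integration are omitted.

```python
from collections import Counter, defaultdict
from itertools import combinations


def frequent_itemsets(docs, min_support):
    """Apriori-style pass: frequent single terms, then frequent term pairs.

    Support is counted as the number of documents containing the item set.
    """
    doc_sets = [set(d) for d in docs]
    single = Counter(t for terms in doc_sets for t in terms)
    frequent_terms = {t for t, c in single.items() if c >= min_support}
    pairs = Counter()
    for terms in doc_sets:
        # Only frequent terms can form a frequent pair (Apriori property).
        kept = sorted(terms & frequent_terms)
        pairs.update(combinations(kept, 2))
    frequent_pairs = {p for p, c in pairs.items() if c >= min_support}
    return frequent_terms, frequent_pairs


def vectorize(doc, vocabulary):
    """Binary feature vector over the selected (frequent) terms."""
    terms = set(doc)
    return [1.0 if t in terms else 0.0 for t in vocabulary]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(vector, centroids):
    """Map step: emit (index of nearest centroid, vector)."""
    best = min(range(len(centroids)), key=lambda i: sq_dist(vector, centroids[i]))
    return best, vector


def reduce_centroid(vectors):
    """Reduce step: the new centroid is the mean of the assigned vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans_iteration(vectors, centroids):
    """One k-means iteration expressed as map, group by key, reduce."""
    groups = defaultdict(list)
    for v in vectors:
        key, value = map_assign(v, centroids)
        groups[key].append(value)
    # An empty cluster keeps its previous centroid.
    return [
        reduce_centroid(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]


docs = [
    ["hadoop", "cluster", "text"],
    ["hadoop", "cluster"],
    ["genetic", "algorithm"],
    ["genetic", "algorithm", "text"],
]
terms, pairs = frequent_itemsets(docs, min_support=2)
vocabulary = sorted(terms)
vectors = [vectorize(d, vocabulary) for d in docs]
centroids = kmeans_iteration(vectors, [vectors[0], vectors[2]])
print(sorted(pairs))
print(centroids)
```

In a real MapReduce job the grouping by cluster index would be done by the framework's shuffle phase, and the iteration would be repeated until the centroids stop changing.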

Keywords/Search Tags:

Text clustering, Frequent item set, Correlation analysis, Hadoop

Related items

1	Text Clustering Method Based On Frequent Itemsets
2	Research On Frequent Item Mining And Correlation Analysis In Data Streams
3	Frequent item-based text clustering
4	Search Results Clustering Method Based On Maximal Frequent Itemsets
5	Research Of Frequent Item Data Mining Algorithm Based On Hadoop
6	Research And Improvement The Algorithm Of Mining Frequent Item Sets In Text Association Analysis
7	Research On Mining Algorithms Of Maximal Frequent Item Sets
8	Improvement Of Frequent 1-Item Set Generation Method And Experimental Study
9	Message Text Clustering Based On Frequent Patterns
10	Mining Of Maximal Frequent Item Sets Based On AFOPT