Research On Text Clustering Algorithm Based On Cloud Computing

Posted on: 2015-01-10
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Feng
GTID: 2268330425987605
Subject: Computer applications and technology
Abstract/Summary:
With the rapid development of the Internet and communication networks, web text is becoming the main carrier of information and an indispensable primary source of information in people's lives. On the one hand, with the arrival of the Web 2.0 era, the network produces large amounts of text data every day, far faster than people can make use of it. How to extract valuable information and knowledge from these ever-growing text resources has become a major problem. On the other hand, owing to the hardware and software bottlenecks of ordinary personal computers, such massive, multi-source, heterogeneous, noisy and time-sensitive data cannot be processed and analyzed within an acceptable time to yield the knowledge that decision-makers require. The cloud computing model makes it possible to share high-performance computing, software, hardware and service resources, and it has become one of the hot research fields in information technology. Consequently, clustering algorithms for massive text sets on distributed platforms have become a hot topic in data mining. In this paper, we first design and implement a distributed K-means algorithm based on HIVE, and find that the distributed implementation improves the speedup of K-means, an algorithm studied in many recent research papers.
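To make the idea of distributing K-means concrete, the following is a minimal single-machine sketch in Python of how one iteration splits into a map step (assign each point to its nearest centroid) and a reduce step (average the points of each cluster). It is not the HIVE implementation described in the abstract, whose details are not given here; the function names `assign_point`, `recompute_centroids` and `kmeans` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import dist


def assign_point(point, centroids):
    """Map step: return (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def recompute_centroids(pairs, centroids):
    """Reduce step: average the points grouped under each centroid index."""
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    updated = list(centroids)  # a centroid with no points keeps its position
    for index, members in groups.items():
        updated[index] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return updated


def kmeans(points, centroids, max_iterations=20):
    """Repeat map and reduce until the centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        pairs = [assign_point(p, centroids) for p in points]
        updated = recompute_centroids(pairs, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

In a distributed setting the map step can run independently on each partition of the data, which is where the speedup is expected to come from; only the per-cluster sums and counts need to be combined in the reduce step.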
Then we propose a CURE clustering algorithm based on Hadoop, an open-source distributed system architecture from the Apache Software Foundation that is modeled on Google's published MapReduce and file system designs. The experiment is divided into four parts, each run on the distributed computing platform: calculating the parameters, calculating the TF-IDF values, calculating the cosine distance between texts, and running the clustering algorithm itself. We then compare the results obtained from data sets of different sizes running on different numbers of slave nodes, and find that the algorithm scales relatively well and is better suited to large data sets. After these two experiments, the paper compares the CURE results with those of the HIVE-based K-means algorithm and finds that for smaller data sets there is little difference between the two algorithms, but for large data sets the data scalability of CURE is significantly better than that of the HIVE-based K-means algorithm. We therefore find that the former is more applicable to large distributed text sets. In summary, the evaluation results obtained from experiments on the UCI data sets indicate that the CURE algorithm on a distributed computing platform holds good promise for clustering massive data.
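The TF-IDF and cosine-distance stages mentioned above can be illustrated with a short Python sketch. It assumes documents are already tokenized into lists of terms and uses the common weighting tf × log(N / df). The abstract does not state which TF-IDF variant the thesis uses, so this formula and the names `tfidf_vectors` and `cosine_distance` are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)
```

Two documents with identical term distributions get a distance close to 0, while documents sharing no terms get a distance of 1. Note that under this weighting a term that occurs in every document receives weight 0 and so does not influence the distance.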
Keywords/Search Tags: cloud computing, CURE cluster, large text sets, distributed