Research On Text Clustering Algorithm Based On Cloud Computing

Posted on:2015-01-10

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Feng

Full Text:PDF

GTID:2268330425987605

Subject:Computer applications and technology

Abstract/Summary:

With the rapid development of the Internet and communication networks, web text is becoming the main carrier of information and an indispensable primary source of information in people’s lives. On the one hand, with the arrival of the Web 2.0 era, the network produces large amounts of text data every day, far faster than people can make use of it. How to extract valuable information and knowledge from these ever-growing text resources has become a major problem to be solved. On the other hand, owing to the hardware and software bottlenecks of ordinary personal computers, we cannot process and analyze such massive, multi-source, heterogeneous, noisy, and highly constrained data within an acceptable time and obtain the knowledge that decision-makers require. The emergence of the cloud computing model allows high-performance computing, software, hardware, and service resources to be shared, and cloud computing has become one of the hot research fields in information technology. As a result, clustering algorithms for massive text sets on distributed platforms have become a hot topic in data mining.

In this thesis, we first design and implement a distributed K-means algorithm based on Hive and find that distributing the computation improves the speedup of K-means, an algorithm that has been the subject of many recent research papers.
We then propose a CURE clustering algorithm built on the distributed system architecture Hadoop, an open-source Apache framework modeled on Google’s published MapReduce and Google File System designs. The experiment is divided into four parts, each run on the distributed computing platform: calculating the parameters, computing the TF-IDF values, computing the cosine distance between texts, and running the clustering algorithm itself. We then compare the results obtained from data sets of different sizes running on different numbers of slave nodes and find that the algorithm is relatively flexible and better suited to large data sets.

After conducting these two experiments, we compare the results of CURE with those of the Hive-based K-means. For smaller data sets there is little difference between the two algorithms, but for large data sets the data scalability of CURE is significantly better than that of the Hive-based K-means, so CURE is more applicable to large distributed text sets. In summary, the evaluation results from the experiments on the UCI data sets indicate that, for clustering massive data, the CURE algorithm on a distributed computing platform holds good promise.
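
The TF-IDF and cosine-distance stages named in the abstract can be illustrated with a minimal single-machine sketch. This is not the thesis's Hadoop implementation: the function names `tf_idf` and `cosine_distance`, the length-normalized term frequency, and the natural-log inverse document frequency are all assumptions made for illustration, since the abstract does not state which TF-IDF variant was used.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document.

    Assumed weighting: tf = count / document length, idf = ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc)
        vectors.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical toy corpus, already tokenized.
docs = [
    ["cloud", "computing", "cluster"],
    ["cloud", "computing", "hadoop"],
    ["text", "mining", "tfidf"],
]
vectors = tf_idf(docs)
print(cosine_distance(vectors[0], vectors[1]))  # documents sharing two terms
print(cosine_distance(vectors[0], vectors[2]))  # documents sharing no terms
```

In a distributed setting, the document-frequency count and the pairwise distances are the natural candidates for splitting across worker nodes, because each can be computed from independent partitions of the corpus and merged afterward.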

Keywords/Search Tags:

cloud computing, CURE cluster, large text sets, distributed

Related items

1	Key Mechanism Of Large Scale Evaluation Cloud Platform: Task Management And Monitoring
2	Large Scale Cloud Computing Cluster Monitoring System: Design And Implementation
3	Design And Implementation Of Cloud Monitoring System Based On Cluster Server
4	Research On Parallel Processing Technology Of Large-scale Text Mining Under Cloud Computing Environment
5	Large Data Sets Sample Selection Based On Map Reduce
6	Research On Performance Optimization Of Large Scale Elastic Resource In IaaS Cloud Computing
7	Research And Implementation Of Execution Optimization For Graph Computing With Application Resource Awareness In Cloud Environment
8	Cluster Based Large-scale Distributed Graph Processing System
9	Load Balancing Problems For Parallel And Distributed Computing
10	Research On Method And Implementation Of Monitor And Management For Cloud Computing Cluster Server System