Research On Clustering Analysis Algorithm And Implementation In Data-intensive Computing Environments

Posted on:2016-02-12

Degree:Master

Type:Thesis

Country:China

Candidate:S S Zhang

Full Text:PDF

GTID:2308330464453340

Subject:Computer Science and Technology

Abstract/Summary:

Cluster analysis still falls short in accuracy and completeness (or categoricalness), and no single algorithm is applicable and effective in all situations. In high-performance computing, the main challenge comes from big data sets, that is, data sets in the data-intensive computing environment. Such data sets are typically massive, rapidly changing, distributed, heterogeneous, and semi-structured or unstructured. Traditional data mining algorithms can no longer meet the processing requirements of such data, and this has gradually become a bottleneck in data processing technology.

After studying the entropy-based fuzzy clustering algorithm (EFC) and center-based clustering algorithms, the thesis proposes an improved entropy-based central clustering algorithm (IECC). IECC first uses EFC to acquire clearly distinct cluster centers from the original data set, then performs cluster analysis again on the basis of the acquired cluster centers, and finally reassigns each point to the set represented by a center according to the distance between the point and each center. The improved algorithm not only yields concise, clearly distinct clustering results but also effectively improves their accuracy.

To meet the data processing requirements of the data-intensive computing environment, the thesis also puts forward a feasible scheme for implementing IECC on the Hadoop distributed platform. The implementation comprises three stages: Map, Combine, and Reduce. The Map stage runs on the local nodes and obtains clearly distinct cluster centers of the original data set, together with the corresponding outliers, which serve as the representative points of each node. In the Combine stage, information about the cluster centers and outliers obtained on the local nodes is transmitted to the primary nodes, where identical cluster centers are merged. Finally, in the Reduce stage, IECC is executed on the primary nodes over the data produced by the Combine stage, yielding the final clustering results. Given the growth of data-intensive computing and its distinctive characteristics, the thesis recommends implementing the proposed algorithm in such environments as a way to address their data analysis and mining problems.
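The three-stage flow described in the abstract can be illustrated with a small single-process sketch in Python. It is not the thesis's implementation and does not use Hadoop. The function names (`efc_centers`, `assign`, `iecc_distributed`), the similarity form `exp(-alpha * distance)`, the choice of `alpha`, the threshold `beta`, and the nearest-center reassignment rule are all assumptions made for illustration. They follow a commonly described form of entropy-based fuzzy clustering, which the abstract itself does not spell out.

```python
import math


def _h(s):
    """Binary entropy term of a similarity value s in [0, 1]."""
    if s <= 0.0 or s >= 1.0:
        return 0.0
    return -(s * math.log2(s) + (1.0 - s) * math.log2(1.0 - s))


def efc_centers(points, beta=0.7):
    """Assumed EFC-style selection: repeatedly take the minimum-entropy
    point as a center and drop every point whose similarity to it is at
    least beta. Isolated points (outliers) end up as their own centers."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    pairs = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    mean_d = sum(pairs) / len(pairs) if pairs else 0.0
    # Assumption: similarity equals 0.5 at the mean pairwise distance.
    alpha = math.log(2) / mean_d if mean_d > 0 else 1.0
    sim = [[math.exp(-alpha * d) for d in row] for row in dist]
    entropy = [sum(_h(sim[i][j]) for j in range(n) if j != i) for i in range(n)]

    remaining = set(range(n))
    centers = []
    while remaining:
        c = min(sorted(remaining), key=lambda i: entropy[i])
        centers.append(points[c])
        # sim[c][c] == 1.0, so the chosen center is always removed as well.
        remaining -= {j for j in remaining if sim[c][j] >= beta}
    return centers


def assign(points, centers):
    """Assumed reassignment rule: index of the nearest center per point."""
    return [
        min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
        for p in points
    ]


def iecc_distributed(partitions, beta=0.7):
    """Single-process stand-in for the Map, Combine, and Reduce stages."""
    # Map: each partition (one "local node") finds its representatives.
    local = [efc_centers(part, beta) for part in partitions]
    # Combine: pool the representatives and merge identical ones.
    pooled = sorted({tuple(c) for reps in local for c in reps})
    # Reduce: cluster the pooled representatives, then reassign all points.
    centers = efc_centers(pooled, beta)
    all_points = [p for part in partitions for p in part]
    return centers, assign(all_points, centers)


if __name__ == "__main__":
    parts = [
        [(0, 0), (0, 1), (1, 0), (10, 10)],
        [(10, 11), (11, 10), (0.5, 0.5), (10.5, 10.5)],
    ]
    centers, labels = iecc_distributed(parts)
    print(centers)
    print(labels)
```

Only the representatives, not the raw points, cross the partition boundary in this sketch, which mirrors the abstract's point that the Combine stage transmits cluster centers and outliers to the primary nodes. In a real Hadoop job the final reassignment would itself need to be distributed, a detail this sketch leaves out.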

Keywords/Search Tags:

Data mining, Cluster analysis, EFC algorithm, IECC algorithm, Data-intensive computing

Related items

1	Research On Algorithm Of Outlier Mining In Data-intensive Computing Environments
2	Research On Optimization Of Map Reduce For Interactive Analysis On Big Data
3	Research On Data Placement Strategy For Data-intensive Applications In Cloud
4	Data Mining Technology And Its Application In The Supermarket In Crm
5	Research On Replica Optimization Strategy In Data-intensive Computing
6	Design Of Energy-efficient Reconfigurable System Architectures For Data-intensive Computing
7	The Application Of Cluster Analysis Algorithm In HMIS
8	Research On Parallelization Of Data Mining Algorithm Based On Distributed Platforms Spark And YARN
9	Research And Application Of Apriori Algorithm Based On Cluster And Compression Matrix
10	Design And Development Of Data-intensive Computing Oriented Ship Emergency Response System