Research On Clustering Analysis Algorithm And Implementation In Data-intensive Computing Environments

Posted on: 2016-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: S S Zhang
GTID: 2308330464453340
Subject: Computer Science and Technology
Abstract/Summary:
Cluster analysis still falls short in accuracy and completeness (or categoricalness), and no single algorithm is applicable to and effective in all settings. In high-performance computing, the main challenge comes from big data sets, that is, data sets in data-intensive computing environments. Such data sets are typically massive, rapidly changing, distributed, heterogeneous, and semi-structured or unstructured. Traditional data mining algorithms can no longer meet the processing requirements for such data, and this has gradually become a bottleneck in data processing technology.

After studying the entropy-based fuzzy clustering algorithm (EFC) and the center-based clustering algorithm, the thesis proposes an improved entropy-based central clustering algorithm (IECC). IECC first uses EFC to acquire clearly distinct cluster centers from the original data set. It then performs cluster analysis again on the basis of the acquired cluster centers, and finally redistributes each point to the collection represented by a center according to the distance between the point and each center. The improved algorithm not only yields concise, clearly distinct clustering results but also effectively improves their accuracy.

To meet the data processing requirements of data-intensive computing environments, the thesis also puts forward a feasible scheme for implementing IECC on the Hadoop distributed platform. The implementation comprises three stages: Map, Combine, and Reduce. The Map stage runs on the local nodes and obtains the clearly distinct cluster centers of the original data set together with the corresponding outliers, which serve as the representative points of each node. In the Combine stage, the information about the cluster centers and outliers obtained on the local nodes is transmitted to the primary nodes so that identical cluster centers can be combined. Finally, IECC is executed on the primary nodes over the data produced by the Combine stage, which yields the final clustering results. Given the development of data-intensive computing and its distinctive features, the thesis recommends implementing newly proposed algorithms in data-intensive computing environments, so that the data analysis and mining problems arising in those environments can be addressed.
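The abstract gives no formulas, so the following is only a minimal sketch of the described flow, not the thesis's implementation. It assumes the commonly described form of EFC (similarity exp(-alpha * distance), with alpha chosen so that similarity is 0.5 at the mean distance, and the lowest-entropy point picked as the next center). The threshold `beta`, the tolerance `tol`, and all function names are hypothetical. Outlier handling is omitted, and the Map, Combine, and Reduce stages are simulated with plain Python functions instead of Hadoop.

```python
import math

def efc_centers(points, beta=0.7):
    """Select cluster centers in the style of EFC (assumed formulation)."""
    n = len(points)
    if n <= 1:
        return list(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    mean_d = sum(d[i][j] for i in range(n) for j in range(i + 1, n)) / (n * (n - 1) / 2)
    if mean_d == 0:
        return [points[0]]  # all points coincide
    alpha = math.log(2) / mean_d  # similarity equals 0.5 at the mean distance
    sim = [[math.exp(-alpha * d[i][j]) for j in range(n)] for i in range(n)]

    def h(s):
        # Binary entropy term; defined as 0 at the endpoints.
        if s <= 0.0 or s >= 1.0:
            return 0.0
        return -(s * math.log2(s) + (1 - s) * math.log2(1 - s))

    entropy = [sum(h(sim[i][j]) for j in range(n) if j != i) for i in range(n)]
    remaining = set(range(n))
    centers = []
    while remaining:
        # The lowest-entropy remaining point becomes the next center.
        c = min(sorted(remaining), key=lambda i: entropy[i])
        centers.append(points[c])
        # Drop the center and every point whose similarity to it exceeds beta.
        remaining -= {j for j in remaining if j == c or sim[c][j] > beta}
    return centers

def assign(points, centers):
    """Redistribute each point to its nearest center (the final IECC step)."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]

def map_stage(partition, beta):
    """Per-node step: find local representative centers."""
    return efc_centers(partition, beta)

def combine_stage(local_centers, tol=1e-9):
    """Merge identical centers reported by different nodes."""
    merged = []
    for centers in local_centers:
        for c in centers:
            if all(math.dist(c, m) > tol for m in merged):
                merged.append(c)
    return merged

def iecc_distributed(partitions, beta=0.7):
    """Simulate Map, Combine, and the final run on the primary node."""
    local = [map_stage(part, beta) for part in partitions]
    representatives = combine_stage(local)
    final_centers = efc_centers(representatives, beta)
    all_points = [p for part in partitions for p in part]
    return final_centers, all_points, assign(all_points, final_centers)

# Two hypothetical node partitions, each holding points from two distant groups.
partitions = [
    [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0), (10.1, 10.0)],
    [(0.05, 0.05), (10.0, 10.1), (9.9, 10.0), (0.0, 0.2)],
]
centers, points, labels = iecc_distributed(partitions)
print(len(centers), labels)
```

Running the final selection on the merged representatives rather than on the raw points is what keeps the data sent to the primary nodes small, which appears to be the motivation for the staged design described above.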
Keywords/Search Tags: Data mining, Cluster analysis, EFC algorithm, IECC algorithm, Data-intensive computing