Font Size: a A A

Research On Parallel Clustering Algorithm Based On "90-10" Rule

Posted on:2009-07-02Degree:MasterType:Thesis
Country:ChinaCandidate:C P LiFull Text:PDF
GTID:2178360242991022Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the developing of the computing technology ,information data are more and more. How to extract useful information from large amounts of data becomes an urgent problem, The Data Mining Technology are brought under this condition, it is a very valued new area of research area, and it is a crossed subject that adopts theory and technology of database and artificial intelligent and machine learning and statistics and so on. Clustering analysis is one of the important areas in the data mining research. Especially, it is applicable in many fields, such as image processing, intrusion detection, bioinformatics and others.However, because of the complexity of the distribution of clustering object at high dimensional feature space, the uncertainty and the flexibility of the evaluation of clustering results and high computing complexity of cluster problem as an optimized problem, clustering algorithm still are confronted with lots of problems and challenges.a new parallel data preprocessing algorithm based on"90-10"rule is proposed in this paper. It make an improvement on the existing algorithm,This algorithm can reduce the scale of data and runtime. accounting for one tenth of it in the best situation. Presently the parallel hierarchical algorithms based on SIMD are not adaptive and can not process memory conflicts among different processors. To overcome this shortcomings and Inorder to use the result of data preprocessing better, a new parallel algorithm based on SIMD-EREW is proposed in this paper. Firstly, for the objective of decreasing the number of vertexes before the minimum spanning tree is constructed, the preprocess method based on 90-10 rule is improved, then it is used to preprocess input data points. The proposed algorithms can cluster n objects with O(p) processors in time, where 1≤p≤longn,0.1≤λ≤0.3. Performance comparisons show that it is the first parallel hierarchical clustering algorithm algorithms in the sense of that it is both adaptive and without memory conflicts, and thus it is an improved result over the past researches.To validate the performance and extensibility of our algorithm in practice, we made several experiments and simulations based on the synthetic and benchmark data sets from Internet on IBM P690 of high performance computing centre. The results showed that our algorithm has a great superiority in both computing time and space, and consequently a stronger adaptability and operability for large scale data sets.
Keywords/Search Tags:Data preprocessing, Hierarchical clustering, Memory conflicts, Parallel algorithms, EREW-SIMD
PDF Full Text Request
Related items