New Methods For Cluster Analysis In Distributed Environments

Posted on: 2007-11-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C A Li
GTID: 1118360212489537
Subject: Control Science and Engineering
Abstract/Summary:
With the rapid development of computer and memory technologies, interest in clustering theory and its applications in data mining is growing, driven by the wide availability of huge amounts of data and the pressing need to turn such data into useful information and knowledge. Cluster analysis rests on the intuitive idea that things of one kind come together: it divides data into groups of similar objects, and it is widely applied in many fields.

In recent years, databases have kept growing and have become physically or geographically distributed across more and more locations connected by computer networks. Most existing clustering algorithms, however, have difficulty extracting knowledge from huge amounts of distributed data, because they need to load all the data into main memory and incur a huge computational overhead. New knowledge-discovery methods are therefore needed for large-scale, distributed environments, and distributed clustering is one of them. Distributed clustering, the application of cluster analysis in distributed computing environments, is a challenging topic in data mining.

This dissertation explores new clustering techniques for distributed environments in order to provide theoretical and technical foundations for using large-scale, distributed data efficiently and appropriately. Several novel distributed clustering methods are proposed for clustering large-scale, distributed data sets, drawing on techniques from machine learning, artificial intelligence, and distributed computing. The main work and results of the dissertation are as follows:

1. Clustering methods in centralized and distributed environments are surveyed from three aspects: background, algorithms, and applications.

2. To make distributed clustering easy to implement, a novel distributed clustering algorithm (DBCA) is proposed that builds on simple, easily implemented components such as the K-means algorithm and boosting techniques. In each iteration of DBCA, a set of clustering models is first generated from the sub-databases at the sites using a weak clustering algorithm; these models are combined into a global model, which is transmitted to the sites and used to partition the sub-database at each site. Then the sampling probabilities for the next iteration are updated at the sites according to the quality of the partitioning. Finally, the partitions are integrated into an aggregated partition by weighted voting. The final clustering result is the aggregated partition of the last iteration. DBCA can be computed in parallel, is scalable, and has low communication overhead. It is helpful not only to scientists studying cluster analysis but also to practicing engineers solving real-world problems with distributed clustering techniques. Experimental results show that DBCA is effective and achieves results comparable to those of algorithms that apply boosting techniques to centralized databases.

3. Integration scalability across a large number of sites holding large-scale, distributed data sets is studied. First, a new hierarchical optimization mining model based on mobile agents (the HOIKI DDM model) is proposed. Following a hierarchical, divide-and-conquer strategy, the model extends the OIKI DDM model according to network topology and bandwidth, and integrates multiple local results among the sites using mobile agents and incremental optimization. Then a novel distributed clustering algorithm (HOIKIDC) built on the proposed model is presented for clustering large-scale, distributed, heterogeneous data sets. Experimental results demonstrate that HOIKIDC is scalable, flexible, and efficient, and is particularly suited to large-scale distributed environments. In addition, by taking network characteristics into account, HOIKIDC can dramatically reduce communication cost.

4. The validity of knowledge integration in distributed clustering is studied. First, integration validity and the inconsistency among local results from different sites are defined. Then an analysis of the inconsistency among local results and a coordination algorithm for reducing it are proposed. Furthermore, based on the coordination algorithm, a novel distributed clustering algorithm (CDCA), in which information is exchanged among the sites, is presented to improve clustering quality and integration validity. Experimental results show that CDCA outperforms algorithms without coordination in terms of integration validity.

5. For large-scale, distributed short time-series data sets, which arise in many fields such as industry and DNA databases, a distributed clustering algorithm (DFSTS) is proposed. It clusters short time series in distributed environments by analyzing the shape similarity hidden in the data, so as to reveal the structure of the data. Based on fuzzy clustering, the algorithm runs at multiple sites without gathering all data into a single data set. Simulation results demonstrate that the algorithm is effective, efficient, and scalable, and that it provides the same clustering quality as clustering a single centralized data set.

6. The distributed algorithms proposed in the dissertation are applied to a steel plant in a real-world project (National "863" Project) to solve real industrial problems. First, a prototype distributed data mining system is designed for applying distributed algorithms in the metallurgical industry. Then, for large-scale, distributed data from continuous annealing processes, two distributed mining tasks that employ distributed clustering algorithms are carried out: 1) modeling and prediction of strip rupture after data preprocessing; 2) detection of outliers. The results indicate that the distributed approaches are effective and need to transfer only models and knowledge rather than the original data. These results suggest that the distributed clustering approaches proposed in this dissertation have good application prospects for analyzing large-scale, distributed data from the metallurgical process industries.
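
The DBCA iteration described in item 2 could be sketched roughly as follows. This is an illustrative reading of the abstract, not the dissertation's implementation. The function names (nearest, kmeans, align, dbca_sketch), the way local models are combined (K-means over the local centroids), the quality measure (distance to the assigned centroid), the probability update rule, the vote weights, and the greedy label alignment between iterations are all assumptions made for the example.

```python
import numpy as np

def nearest(X, centers):
    """Return (label, distance) of the closest center for every row of X."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1), d.min(axis=1)

def kmeans(X, k, rng, iters=20):
    """Plain Lloyd's K-means, standing in for the 'weak' clusterer."""
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels, _ = nearest(X, centers)
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old center
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def align(centers, reference):
    """Greedily reorder centers so that row j is matched to reference row j.
    Needed so that cluster labels mean the same thing in every iteration."""
    d = np.linalg.norm(reference[:, None, :] - centers[None, :, :], axis=2)
    order = np.full(len(reference), -1)
    for _ in range(len(reference)):
        i, j = np.unravel_index(d.argmin(), d.shape)
        order[i] = j
        d[i, :] = np.inf
        d[:, j] = np.inf
    return centers[order]

def dbca_sketch(sites, k, rounds=5, sample_frac=0.5, seed=0):
    """sites: list of (n_i, d) arrays, one per site (each with at least k rows).
    Returns one label array per site."""
    rng = np.random.default_rng(seed)
    probs = [np.full(len(X), 1.0 / len(X)) for X in sites]
    votes = [np.zeros((len(X), k)) for X in sites]
    reference = None
    for _ in range(rounds):
        # Step 1: every site fits a weak model on a weighted sample of its own data.
        local_models = []
        for X, p in zip(sites, probs):
            n = max(k, int(sample_frac * len(X)))
            idx = rng.choice(len(X), size=n, replace=False, p=p)
            local_models.append(kmeans(X[idx], k, rng))
        # Step 2: only centroids leave the sites; merge them into a global model.
        global_model = kmeans(np.vstack(local_models), k, rng)
        reference = global_model if reference is None else reference
        global_model = align(global_model, reference)
        # Step 3: every site partitions its sub-database with the global model,
        # casts a quality-weighted vote, and raises the sampling probability
        # of points that the model fits poorly.
        for s, X in enumerate(sites):
            labels, dist = nearest(X, global_model)
            scale = dist.mean() + 1e-12
            votes[s][np.arange(len(X)), labels] += 1.0 / scale
            probs[s] = probs[s] * (1.0 + dist / scale)
            probs[s] /= probs[s].sum()
    # Step 4: the aggregated partition is the weighted-vote winner per point.
    return [v.argmax(axis=1) for v in votes]
```

Calling dbca_sketch(sites, k=2) with a list of per-site NumPy arrays returns one label array per site. In this sketch only centroids and vote weights are computed from shared information, while the raw rows never leave their site, which mirrors the abstract's point about low communication overhead.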
Keywords/Search Tags: Data mining, distributed computing, distributed clustering, ensemble learning, mobile agent, hierarchical optimization, collaboration, time-series