New Methods For Cluster Analysis In Distributed Environments

Posted on: 2007-11-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C A Li
GTID: 1118360212489537
Subject: Control Science and Engineering

Abstract/Summary:

With the rapid development of computer and memory technologies, interest in clustering theory and its applications in data mining is growing, driven by the wide availability of huge amounts of data and the pressing need to turn such data into useful information and knowledge. Cluster analysis rests on the naive idea that "things of one kind come together": it divides data into groups of similar objects, and it is widely applied in many fields.

In recent years, databases have grown steadily and have become physically or geographically distributed across more and more locations connected by computer networks. Most existing clustering algorithms struggle to extract knowledge from huge amounts of distributed data, because they need to load all data into main memory and incur a huge computational overhead. New knowledge-discovery methods are therefore needed for large-scale, distributed environments, and distributed clustering is one of them. Distributed clustering is the application of cluster analysis in distributed computing environments and a challenging topic in data mining.

This dissertation explores new clustering techniques in distributed environments, with the aim of providing theoretical and technical foundations for using large-scale, distributed data efficiently and appropriately. Several novel distributed clustering methods are proposed for clustering large-scale, distributed data sets, drawing on techniques from machine learning, artificial intelligence and distributed computing. The main work and results of the dissertation are as follows:

1. Clustering methods in centralized and distributed environments are surveyed in three aspects: background, algorithms and applications.

2. To make distributed clustering easy to implement, a novel distributed clustering algorithm (DBCA) is proposed that is built from simple, easily implemented components such as the K-means algorithm and boosting techniques. At each iteration of DBCA, a set of clustering models is first generated from the sub-databases at the sites using a weak clustering algorithm. These models are combined into a global model, which is transmitted to the sites and used to partition the sub-database at each site. Then the sampling probabilities for the next iteration are updated at the sites according to the quality of the partitions. Finally, the partitions are integrated into an aggregated partition by weighted voting. The final clustering result is the aggregated partition at the last iteration. DBCA can be computed in parallel, is scalable and has a low communication overhead. It is helpful both for scientists investigating cluster analysis and for practicing engineers solving real-world problems with distributed clustering techniques. Experimental results show that DBCA is effective and achieves results comparable to those of algorithms that apply boosting techniques to centralized databases.

3. The scalability of integration across a large number of sites holding large-scale, distributed data sets is studied. First, a new hierarchical optimization mining model (HOIKI DDM model) based on mobile agents is proposed. Based on a hierarchical idea and a divide-and-conquer strategy, the proposed model extends the OIKI DDM model according to network topology and bandwidth, and integrates multiple local results among the sites using mobile agents and incremental optimization. Then, a novel distributed clustering algorithm (HOIKIDC) built on the proposed model is presented for clustering large-scale, distributed, heterogeneous data sets. The experimental results demonstrate that HOIKIDC is scalable, flexible and efficient, and particularly suited to large-scale distributed environments. In addition, HOIKIDC can dramatically reduce communication cost by exploiting network characteristics.

4. The validity of knowledge integration in distributed clustering is studied. First, integration validity and the inconsistency among local results from different sites are defined. Then, an analysis of the inconsistency among local results and a coordination algorithm to reduce it are proposed. Furthermore, based on the coordination algorithm, a novel distributed clustering algorithm (CDCA), in which information is exchanged among the sites, is presented to improve clustering quality and integration validity. Experimental results show that CDCA outperforms algorithms without coordination in terms of integration validity.

5. For the large-scale, distributed short time-series data sets found in many fields, such as industrial data and DNA databases, a distributed clustering algorithm (DFSTS) is proposed. It clusters short time series in distributed environments in order to analyze the shape similarity hidden in the data and thereby reveal its structure. Based on fuzzy clustering, the proposed algorithm runs at multiple sites without transferring all data into a single data set. Simulation results demonstrate that the proposed algorithm is effective, efficient and scalable, and that it provides the same clustering quality as clustering the single centralized data set.

6. The distributed algorithms proposed in the dissertation are applied at a steel plant in a real-world project (National "863" Project) to solve real industrial problems. First, a prototype distributed data mining system is designed to apply the distributed algorithms to the metallurgical industry. Then, for large-scale, distributed data from continuous annealing processes, two distributed mining tasks that employ the distributed clustering algorithms are performed: 1) modeling and prediction of strip rupture after data preprocessing; 2) detection of outliers. The results indicate that the distributed approaches are effective and need to transfer only models and knowledge rather than the original data. Based on these results, the distributed clustering approaches proposed in this dissertation can be expected to have good application prospects for analyzing large-scale, distributed data from metallurgical process industries.

Keywords/Search Tags: Data mining, distributed computing, distributed clustering, ensemble learning, mobile agent, hierarchical optimization, collaboration, time series
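
The idea that sites exchange models rather than raw data, described in item 2 of the abstract, can be illustrated with a minimal Python sketch. This is not the DBCA algorithm itself: boosting, resampling and weighted voting are omitted, and a single round of K-means model merging is swapped in as a simplified stand-in. All names (`local_model`, `global_model`, `farthest_first`, the toy data and the three sites) are hypothetical and chosen only for illustration.

```python
import numpy as np


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point farthest from the centroids chosen so far."""
    chosen = [0]
    min_dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(min_dist.argmax())
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen].copy()


def kmeans(points, k, weights=None, iters=20):
    """Lloyd's K-means with optional per-point weights."""
    points = np.asarray(points, dtype=float)
    weights = np.ones(len(points)) if weights is None else np.asarray(weights, dtype=float)
    centroids = farthest_first(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            mask = labels == j
            if mask.any():  # an empty cluster keeps its previous centroid
                centroids[j] = np.average(points[mask], axis=0, weights=weights[mask])
    return centroids, assign(points, centroids)


def local_model(site_data, k):
    """Runs at a site: summarise the local data as k centroids and cluster sizes."""
    centroids, labels = kmeans(site_data, k)
    return centroids, np.bincount(labels, minlength=k).astype(float)


def global_model(local_models, k):
    """Runs at the coordinator: cluster the local centroids, weighted by size."""
    all_centroids = np.vstack([c for c, _ in local_models])
    all_sizes = np.concatenate([n for _, n in local_models])
    keep = all_sizes > 0  # ignore local clusters that ended up empty
    centroids, _ = kmeans(all_centroids[keep], k, weights=all_sizes[keep])
    return centroids


# Toy data (an assumption): three well-separated blobs split across three sites.
rng = np.random.default_rng(0)
true_centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
sites = [
    np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in true_centres])
    for _ in range(3)
]

models = [local_model(data, k=3) for data in sites]        # computed at each site
centroids = global_model(models, k=3)                      # only models are sent
site_labels = [assign(data, centroids) for data in sites]  # applied at each site
```

In this sketch each site sends only three centroids and three counts to the coordinator, and receives three global centroids back, so the communication volume is independent of the number of local records. A fuller treatment along the lines of the abstract would repeat this over several iterations, resample each sub-database according to partition quality, and combine the per-iteration partitions by weighted voting.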
 
