
Research And Implementation Of Distributed Data Mining Model Based On DBSCAN

Posted on: 2010-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhu
Full Text: PDF
GTID: 2178360272495901
Subject: Computer software and theory
Abstract/Summary:
In the past few decades, with the rapid development of information technology, the data in the world and in our lives has grown explosively. People are inundated by data, and information useful to our production and life is hidden within it. How to extract useful information from this broad array of data to guide our production and life has become an issue of great concern, which gave rise to a new computer technology of wide application and great practical value: data mining.

Data mining (also called knowledge discovery) is the non-trivial process of extracting implicit, previously unknown and potentially useful information and knowledge from abundant, incomplete, noisy, fuzzy and stochastic data. It is not a wholly new subject but a multidisciplinary field, influenced by many disciplines including database systems, statistics, machine learning, visualization and information science. Data mining offers a variety of analytical techniques, such as association analysis, sequential pattern analysis, classification analysis and clustering analysis. This paper studies clustering analysis.

Clustering analysis is an important research area of data mining and an important method of data partitioning or grouping. Clustering has been applied in many areas, including commerce, market analysis, biology and Web classification. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. This paper focuses on the density-connectivity-based clustering algorithm DBSCAN.

Traditional data mining methods complete the mining task on a single computer, using serial algorithms on a single data set. However, the growth of data sets, as well as the distributed nature of the data itself, increasingly prevents traditional methods from meeting our needs, which motivates the study of distributed data mining.

Distributed data mining is the process of extracting knowledge from data sets, distributed over multiple sites and usually stored in distributed databases, by means of distributed computing technology. Applying distributed technology to data mining is an innovation in data mining research with attractive prospects and considerable value. The focus of this study is how to combine distributed technology with data mining well, so that distributed data sets can be handled effectively and quickly.

In this paper, we first study the DBDC algorithm, a distributed clustering algorithm based on DBSCAN. We then implement a distributed data mining model using Java sockets, and test and evaluate its functionality. Finally, after further analysis of the model, we put forward a new design idea.

The first chapter introduces the research background and significance of distributed data mining and surveys the present state of research at home and abroad. Research in this field started relatively early abroad and has already produced some findings, whereas domestic research is still at the starting point, leaving a large space for further study.

As the basis for the design and implementation of the model, the second chapter reviews existing distributed programming models and technologies, focusing on the technology used in our distributed data mining model: Java sockets. There are two types of distributed programming models, the client/server model and the object-based model, and the main technologies are sockets, RMI, CORBA and agents. We use the client/server model and Java sockets to implement our model.
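As an illustration of this client/server structure, a minimal Java socket sketch is given below. It is not the code of the thesis itself: the class names, the port number 9000 and the single-message exchange are assumptions made only for the example.

    import java.io.*;
    import java.net.*;

    // Minimal illustrative server: accepts one client and reads one line of text.
    public class MiniServer {
        public static void main(String[] args) throws IOException {
            try (ServerSocket server = new ServerSocket(9000);
                 Socket client = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()))) {
                System.out.println("Received from client: " + in.readLine());
            }
        }
    }

    // Minimal illustrative client: connects to the server and sends one line of text.
    class MiniClient {
        public static void main(String[] args) throws IOException {
            try (Socket socket = new Socket("localhost", 9000);
                 PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
                out.println("local clustering finished");
            }
        }
    }

In the actual model the same pattern carries data sets and clustering results rather than a single text line, with separate threads on each side handling the communication.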
The third chapter introduces in detail the DBSCAN algorithm and the DBDC algorithm used in this paper. DBSCAN is a density-connectivity-based algorithm with the following advantages: (1) it requires minimal domain knowledge to determine the input parameters, since the user only needs to supply the two parameters Eps and MinPts; (2) it discovers clusters of arbitrary shape, which matters because clusters in spatial databases may be spherical, drawn-out, linear, elongated and so on; (3) it is efficient on large databases. DBDC is a distributed clustering algorithm based on DBSCAN. The whole clustering process is divided into two stages, local clustering and global clustering, and both stages use DBSCAN because of these advantages. The whole program consists of two parts, a server program and a client program, which have different functions and are implemented in different ways. The client sites perform local clustering with the client program and transmit their results to the server; the server performs global clustering with the server program and then transmits the result back to the clients in turn.

The fourth chapter details the overall and detailed design of the entire distributed data mining model. By function, the model is divided into four modules: a data transmission module, a data loading module, an algorithm implementation module and a module that combines the partial clustering results. The data transmission module is the cornerstone of the model: it ensures stable transmission of the data sets exchanged between the server and the clients and is implemented with multi-threading, the server and each client having separate communication threads. The data loading module is responsible for loading the original data sets. In this paper the data are stored in the ARFF format used by Weka, and loading means reading all the data records from an ARFF file into memory. The algorithm implementation module is the core of the model. Because the functional requirements on the client differ from those on the server, the module consists of a client part and a server part. The client part first loads the data, then clusters its partial data set with DBSCAN and marks all the core points, and finally selects the representative points, which are few in number, are chosen from among the core points and reflect the local clustering result. All clients send their local representative points to the server as its input data set, and the server part loads this data set and generates the global clustering result. DBSCAN needs the two parameters Eps and MinPts: the clients use optimal values obtained through experiments, and the Eps value used on the server is twice that used on the clients. The combining module integrates the representative points from all the clients, and the site number, client IP address and local cluster number are recorded on the server. The accuracy of the model is tested with the classic Iris data set.
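To make the client-side step described above more concrete, the following hedged Java sketch marks core points by an Eps-neighborhood test and then thins them to a small set of representative points. The brute-force neighborhood search, the greedy thinning rule and the names Point, corePoints and representatives are simplifications assumed for this example; they do not reproduce the thesis implementation or the original DBDC code.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch of the local (client-side) step: mark core points and
    // select a small set of representative points to send to the server.
    public class LocalStepSketch {

        // A data record with numeric attributes only, as handled by the model.
        static class Point {
            final double[] values;
            Point(double[] values) { this.values = values; }
        }

        // Euclidean distance between two points.
        static double distance(Point a, Point b) {
            double sum = 0.0;
            for (int i = 0; i < a.values.length; i++) {
                double d = a.values[i] - b.values[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }

        // A point is a core point if its Eps-neighborhood holds at least MinPts points.
        static List<Point> corePoints(List<Point> data, double eps, int minPts) {
            List<Point> cores = new ArrayList<>();
            for (Point p : data) {
                int neighbours = 0;
                for (Point q : data) {
                    if (distance(p, q) <= eps) neighbours++;
                }
                if (neighbours >= minPts) cores.add(p);
            }
            return cores;
        }

        // Greedy thinning: keep a core point only if it lies farther than eps from
        // every representative already chosen, so few points summarise the local result.
        static List<Point> representatives(List<Point> cores, double eps) {
            List<Point> reps = new ArrayList<>();
            for (Point c : cores) {
                boolean covered = false;
                for (Point r : reps) {
                    if (distance(c, r) <= eps) { covered = true; break; }
                }
                if (!covered) reps.add(c);
            }
            return reps;
        }
    }

The representatives produced by all clients would then be sent over the socket connection to the server, which, as described in chapter four, runs DBSCAN on them again with an Eps value twice the one used on the clients to obtain the global clustering.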
The fifth chapter analyzes the model from the perspective of the network model and points out its deficiencies. Because the model is based on the client/server mode, the server may become a performance bottleneck, so a new design idea based on P2P theory is proposed. Much work remains to be done to study the practicability and efficiency of this new idea.

The experimental analysis in this paper indicates that the model handles clustering problems whose attributes are all numerical very well. Of course, the model still needs further improvement and refinement, with more features added, so that it can become a general model for experiment and analysis.
Keywords/Search Tags: Data mining, Distributed, Clustering, DBSCAN