Research On DBSCAN Algorithm Based On Cloud Computing

Posted on: 2014-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: Q F Luo
GTID: 2268330398498125
Subject: Computer application technology
Abstract/Summary:
With the rapid development of network technology and the widespread use of computer applications, the volume of stored data is growing rapidly. How to use massive historical data effectively to analyze the current situation and forecast trends has become a key challenge for industry. This challenge has driven the emergence and development of data mining technology, which is now widely applied in retail, finance, telecommunications, medicine, astronomy, and other areas. As an important part of data mining, cluster analysis is widely used in pattern recognition, data analysis, image processing, market research, and other fields. Among clustering algorithms, DBSCAN is widely used because it can discover clusters of arbitrary shape in data spaces that contain noise, and it has become a very active research topic in data mining.
Cloud computing is a hot topic at home and abroad. It is a development of current high-performance computing models: a computing model that uses virtualization to provide dynamically scalable resources as services over the network. Through cloud computing, users can obtain dynamically scalable computing and storage capacity over the network. Cloud computing can improve data processing efficiency while lowering the requirements on terminal equipment, and it can effectively address the problems faced in processing massive data. Distributed data mining platforms based on cloud computing have therefore become an active research topic.
This thesis is based on the practice of a key breakthrough project. It analyzes and studies cloud computing and data mining technology, focusing on the density-based DBSCAN clustering algorithm. To address the shortcomings of DBSCAN, and taking into account the characteristics of the project's charging station data, the thesis proposes a new algorithm.
The new algorithm is a DBSCAN clustering algorithm based on a grid control factor. It builds on the fixed-grid-size DBSCAN algorithm used in the project and uses the value of a grid control factor to adjust the grid size, in order to find a grid size that gives better clustering accuracy. Tests on charging station data show that the algorithm improves clustering accuracy and also effectively reduces time complexity. The second important problem addressed in this thesis is parallelizing the improved algorithm and implementing it on a cloud computing platform. Clustering massive data sets requires a stable and efficient system environment. The thesis designs a parallel algorithm based on Hadoop, builds a simple Hadoop environment, and encapsulates the DBSCAN clustering algorithm in the MapReduce framework, which greatly improves the algorithm's efficiency. Finally, the improved cloud-based algorithm is verified on large-scale charging station data generated by replication. The experimental results show that the cloud-based DBSCAN algorithm greatly improves the processing efficiency for large data sets without reducing clustering quality.
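As an illustration only, the Python sketch below shows one way a grid-accelerated DBSCAN with an adjustable grid size could look. The abstract does not state the thesis's actual rule for the grid control factor, so the rule used here (cell side = factor × eps) and all names (`grid_cell`, `build_grid`, `neighbors`, `dbscan_grid`, `factor`) are assumptions. In this sketch the factor only changes how much work each neighborhood search does, not the resulting clusters, and the code runs on a single machine; it does not reproduce the Hadoop/MapReduce implementation.

```python
from collections import defaultdict, deque
from math import ceil, floor, hypot


def grid_cell(point, eps, factor):
    """Map a 2-D point to a grid cell; cell side = factor * eps (assumed rule)."""
    side = factor * eps
    return (floor(point[0] / side), floor(point[1] / side))


def build_grid(points, eps, factor):
    """Index point positions by the grid cell they fall into."""
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[grid_cell(p, eps, factor)].append(idx)
    return grid


def neighbors(idx, points, grid, eps, factor):
    """Indices of points within eps of points[idx], searching nearby cells only."""
    cx, cy = grid_cell(points[idx], eps, factor)
    reach = ceil(1 / factor)  # number of cells that an eps-radius can span
    px, py = points[idx]
    found = []
    for gx in range(cx - reach, cx + reach + 1):
        for gy in range(cy - reach, cy + reach + 1):
            for j in grid.get((gx, gy), ()):
                if hypot(points[j][0] - px, points[j][1] - py) <= eps:
                    found.append(j)
    return found


def dbscan_grid(points, eps, min_pts, factor=1.0):
    """Standard DBSCAN labelling; -1 marks noise, 0..k-1 mark clusters."""
    grid = build_grid(points, eps, factor)
    labels = [None] * len(points)  # None means not yet visited
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i, points, grid, eps, factor)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j, points, grid, eps, factor)
            if len(j_nbrs) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_nbrs)
        cluster += 1
    return labels
```

A call such as `dbscan_grid(points, eps=1.5, min_pts=3, factor=0.5)` uses smaller cells than `factor=1.0`, so each neighborhood query inspects more cells that each hold fewer points. Choosing the factor to balance those two costs for a given data set is one plausible reading of what tuning a grid size could mean here.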
Keywords/Search Tags: data mining, cloud computing, DBSCAN algorithm, grid control factor