Study Of Distributed Clustering Algorithm Of Data Stream

Posted on: 2016-12-28  Degree: Master  Type: Thesis
Country: China  Candidate: Y He  Full Text: PDF
GTID: 2308330467497033  Subject: Computer technology
Abstract/Summary:
With the rapid development of computer and network technology, the data stream has become a major data model, widely used in fields such as network monitoring, financial analysis, and communications. In data mining, cluster analysis is one of the most widely used methods for studying data similarity. Because data streams arrive in real time and in large volume, traditional clustering algorithms cannot process them, so clustering algorithms designed for data streams have become increasingly important.

Because a data stream is real-time, sequential, and unbounded, a stream clustering algorithm must not only ensure accuracy and efficiency but also handle outliers within massive data. This dissertation addresses the following topics.

Firstly, the classic two-layer (online/offline) clustering framework is used to cluster the data: a decayed time window is introduced on the online layer to acquire the data stream, while on the offline layer an improved weighted centroid and weighted distance raise clustering accuracy and effectively exclude the influence of expired historical data on the clustering.

Secondly, to improve clustering quality, outlier judgment is optimized: the global outlier judgment rule is abandoned in favor of a local judgment rule, and the micro-cluster structure is modified, which yields more accurate results and resolves cluster misidentification.

In addition, to meet growing demands on algorithm efficiency, a distributed clustering algorithm is deployed, and Kernel Principal Component Analysis (KPCA) is applied to preprocess the data stream for dimensionality reduction.
Finally, the analysis shows that, for high-dimensional and massive data sets, the modified distributed algorithm can improve the efficiency and accuracy of high-dimensional data stream clustering in the presence of outliers.
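
The abstract does not give formulas, so the following is only a minimal sketch of how a time-decayed weighted centroid and a local outlier check might look on the online layer. The class name `DecayedMicroCluster`, the function `assign`, the fading function 2^(-decay_rate * dt), and the `radius` threshold are all assumptions made for illustration, not the thesis's actual definitions.

```python
import math


class DecayedMicroCluster:
    """Hypothetical micro-cluster whose statistics fade exponentially with time."""

    def __init__(self, point, timestamp, decay_rate=0.1):
        self.decay_rate = decay_rate       # assumed decay parameter
        self.weight = 1.0                  # decayed count of absorbed points
        self.linear_sum = list(point)      # decayed per-dimension sum
        self.last_update = timestamp

    def _fade(self, timestamp):
        # Scale old statistics by 2^(-decay_rate * elapsed time).
        factor = 2.0 ** (-self.decay_rate * (timestamp - self.last_update))
        self.weight *= factor
        self.linear_sum = [s * factor for s in self.linear_sum]
        self.last_update = timestamp

    def centroid(self):
        # Weighted centroid: decayed sum divided by decayed weight.
        return [s / self.weight for s in self.linear_sum]

    def distance(self, point):
        return math.dist(self.centroid(), point)

    def absorb(self, point, timestamp):
        # Fade history first, then add the new point with full weight.
        self._fade(timestamp)
        self.weight += 1.0
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]


def assign(clusters, point, timestamp, radius):
    """Local check: compare the point only with its nearest micro-cluster.

    Returns True if the point was absorbed, False if it started a new
    candidate micro-cluster (a potential outlier).
    """
    if clusters:
        nearest = min(clusters, key=lambda c: c.distance(point))
        if nearest.distance(point) <= radius:
            nearest.absorb(point, timestamp)
            return True
    clusters.append(DecayedMicroCluster(point, timestamp))
    return False
```

In this sketch, older points contribute less to the centroid as time passes, which is one way expired historical data could be kept from dominating the clustering; the outlier decision depends only on the nearest micro-cluster rather than on a global threshold over all data.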
Keywords/Search Tags: Distributed, Data stream, Clustering algorithm, Weighted center, Local outlier