A Study Of Density-based Clustering And Drifting-concept Detecting For Data Stream

Posted on: 2018-07-31
Degree: Master
Type: Thesis
Country: China
Candidate: Z L Cui
Full Text: PDF
GTID: 2348330512992129
Subject: Computer Science and Technology
Abstract/Summary:
Data stream clustering is a key data mining technique. In data stream clustering research, the dominant frameworks fall into two categories: the single-phase model and the two-phase scheme. Density-based clustering under the two-phase scheme consists of online processing and offline processing. In online processing, the data is mapped to a grid; in offline processing, the grids are clustered, which reduces the difficulty of clustering the data stream. In offline processing, however, this clustering framework has three deficiencies. First, using a fixed threshold to decide whether a grid is sparse or dense does not adapt to unevenly distributed or multi-density data streams. Second, connecting adjacent grids based on density alone ignores the similarity of the adjacent grids, which affects clustering accuracy. Third, it does not consider boundary detection, even though some boundary points are noise and others may belong to neighboring clusters.

In addition, the concept underlying a data stream may change over time, a phenomenon known as concept drift. Based on rough-set theory and the sliding window technique, an existing concept drift detection algorithm, DCDA, calculates the distance between two sliding windows to judge whether concept drift has occurred. This algorithm has the following shortcomings. First, it applies only to categorical data. Second, it does not consider that a window may contain multiple concepts. Third, it cannot determine an appropriate sliding window size.

In view of the above problems, the main contributions of this thesis are as follows.

First, to address the defects of DCDA, we propose DCDD, a framework for detecting concept drift in data streams using density-based clustering on a variable-size sliding window. By making use of grids, it extends DCDA so that it applies effectively to general data. To handle multiple concepts within one sliding window, we create a temporary density grid and an old density grid in online processing, and we extend the detection formula of DCDA by assigning different weights to data according to arrival time; the distance between the temporary density grid and the old density grid is then used to detect concept drift. Instead of a fixed-size sliding window, DCDD uses a variable-size sliding window: in offline processing, we train a prediction model to predict the amount of data belonging to the same concept and adjust the window size accordingly. Experimental results show that DCDD takes much less time to detect concept drift than DCDA, and that it detects concept drift more accurately and efficiently.

Second, to address the limitations of density-based clustering, we propose a relative density-based clustering and boundary detection algorithm for data streams. The main idea is to calculate the similarity of neighboring grids and use it as a weight that affects whether the neighboring grids are connected. The grids are then clustered with a relative difference model that considers the density, the centroid and the similarity weight between adjacent grids. We also propose a boundary detection algorithm that uses a membership function based on fuzzy sets to label the data in sparse grids surrounding neighboring clusters. Experimental results show that our algorithm applies to multi-density data streams and achieves better clustering quality.
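The drift test described in the abstract, comparing a temporary density grid with an old density grid under arrival-time weights, can be illustrated with a minimal sketch. The abstract does not give the actual DCDD formula, so everything specific here is an assumption made for illustration: the exponential `decay` weighting, the total variation distance, the fixed `threshold`, and all function and parameter names are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
import math


def grid_cell(point, cell_size):
    """Map a numeric point to the integer coordinates of its grid cell."""
    return tuple(math.floor(x / cell_size) for x in point)


def weighted_density_grid(points, cell_size, decay):
    """Build a normalised density grid in which newer points weigh more.

    `points` is ordered oldest to newest. The newest point has weight 1 and
    a point that arrived `age` steps earlier has weight decay ** age.
    (Exponential decay is an assumed weighting, not the thesis formula.)
    """
    grid = defaultdict(float)
    n = len(points)
    for i, p in enumerate(points):
        age = n - 1 - i
        grid[grid_cell(p, cell_size)] += decay ** age
    total = sum(grid.values())
    if total == 0:
        return {}
    return {cell: w / total for cell, w in grid.items()}


def grid_distance(old_grid, temp_grid):
    """Total variation distance between two normalised grids, in [0, 1]."""
    cells = set(old_grid) | set(temp_grid)
    return 0.5 * sum(
        abs(old_grid.get(c, 0.0) - temp_grid.get(c, 0.0)) for c in cells
    )


def drift_detected(old_points, temp_points, cell_size=1.0, decay=0.99,
                   threshold=0.5):
    """Flag drift when the two weighted grids differ by more than threshold."""
    old_grid = weighted_density_grid(old_points, cell_size, decay)
    temp_grid = weighted_density_grid(temp_points, cell_size, decay)
    return grid_distance(old_grid, temp_grid) > threshold
```

With `cell_size=1.0`, two windows whose points all fall in the same cell have distance 0, and two windows occupying disjoint cells have distance 1, so only the second case is flagged as drift. A variable-size window, as proposed in the thesis, would additionally adjust how many points go into each window; that part is not sketched here.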
Keywords/Search Tags: Data mining, Data stream, Clustering, Density-based clustering, Concept drift