Font Size: a A A

Data Streams Clustering Algorithms Research Based On Spatial Directed Graph And Sequences

Posted on:2013-11-25Degree:MasterType:Thesis
Country:ChinaCandidate:X M YuanFull Text:PDF
GTID:2248330392954866Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
In recent years, the clustering algorithm based on data stream model has becomeone of the major issues in the field of data mining. It has been extensively studied bydomestic and overseas scholars, but there are still many problems to be researched andresolved. Most existing grid-based stream clustering algorithms are lack of effectivestorage structure for grid cells. Due to the randomness of the partition of grids, theedge points of clusters might be partitioned into the sparse grids. These points wouldbecome noise information out of clusters when we cluster data stream by grid-densitybased algorithm. The sequences clustering in data stream has been researchedextensively, but the similar measure between sequences is also a hard part. Thesolutions of above problems would have important influence on optimizing clusteringalgorithms of stream systems, application and so on.Firstly, a spatial directed graph named GBG (Grid-Based Graph) is defined tostore the non-empty grids in data space. GBG graph composes of the set of vertices (agrid in data space is regarded as a vertex) and the set of directed edges, if a vertex Ahas a neighboring dense vertex B, and then there is a directed edge from vertex B to Ain GBG. The algorithm maps the data stream into the non-empty vertices online,updates the vertices’ feature vectors as the arriving of data stream, deletes the sparsevertices every gap time, generates GBG graph when the clustering quest coming andfinally clusters on the current structure.Secondly, a data stream clustering algorithm based on spatial directed graph withcore, SDGCStream, is proposed. It uses the spatial directed graph and the orthocenterof the sparse grids to handle the edge points of clusters. At first, the algorithm definesa structure SDGC (Spatial Directed Graph with Core) to store the summary statisticsof data stream. The vertices of SDGC are maintained as the stream arriving. When theclustering quest comes, the edge information is generated. The initial clustering resultsare got by clustering on SDGC, then we judge whether the points of sparse gridswhich are adjacent to the border of a cluster belong to the cluster according to the orthocenter information and the border vertices of SDGC. At last, a strategy based onthe distance between clusters is presented to adjust the clustering results after handlingthe border of clusters.Lastly, a similar codes detection algorithm based on sequences clustering isproposed. The similar code fragments in source program are found by clustering thecode sequence, which means much in the field of simplifying the source code ordetecting the similar function among codes. First, we partition the source code intomulti-segments according to the structure of itself; a part of the symbols in sequencesare transformed; the similar code fragments are got by clustering these sequencesbased on the weighted edit distance; then the source code is simplified, or the similarfunctions among multi-codes are found.
Keywords/Search Tags:data streams, clustering, spatial directed graph, spatial directed graph withcore, similar codes detection
PDF Full Text Request
Related items