Study On Clustering Algorithm Adapt To High-speed Data Stream

Posted on:2014-12-26

Degree:Master

Type:Thesis

Country:China

Candidate:H Q Gao

GTID:2268330425983704

Subject:Computer Science and Technology

Abstract/Summary:

In recent years, with the rapid development of communication and information technology, data-intensive applications have gradually emerged. In these applications, the data is no longer traditional static data based on the relational model but takes the form of a data stream. Such applications include financial systems, network monitoring, security, communication data management, manufacturing, and sensor networks. Massive volumes of data arrive rapidly, continuously, in real time, and in order. Mining the potential knowledge in data streams poses new challenges to data mining algorithms. Data stream clustering analysis is an important method in the data stream mining area and has received a lot of research attention in recent years.

This paper takes high-speed data streams with noise as its research object and designs and implements an accurate and effective self-adaptive anytime data stream clustering algorithm. The main work is as follows. First, the paper introduces the research background, significance, and related work at home and abroad. Second, it studies the theory and technology of data stream mining, especially clustering analysis, and summarizes the advantages and disadvantages of the main data stream clustering algorithms. Building on this earlier work, the paper changes the synopsis data structure and designs an anytime data stream clustering algorithm called SSMC-Tree (Similarity Search with Micro-Clusters Tree) by improving the SS-Tree (Similarity Search Tree). The algorithm adopts a two-stage framework. The online micro-clustering part uses the SSMC-Tree data structure and introduces buffer and hitchhiker processing strategies. The offline macro-clustering part applies a density-based clustering method to the micro-clusters obtained in the online part to obtain clusters of arbitrary shape.

Because data streams in practical applications arrive fast, the paper proposes a local clustering algorithm (LocalAggregate) that improves on the SSMC-Tree algorithm above. The algorithm pre-clusters data objects before inserting them into the SSMC-Tree. In addition, to handle noise in the data stream, the algorithm adopts an outlier pruning strategy that introduces a potential core-cluster queue and an outlier cluster queue. The algorithm ensures clustering quality by removing outliers periodically.

Finally, the paper designs and implements the above algorithms in the open-source clustering framework MOA (Massive Online Analysis). Experiments on synthetic and real data sets show that SSMC-Tree and its improved algorithm are accurate and efficient, adapt to high-speed data streams with noise, and can deliver clustering results at any time.
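The abstract does not give the details of the outlier pruning strategy, so the following is only a minimal Python sketch of the general idea, under several assumptions. It keeps micro-clusters in two flat lists standing in for the potential core-cluster queue and the outlier cluster queue; it does not implement the SSMC-Tree, the buffer, or the hitchhiker strategy. The class names (`MicroCluster`, `OutlierPruningSketch`) and the parameters `radius`, `min_points`, and `max_idle` are hypothetical and are not taken from the thesis or from MOA.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class MicroCluster:
    """Additive summary of nearby points: count, per-dimension sum, last update time."""
    n: int
    linear_sum: list
    last_update: int

    def center(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point, t):
        # Update the summary incrementally; the raw point is not stored.
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]
        self.last_update = t


def distance(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class OutlierPruningSketch:
    """Hypothetical two-queue scheme: potential core clusters and outlier clusters."""

    def __init__(self, radius=1.0, min_points=3, max_idle=5):
        self.radius = radius          # assumed absorption radius
        self.min_points = min_points  # assumed promotion threshold
        self.max_idle = max_idle      # assumed idle time before an outlier is pruned
        self.potential = []
        self.outliers = []

    def insert(self, point, t):
        candidates = self.potential + self.outliers
        nearest = min(candidates,
                      key=lambda mc: distance(mc.center(), point),
                      default=None)
        if nearest is not None and distance(nearest.center(), point) <= self.radius:
            nearest.absorb(point, t)
            # Promote an outlier cluster once it has collected enough points.
            if nearest.n >= self.min_points and any(mc is nearest for mc in self.outliers):
                self.outliers = [mc for mc in self.outliers if mc is not nearest]
                self.potential.append(nearest)
        else:
            # A point far from every micro-cluster starts a new outlier cluster.
            self.outliers.append(MicroCluster(1, list(point), t))

    def prune(self, t):
        # Periodic clean-up: drop outlier clusters that have not been updated recently.
        self.outliers = [mc for mc in self.outliers
                         if t - mc.last_update <= self.max_idle]


# Usage: four nearby points form one potential core cluster; the isolated
# point (9.0, 9.0) stays in the outlier queue until it is pruned.
model = OutlierPruningSketch(radius=1.0, min_points=3, max_idle=5)
stream = [(0.0, 0.0), (0.1, 0.0), (9.0, 9.0), (0.0, 0.1), (0.2, 0.1)]
for t, p in enumerate(stream):
    model.insert(p, t)
model.prune(t=10)
```

In this sketch the linear nearest-cluster scan is the part that a tree index such as the SSMC-Tree would be expected to replace, and the surviving potential core clusters are what an offline density-based step would take as input.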

Keywords/Search Tags:

Data Stream, Data Mining, Clustering, Self-Adaptive, Pruning Strategy

Related items

1	Study Of Real Time Data Stream Clustering Based On Damped Window And Pruning Dimension Tree
2	Adaptive Evolving Data Stream Algorithm Based On Time Decay Window
3	Study On Key Technologies Of Frequent Items Mining And Clustering On Data Streams
4	A Density-Based Clustering Algorithm Over Stream Data
5	Research On Dynamic Measurement Based Data Stream Clustering And Its Applications
6	Research On An Application Of Data Stream Query And Data Stream Mining In Oil Field
7	Research On Data Stream Clustering And Its Applications Based On Correlations
8	The Research On Classification Algorithms Over Data Stream
9	Research Of Evolving Data Stream Clustering
10	Study On Data Stream Techniques And Its Application In Electric Power Information Processing