
Study on a Clustering Algorithm Adapted to High-Speed Data Streams

Posted on: 2014-12-26
Degree: Master
Type: Thesis
Country: China
Candidate: H Q Gao
GTID: 2268330425983704
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, with the rapid development of communication and information technology, a number of data-intensive applications have emerged. In these applications, the data are no longer traditional static data based on the relational model but take the form of data streams. Such applications include financial systems, network monitoring, security, communication data management, manufacturing, and sensor networks. Massive volumes of data arrive rapidly, continuously, in order, and in real time. Mining the potential knowledge in data streams poses new challenges to data mining algorithms. Data stream clustering analysis is an important method in data stream mining and has received a lot of attention in recent years.

This paper takes high-speed data streams with noise as its research subject, and designs and implements an accurate and effective self-adaptive anytime data stream clustering algorithm. The main work is as follows. First, the paper introduces the research background, its significance, and related work at home and abroad. Second, it studies data stream mining theory and technology, especially clustering analysis, and summarizes the advantages and disadvantages of the main data stream clustering algorithms. Building on this earlier work, the paper changes the synopsis data structure and designs an anytime data stream clustering algorithm called SSMC-Tree (Similarity Search with Micro-Clusters Tree) by improving the SS-Tree (Similarity Search Tree). The algorithm adopts a two-stage framework. The online micro-clustering part uses the SSMC-Tree data structure and introduces buffer and hitchhiker processing strategies.
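The abstract does not specify what an SSMC-Tree entry stores, so the following is only a minimal sketch of how an online micro-cluster summary is commonly kept in stream clustering: an additive cluster feature (count, per-dimension linear sum, per-dimension squared sum). The class name `MicroCluster` and its methods are hypothetical and are not taken from the thesis or from MOA.

```python
import math
from typing import List, Sequence


class MicroCluster:
    """Additive summary of absorbed points (assumed layout, not the thesis's)."""

    def __init__(self, dim: int) -> None:
        self.n = 0                                   # number of absorbed points
        self.linear_sum: List[float] = [0.0] * dim   # per-dimension sum of x
        self.squared_sum: List[float] = [0.0] * dim  # per-dimension sum of x^2

    def insert(self, point: Sequence[float]) -> None:
        """Absorb one point in O(dim) time without storing the point itself."""
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x

    def merge(self, other: "MicroCluster") -> None:
        """Summaries are additive, so merging is a component-wise sum."""
        self.n += other.n
        for i in range(len(self.linear_sum)):
            self.linear_sum[i] += other.linear_sum[i]
            self.squared_sum[i] += other.squared_sum[i]

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def radius(self) -> float:
        """Root-mean-square distance of the absorbed points from the centroid."""
        var = sum(
            ss / self.n - (ls / self.n) ** 2
            for ls, ss in zip(self.linear_sum, self.squared_sum)
        )
        return math.sqrt(max(var, 0.0))  # clamp tiny negative rounding errors


mc = MicroCluster(dim=2)
mc.insert((0.0, 0.0))
mc.insert((2.0, 0.0))
print(mc.centroid(), mc.radius())
```

Because the summary is additive, a tree node could in principle aggregate its children by merging their summaries, which is one reason this representation suits index-based online clustering.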
The offline macro-clustering part takes the micro-clusters obtained in the online part and derives clusters of arbitrary shape using a density-based clustering method.

Because data streams in practical applications arrive fast, the paper proposes a local clustering algorithm (LocalAggregate) that improves on the SSMC-Tree above. The algorithm pre-clusters data objects before inserting them into the SSMC-Tree. In addition, to handle noise in the data stream, the algorithm adopts an outlier pruning strategy that introduces a potential core-cluster queue and an outlier cluster queue. The algorithm ensures clustering quality by removing outliers periodically.

Finally, the paper designs and implements the above algorithms in the open-source clustering framework MOA (Massive Online Analysis). Experiments on synthetic and real data sets show that SSMC-Tree and its improved algorithm are accurate and efficient, can adapt to high-speed data streams with noise, and can deliver clustering results at any time.
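The abstract names a potential core-cluster queue and an outlier cluster queue but gives no promotion or pruning rule. The sketch below illustrates one plausible reading under explicit assumptions: a new cluster starts in the outlier queue, moves to the potential core queue once its point count reaches a threshold, and everything still in the outlier queue is dropped at a fixed interval. The class `OutlierPruner` and the parameters `promote_weight` and `prune_every` are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Hashable


class OutlierPruner:
    """Two-queue pruning sketch; thresholds and rules are assumptions."""

    def __init__(self, promote_weight: int = 3, prune_every: int = 5) -> None:
        self.promote_weight = promote_weight  # points needed to leave the outlier queue
        self.prune_every = prune_every        # prune after this many observed points
        self.weights: Dict[Hashable, int] = {}
        self.potential_core: Deque[Hashable] = deque()
        self.outliers: Deque[Hashable] = deque()
        self.points_seen = 0

    def observe(self, cluster_id: Hashable) -> None:
        """Record that one incoming point was assigned to cluster_id."""
        self.points_seen += 1
        if cluster_id not in self.weights:
            # Unknown clusters start as outlier candidates.
            self.weights[cluster_id] = 0
            self.outliers.append(cluster_id)
        self.weights[cluster_id] += 1
        if (cluster_id in self.outliers
                and self.weights[cluster_id] >= self.promote_weight):
            # Enough support: promote to the potential core-cluster queue.
            self.outliers.remove(cluster_id)
            self.potential_core.append(cluster_id)
        if self.points_seen % self.prune_every == 0:
            self.prune()

    def prune(self) -> None:
        """Periodic step: discard every cluster still in the outlier queue."""
        for cid in self.outliers:
            del self.weights[cid]
        self.outliers.clear()


pruner = OutlierPruner(promote_weight=2, prune_every=4)
for cid in ["a", "a", "b", "a"]:
    pruner.observe(cid)
print(list(pruner.potential_core), list(pruner.outliers))
```

A real implementation would likely use decayed weights and age-dependent thresholds rather than raw counts; the fixed-interval purge here is kept deliberately simple to show the queue mechanics.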
Keywords/Search Tags: Data Stream, Data Mining, Clustering, Self-Adaptive, Pruning Strategy