Clustering And Anomaly Detection Over Data Streams

Posted on: 2010-08-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Zhang
Full Text: PDF
GTID: 1488302756960859
Subject: Computer software and theory
Abstract/Summary:
As a new data model, the data stream plays an important role in many applications, such as network traffic management, financial monitoring, e-business, traffic control, information publish/subscribe, copyright protection, environmental monitoring, and industrial flow management. Processing and mining technologies for data streams have been widely studied. The unbounded, high-speed nature of data streams and the need for fast online response in these applications break many assumptions made by traditional databases, so many basic data mining techniques need to be re-examined.

Many methods for stream clustering and anomaly detection have been proposed, but many problems remain to be studied and resolved. In this thesis, we study the clustering of uncertain data streams and anomaly detection over multiple data streams. The main contributions are as follows:

· We propose an algorithm named EMicro, which extends the deterministic stream clustering framework to uncertain streams. Mining uncertain data streams is challenging in two respects: the algorithms must be efficient enough to keep up with a fast-evolving stream, and a probabilistic data stream carries uncertain information. In our method, we first propose a new semantics of clusters for the uncertain scenario, then develop a point-absorbing strategy based on probability gravitation, and finally provide a two-level abnormal point judgement mechanism to improve cluster quality. Extensive experiments show that the proposed algorithm is highly effective while consuming less memory and CPU time.

· We propose an information-theory-based method named EnMicro to tackle clustering over uncertain data streams. When uncertain tuples follow different probability distributions, a clustering algorithm should consider not only the tuple value but also its uncertainty. To serve both purposes, a metric named tuple uncertainty is integrated into the whole clustering procedure. First, we propose the uncertainty measurement and its properties. Then, based on this measurement, we design a hybrid decay model that retains high-quality data. Finally, building on these tools, we propose a probabilistic stream clustering algorithm. Experiments demonstrate that our method achieves high clustering quality and a fast processing rate, and is well suited to uncertain data streams.

· We present a method to monitor trend evolution among multiple data streams and detect the abnormal ones. First, we introduce a novel definition of trend in a single data stream, which captures the tendency at low time and space cost; we also design a trend tracing accuracy criterion to provide a ground truth for comparing different trend indicators. Second, we improve an SVD-based (Singular Value Decomposition) method to select the optimal initial parameters, and provide a novel chessboard sketch for adjusting the parameters online. Finally, exploiting the skewness of the trend distribution over multiple streams, we propose a method to label the abnormal stream. Experimental evaluation shows that our method not only captures meaningful trend anomalies in a variety of settings but also requires an order of magnitude less time and space than previous work.

· We introduce DiCAS, a network traffic monitoring tool that we developed on top of data stream processing technology. The system was designed and implemented to meet the network traffic monitoring requirements of the Shanghai Telecom backbone network. DiCAS adopts a dimension reduction analysis method to monitor the huge volume of SNMP messages gathered from routers in the telecom backbone network. It monitors the correlation of thousands of network links and detects traffic anomalies. Experiments and deployments in a real-life environment show that DiCAS meets the needs of traffic monitoring on a backbone network and greatly improves the performance of the monitoring system.

The thesis combines streaming techniques with the characteristics of uncertain data and proposes efficient clustering and anomaly detection algorithms. On the one hand, it greatly improves the efficiency of clustering analysis over uncertain data streams. On the other hand, our methods complement and improve on existing anomaly detection techniques. Theoretical analysis and experimental results show that our methods solve their corresponding problems efficiently and outperform previous methods in space complexity, processing rate, and result quality.
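
The abstract does not give the definition of tuple uncertainty or of the hybrid decay model used by EnMicro. As a rough illustration only, the sketch below assumes that a tuple's uncertainty is the Shannon entropy of a discrete distribution over its possible values, and that a tuple's weight combines exponential time decay with a certainty factor. The names `tuple_uncertainty`, `decayed_weight`, and `half_life` are hypothetical and do not come from the thesis.

```python
import math


def tuple_uncertainty(probs):
    """Shannon entropy (in bits) of a tuple's discrete value distribution.

    Assumption: the thesis's "tuple uncertainty" may be defined differently;
    entropy is used here only as a standard information-theoretic stand-in.
    """
    total = sum(probs)
    if total <= 0 or any(p < 0 for p in probs):
        raise ValueError("probs must be non-negative with a positive sum")
    entropy = 0.0
    for p in probs:
        q = p / total  # normalise so the values form a distribution
        if q > 0:
            entropy -= q * math.log2(q)
    return entropy


def decayed_weight(probs, age, half_life=100.0):
    """Hypothetical hybrid weight: time decay multiplied by certainty.

    Older tuples and more uncertain tuples both receive smaller weights.
    `half_life` is an illustrative parameter, not one from the thesis.
    """
    n = len(probs)
    max_entropy = math.log2(n) if n > 1 else 1.0
    certainty = 1.0 - tuple_uncertainty(probs) / max_entropy
    time_decay = 2.0 ** (-age / half_life)
    return time_decay * certainty


# A tuple with one dominant value keeps more weight than a uniform one.
confident = decayed_weight([0.9, 0.05, 0.05], age=10)
ambiguous = decayed_weight([1 / 3, 1 / 3, 1 / 3], age=10)
```

Under these assumptions, a tuple whose value is known with certainty is discounted only by its age, while a tuple spread uniformly over its possible values contributes almost nothing, which is one way a decay model could favour high-quality data.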
Keywords/Search Tags: data stream, uncertain data management, anomaly detection, clustering analysis, detection system