Font Size: a A A

The Study On Outlier Detection Over Data Streams

Posted on:2011-03-20Degree:DoctorType:Dissertation
Country:ChinaCandidate:X H TangFull Text:PDF
GTID:1118360305492263Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
In recent years, a new class of data model-data streams emerges with the development of information technology. The data stream consists of real-time, continuous, ordered sequence of data in all areas of industry and life of people, such as stock trading, train ticketing system, sensor networks et al. Data in data streams is large, continuous and rapid, unpredictable and short-term perishable. These characteristics of data streams cause that a lot of traditional data mining techniques don't work well on data streams. Data mining techniques on data streams are required to have online data mining capacity. They are needed to handle continuous coming data in the limited space of memory and feedback results to users in time. In applications of data streams, recent data instances of data streams usually are more significant than the older ones and may interest some users. To meet the need, the sliding window model is proposed and the range of the sliding window (including time-based and count-based includes the latest instances of data streams.Research on outlier detection over data streams is an important branch of data mining community and has a very important theoretical and practical value. For example, for transaction data of banks, a number of unusual transactions data may indicate financial fraud. In the airport security system, detection of some abnormal behaviors may prevent terrorist attacks. By mining some unusual variations in medical data, health authorities can prevent outbreaks of major infectious diseases. By detection of some defects in product data, the product can quickly understand the status of machines. Because of large amounts of data in data streams, it is impossible to store data of data streams in the second memory and outlier detection algorithms on static data sets don't work on data streams. Therefore, new methods need be proposed to detect outliers in data streams. Since data in data streams is perishable and high-speed, methods over data streams are required to scan data in data stream for one time and incrementally feedback results to users with low spatiotemporal complexity. In the sliding window model, when sliding window continuously moves forward, some old data will fall out of the window and new data will come in the window. Update of data in the sliding window affects the results of outlier detection. To reflect change in the window, the method of outlier detection needs to update outliers and periodically remove stale outliers to improve performance of the methodThe method of outlier detection in recent windows based on the tumbling physics window uses the structure of ROF-tree to maintain frequent patterns and outliers.The method greatly improves efficiency of memory by cutting load and refreshing ROF-tree. As approximate outlier degrees of data instances that the method obtains by conservative strategy are no less than actual outlier degrees of data instances, the method achieves the best goal of no missed outliers.In order to detect outliers in the sliding window of arbitrary size fast and accurately, we proposes a new metric called FPCOF(frequent pattern contradiction outlier factor), and the metric can estimate the outlier degree of data elements intuitively and precisely. Base on the metric, a novel method called ODFP-SW (outlier detection in an arbitrary sliding window over online data streams based on frequent pattern) is developed. In ODFP-SW, a tree called SWODFP-Tree is constructed and FPCOF of a new coming data element is computed simultaneously, when the new coming element is incrementally updated on the SWODFP-Tree. The candidate outlier set and FPCOF of candidate outliers are updated dynamically and change of the information about outliers is reflected, by removal and movement of the candidate outlier sets.In applications of outlier detection over data streams, it is hard or even impossible to select an appropriate threshold of outlier degree for persons. Therefore, people want to mine TOP-K outliers in data streams. A method is proposed to detect TOP-K outliers in the sliding window over data streams. Base on Chernoff bound theorem and the outlier degree of the current Kth outlier, the minimum candidate TOP-K outlier degree threshold is computed by the method. Data instances in the sliding window are divided into two categories:TOP-K candidate outliers and inliers.In order to save memory space, the method will remove stale candidate outliers and inliers, when the sliding window moves forward. Correctness of detection results of the method is ensured by Chernoff bound with high probability.Query processing optimization based on update pattern awareness is a new hot topic in the research field of continuous queries over sliding window. Efficiency of continuous query processing highly depends on overhead of result state maintenance. This investigation proposes a ladder queue with branch lists to maintain result state of continuous query. The trunk list and the branch list are designed into the ladder queue. The ladder queue uses the branch lists to gather the result tuples with the identical expiration time, and employs "spawning" mechanism to achieve O(1) amortized access time complexity for result data under the different distributions.
Keywords/Search Tags:data streams, outlier detection, TOP-K outliers, continuous query
PDF Full Text Request
Related items