Font Size: a A A

Clustering Analysis And Outlier Detection Algorithms On Uncertain Data

Posted on:2015-05-01Degree:DoctorType:Dissertation
Country:ChinaCandidate:K Y CaoFull Text:PDF
GTID:1318330482954586Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
As a new data type, uncertain data is widely used in many applications, such as finance, location-based services, mobile monitoring, sensor networks, and so on. In recent years, analysis and processing technology based on uncertain data has become a hot research in databases, data mining and other fields. The existence of uncertainty brought new challenges to researchers:on one hand, the basic concepts of traditional data mining techniques which are based on certain data need to be extended for uncertain data; on the other hand, the number of instances in the possible world is exponential growth, which will increase the complexity and running time of algorithms.On the basis of basic concepts and algorithms of data mining, the background of uncertain data and characteristics of the data model, we study the problem of clustering and outlier detection on uncertain data in this dissertation. And the purpose is to provide more variety clustering analysis and outlier detection over uncertain data. The main contributions of this dissertation include the following aspects:(1) We propose an algorithm named OBS-UK-means, which clusters uncertain data in an obstacle space. In order to ensure the accuracy and improve the efficiency of algorithm, we propose the shortest distance area concept and two pruning strategies based on R-tree and Voronoi diagram respectively, which greatly reduces the calculations. The experiment demonstrates that the efficiency and accuracy of the OBS-UK-means algorithm, and the pruning approach can improve the efficiency of the clustering algorithm while not damage the clustering effectiveness.(2) We propose a new density-based local outlier concept based on uncertain data. In order to quickly detect outliers, a basic algorithm is proposed that does not require the unfolding of all possible worlds. Furthermore, we propose an estimator of the Local Outlier Factor (LOF) without calculating the probability. It can effectively reduce the amount of the calculations. Finally, the performance of our method is verified through a number of simulation experiments.(3) We propose Continuous Uncertain Outlier Detection (CUOD), which can quickly determine the nature of the uncertain elements by pruning to improve the efficiency. Furthermore, we propose a pruning approach-- Probability Pruning for ContinuousUncertain Outlier Detection (PCUOD) to reduce the detection cost. It is an estimated outlier probability method which can effectively reduce the amount of calculations. The cost of PCUOD incremental algorithm can satisfy the demand of uncertain data streams. Finally, a new method for parameter variable queries to CUOD is proposed, enabling the concurrent execution of different queries.(4) We propose a new outlier concept on uncertain data stream based on possible worlds. Then, an outlier detection method on uncertain data stream is proposed to meet the demand of limited storage and real-time processing. Next, a dynamic storage structure is designed for outlier detection on uncertain data stream over sliding window in order to meet the demands of limited storage and real-time response. Furthermore, an efficient range query method based on SM-tree (Statistics M-tree) is proposed to reduce some redundant calculation. Finally, we propose the top-n outlier detection on uncertain data stream.In this dissertation, we propose several clustering and outlier detection algorithms on account of the challenges of uncertain data. For one thing, it greatly improves the efficiency of clustering analysis over uncertain data in obstacle space. For another, our methods are also great complementary to existing outlier detection technology. Theoretical analysis and experimental results show that our methods can solve their corresponding problems efficiently and outperform previous processing methods in space complexity, processing rate and result quality.
Keywords/Search Tags:uncertain data, data mining, uncertain data stream, outlier, cluster, obstacle space
PDF Full Text Request
Related items