
Outlier Detection Methods For Complex Data Types

Posted on: 2015-04-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Liu
Full Text: PDF
GTID: 1228330422981633
Subject: Computer application technology
Abstract/Summary:
Outlier detection is one of the key tasks of data mining. Its main objective is to mine rare data points together with their generating models or mechanisms, and to support in-depth analysis and understanding of the data. Outlying data points often carry important and meaningful information, and exploring them further requires a combination of data mining, data analysis, big data, and other theories and techniques. In recent years, application demand has grown in many fields, such as the detection of credit card fraud, stock insider trading, and network intrusion, as well as healthcare, military reconnaissance, and the protection of critical systems.

With the widespread adoption of modern network technology and mobile applications, we produce and store vast amounts of high-dimensional data, uncertain data, stream data, and other unstructured, complex types of data. Faced with such explosively growing complex data sets, effectively digging out the hidden outliers and analyzing the mechanisms behind them is a challenging task. This dissertation focuses on the problems that uncertain data and high-dimensional data pose for outlier detection and carries out a series of experiments and verifications. The main contents and achievements are as follows:

1) For the problem that the probability dimension attached to uncertain data blurs the results of data storage, processing, and display, we propose an outlier detection model based on local information (ULOF). It combines each data point's local level of uncertainty with its local density information to compute an uncertain local outlier factor for every point in the uncertain data set. The model effectively exploits the degree-of-uncertainty information in the data to mine outliers: it uses least-squares fitting of a high-order polynomial to represent the probability density function of the distances between data points, and it generalizes the definitions, concepts, and formulas of the classic LOF algorithm. To optimize the computation we 1) use dynamic programming to evaluate P_o(k_d) (the probability that point o has at least k neighbors within the distance k_d) in polynomial time, avoiding exponential time complexity; 2) estimate as narrow a range as possible for the K-η-distance and use an iterative algorithm to compute its exact value within that interval; and 3) apply pruning strategies to reduce the size of each data point's candidate neighborhood set. We evaluated the algorithm on several synthetic and real data sets and compared it with state-of-the-art outlier detection techniques for uncertain data. Experimental results show that ULOF has clear advantages in both detection accuracy and running time.

2) For the problems caused by the "curse of dimensionality", in which distance measures lose their original physical meaning and building an outlier mining model directly in the full-dimensional space is computationally inefficient, we propose a high-dimensional outlier detection model (LW_ABOD) that takes the locally weighted variance of the cosines of the angles between a data point and its neighbors as the outlier factor. By weighting the angle variance with local data point information, it effectively avoids misjudging points on the edge of a cluster as outliers and missing outliers that lie between different clusters.
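To make the angle-variance idea concrete, the following is a minimal Python sketch of a k-nearest-neighbour angle-based outlier factor in the spirit of the classic ABOD family: for each point it computes the variance of distance-weighted cosines of the angles spanned by pairs of its neighbours, and points with a small variance are outlier candidates. The function name, the choice of k, and the simple inverse-distance weighting are illustrative assumptions; the dissertation's LW_ABOD local-weighting scheme is not reproduced here.

```python
import numpy as np

def knn_abof_scores(X, k=10):
    """Angle-based outlier scores: for each point, the variance of the
    distance-weighted cosines of the angles spanned by pairs of its k nearest
    neighbours.  A small variance suggests an outlier, because all neighbours
    lie in a similar direction.  Simplified sketch, not the LW_ABOD model."""
    n = X.shape[0]
    # full pairwise distance matrix (O(n^2) memory, fine for a small demo)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.empty(n)
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:k + 1]          # skip the point itself
        terms = []
        for a in range(len(nbrs)):
            for b in range(a + 1, len(nbrs)):
                u = X[nbrs[a]] - X[i]
                v = X[nbrs[b]] - X[i]
                norm_uv = np.linalg.norm(u) * np.linalg.norm(v)
                if norm_uv == 0.0:
                    continue
                # cos(angle) weighted by 1 / (|u| * |v|): closer neighbours
                # contribute more to the variance
                terms.append(np.dot(u, v) / norm_uv ** 2)
        scores[i] = np.var(terms) if terms else 0.0
    return scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 100 clustered points in 20 dimensions plus one point far away
    X = np.vstack([rng.normal(0.0, 1.0, (100, 20)),
                   rng.normal(6.0, 1.0, (1, 20))])
    scores = knn_abof_scores(X, k=10)
    print("lowest-score (most outlying) indices:", np.argsort(scores)[:3])
```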
To improve execution efficiency, we 1) use random projection to map the original data set onto a low-dimensional space, build an index structure in the new space, and search there for the neighbors of each data point; 2) replace the original multi-loop computation of the angle variance with a cumulative form (ACC-ABOF), which reduces the time complexity by nearly one order of magnitude; and 3) further propose an incremental computation model (FU_ABOD) that avoids recomputing over the entire data set when new data are inserted. Multiple experiments and comparisons confirm that the algorithm is well suited to stream data and to applications with real-time requirements.

3) For the problems of the very sparse distribution of high-dimensional data and the exponential growth of the number of subspaces with the number of dimensions, we propose a method (RSub) that mines high-dimensional outliers in a number of relevant subspaces. The core of the approach is to infer the dependencies and correlations between attributes from the orthogonal eigenvectors and eigenvalues of the covariance matrix of the corresponding attributes. In contrast to finding the principal components represented by the largest variances, the eigenvectors corresponding to the smaller eigenvalues indicate attributes with relatively high redundancy and relatively strong correlation. The strength of a subspace's correlation and its number of dimensions influence the detection results to different degrees: 1) the stronger the subspace correlation, the greater its effect in mining outliers, so the correlation strength can be used as a weighting factor in the outlier degree; 2) the larger the subspace dimension, the sparser the data distribution, so neighbor distances must be adjusted according to the number of dimensions to make results computed in subspaces of different dimensionality comparable. The RSub algorithm finds the relevant subspaces directly in polynomial time, avoiding a traversal of the exponentially many subspaces; experimental results show that in terms of time efficiency it is far superior to other subspace-based outlier detection algorithms.
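The eigen-decomposition step can be illustrated with a short Python sketch. It is only a toy under stated assumptions, not the RSub algorithm itself: it flags groups of attributes that load heavily on eigenvectors of the attribute covariance matrix with small eigenvalues (low-variance directions indicate near-linear dependencies), and it uses the inverse eigenvalue as a crude stand-in for the "subspace correlation strength" weight mentioned above. The thresholds and the grouping rule are assumptions for illustration.

```python
import numpy as np

def correlated_subspaces(X, eig_ratio=0.05, load_threshold=0.3):
    """Toy illustration of finding correlated attribute subspaces from the
    covariance matrix: eigenvectors with small eigenvalues point along
    low-variance directions, i.e. near-linear dependencies among the
    attributes that load heavily on them.  Not the RSub algorithm itself."""
    cov = np.cov(X, rowvar=False)              # attribute covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    subspaces = []
    for lam, vec in zip(eigvals, eigvecs.T):   # rows of eigvecs.T are eigenvectors
        if lam < eig_ratio * eigvals.max():
            # attributes with large loadings on a low-variance eigenvector
            # form a strongly correlated candidate subspace
            attrs = tuple(int(i) for i in np.where(np.abs(vec) > load_threshold)[0])
            if len(attrs) >= 2:
                # 1 / eigenvalue as a crude "correlation strength" weight,
                # echoing the idea of weighting outlier scores by subspace relevance
                subspaces.append((attrs, 1.0 / (lam + 1e-12)))
    return subspaces

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.normal(size=500)
    b = 2.0 * a + rng.normal(scale=0.05, size=500)   # attribute 1 depends on attribute 0
    c = rng.normal(size=500)                         # independent attribute
    X = np.column_stack([a, b, c])
    for attrs, weight in correlated_subspaces(X):
        print("correlated subspace:", attrs, "weight:", round(weight, 1))
```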
Keywords/Search Tags: Outlier Detection, Uncertain Data, High-Dimensional Data, Data Mining, Subspace, Angle Variance, Local Information