| With the progress of data mining technology and the popularity of big data analysis platform like Hadoop and Spark, the difficulty of analysis of large data sets has been significantly reduced, and the data quality is obviously improved, therefore the small number of outliers in huge amount of data is no longer worthless. For example, the police office is more concern about the outliers of criminal or may happen crime than finding the common patterns of data, and some application with high social value is tend to be ignored, like using the outlier detection to help Banks detect fraudulent trading and help companies find anomaly drugs. Using traditional outlier detection techniques combining emerging data mining technology to explore the implicit outlier pattern has the very high research value.Outlier mining algorithm based on clustering is the most mainstream of outliers mining algorithm at present, however it has two difficulties. First, it is not only limited by related clustering algorithm in performance, but it has high computational complexity; the second difficulty is to defining the metrics of outliers, because there isn’t a flexible metrics for users to consider and discuss. To solve these problems, foreign scholars propose the method of using the outlier factor to demonstrate degree of outlier, but outlier factor like LOF or LDOF although has advantages like High stability and high accuracy, it also has the disadvantage of high computing complexity. This paper proposed CFLDOF-algorithm which use CF-tree to optimize LDOFalgorithm, based on pruning the data set to optimize LDOF-algorithm. In this paper, experiments confirmed that CFLDOF-algorithm optimized LDOF-algorithm’s computation time and it also has a similar accuracy with LDOF-algorithm. In addition to, based the idea of parallel algorithm this paper improved CFLDOF-algorithm to make it run on Spark. The main work is as follows:1) Proposed CFLDOF-algorithm which use CF-tree to optimize LDOF-algorithm, then proposed the CFLDOF algorithm.2) Comparing experiment approve that CFLDOF not only optimize LDOF algorithm on the time complexity, also has the similar accuracy to LDOF algorithm;3) Proposed parallel design of the CFLDOF algorithm, also provide the pseudo code of realize the CFLDOF algorithm based on Spark platform.Combining with the paper work, can get the conclusion that, CFLDOF algorithm can optimize the computational complexity of the LDOF algorithm, and it has the similar accuracy with LDOF algorithm, using CF-tree to optimize LDOF-algorithm is correct。... |