Research And Application On Outlier Data Mining Algorithm In Large Data Set

Posted on:2012-09-18

Degree:Master

Type:Thesis

Country:China

Candidate:M Zheng

GTID:2178330338997790

Subject:Computer software and theory

Abstract/Summary:

Data mining is the process of discovering knowledge in large volumes of data, and it has become a hot topic in many fields. Its main purpose is to uncover potentially useful information that is hidden and previously unknown in large amounts of incomplete and noisy data. An outlier is a data object that departs markedly from the rest of the data and does not conform to the common patterns or behavior. An outlier may be "noisy data", but it may also carry meaningful information about the real world. From the point of view of knowledge discovery, rare events are often more valuable than common ones in many domains. As a result, outlier mining is an important and meaningful research topic in data mining. It has been applied in many areas, such as the stock market, telecommunications, finance, intrusion detection, and weather forecasting.

Outlier data mining consists of two parts: outlier detection and outlier analysis. This dissertation addresses the most pivotal of these, outlier detection. Based on a study of the advantages and disadvantages of several outlier detection algorithms, we propose a double-clustering based algorithm and verify its accuracy and efficiency through experiments on both synthetic and real-life data sets. The specific work of this thesis is as follows:

1. We review the state of data mining research at home and abroad and its significance, the data mining process, and the relationship between data mining and the data warehouse. We comprehensively analyze existing data mining algorithms, including their advantages, disadvantages, and scope of application.
2. Based on a detailed analysis of two KNN-based data mining algorithms, we put forward our algorithm, the double-clustering based KNN outlier detection algorithm. Experiments on a synthetic data set show that the double-clustering based algorithm is accurate and more effective than the original algorithm.
3. We apply the double-clustering based algorithm to an analysis of the behavior of registered BBS users; the final results show the accuracy of the proposed algorithm.

We evaluate the performance of the double-clustering based algorithm through experiments on both synthetic and real data sets. The data come from the UCI Machine Learning Repository, a data generator, and the registered-user information of the BBS of a navigational web site. The experimental results show that the algorithm is accurate and efficient and achieves satisfactory results.
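
To make the KNN-based detection idea concrete, the following is a minimal sketch of a plain distance-based KNN outlier score, in which each point is scored by the distance to its k-th nearest neighbor and the highest-scoring points are reported as outlier candidates. It is not the double-clustering algorithm proposed in the thesis: the abstract does not describe how the clustering steps are used, so they are omitted here. The function names, the brute-force neighbor search, and the toy data are all assumptions made for illustration.

```python
import math


def knn_outlier_scores(points, k):
    """Return, for each point, the distance to its k-th nearest neighbor.

    Brute-force baseline (quadratic in the number of points); illustrative
    only, not the thesis's double-clustering method.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < len(points)")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_n_outliers(points, k, n):
    """Return the indices of the n points with the largest k-NN distance."""
    scores = knn_outlier_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical toy data: four clustered points and one isolated point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_n_outliers(data, k=2, n=1))  # prints [4]
```

The brute-force search compares every pair of points, which becomes expensive on large data sets; reducing that cost is a common motivation for combining KNN-based detection with partitioning or clustering.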

Keywords/Search Tags:

Data mining, outlier detection, KNN, clustering, partition

Related items

1	Research And Application Of Outlier Detection Algorithm
2	Study On Outlier Mining Algorithms Based On Clustering
3	Study On The Algorithms Of Clustering And Outlier Detection Based On Neighborhood
4	Study And Implementation Of Clustering And Outlier Detection Algorithms
5	Study Of Clustering And Outlier Detection Algorithm In Data Mining
6	Study On Distance-Based Outlier Mining Algorithm
7	Research On Data Preprocessing Methods Based On Clustering And Outlier Detection
8	Outlier Mining Algorithm Research And Application
9	Study On Local Outlier Detection Algorithm Based On Multi-clustering
10	Study On Outlier Detection And Clustering Of Moving Trajectories