Research And Application On Outlier Data Mining Algorithm In Large Data Set

Posted on: 2012-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: M Zheng
Full Text: PDF
GTID: 2178330338997790
Subject: Computer software and theory
Abstract/Summary:
Data mining is the process of discovering knowledge from large volumes of data, and it has become a hot topic in many fields. Its main purpose is to uncover potentially useful information that is hidden and previously unknown in large amounts of incomplete and noisy data. An outlier is a data item that departs markedly from the other data and does not conform to the common patterns or behavior. An outlier may be "noisy data", but it may also carry meaningful information that corresponds to reality. From the point of view of knowledge discovery, rare events are often more valuable than common ones in many domains. As a result, outlier mining is an important and meaningful research area in data mining. It has been applied in many areas, such as stock markets, telecommunications, finance, intrusion detection, and weather forecasting.

Outlier data mining consists of two parts: outlier detection and outlier analysis. This dissertation addresses the pivotal problem, outlier detection. Based on a study of the advantages and disadvantages of several outlier detection algorithms, we propose a double-clustering based algorithm and verify its accuracy and efficiency through experiments on both synthetic and real-life data sets. The specific work of this thesis is as follows:

1. Explain the research status of data mining at home and abroad and its research significance, the data mining process, and the relationship between data mining and the data warehouse. Comprehensively analyze the existing data mining algorithms, including their advantages, disadvantages, and applicable scope.
2. Based on a detailed analysis of two KNN-based data mining algorithms, put forward our algorithm: the double-clustering based KNN outlier detection algorithm. Experiments on a synthetic data set show that the double-clustering based algorithm is accurate and more effective than the original algorithm.
3. Apply the double-clustering based algorithm to the behavior analysis of registered BBS users. The final results show the accuracy of the proposed algorithm.

We evaluate the performance of the double-clustering based algorithm through experiments on both synthetic and real data sets. The data come from the UCI Machine Learning Repository, a data generator, and the registered-user information of the BBS of a navigational web site. The experimental results show that the algorithm is accurate and efficient and achieves satisfactory results.
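
As an illustration of the KNN-based outlier detection that the abstract refers to, the following is a minimal sketch of a generic distance-to-k-th-nearest-neighbor outlier score. It is not the double-clustering algorithm proposed in the thesis, whose details are not given in this abstract. The function names, the sample points, and the parameters `k` and `n` are assumptions chosen purely for illustration.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor.

    A larger score means the point lies farther from its neighbors,
    so it is a stronger outlier candidate. Brute force, O(n^2) distances.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < len(points)")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, in ascending order.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_n_outliers(points, k, n):
    """Return the indices of the n points with the largest k-NN distance."""
    scores = knn_outlier_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical data: four clustered points and one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_n_outliers(points, k=2, n=1))  # prints [4]
```

The brute-force loop compares every pair of points, which becomes expensive on large data sets. Reducing that cost, for example by partitioning or clustering the data first, is the kind of efficiency concern the abstract raises.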
Keywords/Search Tags: Data mining, outlier detection, KNN, clustering, partition