
Study Of Clustering And Outlier Detection Algorithm In Data Mining

Posted on: 2009-05-08
Degree: Master
Type: Thesis
Country: China
Candidate: L C Yang
Full Text: PDF
GTID: 2178360245995626
Subject: Computer software and theory
Abstract/Summary:
With the widespread use of information technology, information systems generate ever more data. Using this mass of raw data effectively to analyze the current situation and predict future trends has become a major challenge. Data mining emerged in response and has developed rapidly, driven by the growing gap between fast-increasing volumes of data and the scarcity of useful information.

Data mining, also called knowledge discovery in databases (KDD), is the process of extracting credible, novel, effective and understandable patterns from databases. It is a relatively young research and application area built on database techniques that draws on results from several disciplines, such as logic, statistics, machine learning, fuzzy theory and visual computing, to obtain usable information from databases. It has attracted increasing attention in recent years and has been applied in finance, insurance, public utilities, government, education, telecommunications, banking software development, transportation and other fields.

Clustering analysis is an important data mining technique. Clustering, an unsupervised classification method, is the process of grouping similar multi-dimensional data vectors into a number of clusters or bins. Clustering is carried out without prior knowledge, so the main research task is how to obtain a good clustering result under that condition. Most clustering research focuses on algorithms, with the aim of producing practical algorithms with better performance. Many clustering algorithms have been proposed so far, but each suits only particular problems and users. Moreover, they remain imperfect both theoretically and methodologically, and some have serious flaws.
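As an illustration of what grouping similar data vectors into clusters means in practice, here is a minimal sketch of a standard partition-based method (Lloyd-style k-means). It is not the algorithm proposed in this thesis; the function name `kmeans`, its parameters, and the sample points are assumptions made for the example. Note that the caller has to supply the number of clusters `k` in advance.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means on a list of equal-length numeric tuples.

    Illustrative only: k must be chosen by the caller, and the
    nearest-center rule tends to favor compact, roughly spherical clusters.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged
        centers = new_centers
    labels = [min(range(k), key=lambda i: math.dist(p, centers[i])) for p in points]
    return centers, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels = kmeans(data, k=2)
    print(centers, labels)
```

The sketch needs no prior labels, only the vectors themselves and a distance measure, which is the setting the abstract describes.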
Further optimizing clustering algorithms will help not only to complete their theory but also to promote their wider adoption and application.

This dissertation systematically studies and analyzes the techniques and methods of clustering analysis and, addressing the shortcomings of partition-based clustering algorithms, puts forward an improved clustering algorithm based on rough set theory. The improved algorithm addresses two problems of partitioning methods: the number of clusters cannot be set exactly in advance, and only spherical clusters can be found. It enables a partitioning method to discover clusters of arbitrary shape.

Most real-world databases contain noisy or outlier data. Some clustering algorithms are sensitive to such data, which may lead to clusters of poor quality. This dissertation therefore also studies and analyzes distance-based outlier detection in depth. To address the shortcomings of existing algorithms in performance and precision, it defines a new dissimilarity function that measures the degree of outlierness and uses it as the fitness function of a genetic algorithm, and on this basis proposes an improved outlier detection algorithm based on genetic algorithms. With this approach, the user only needs to specify the number of outliers, which reduces the user's workload and minimizes the influence of external factors. Extensive experiments on synthetic and real data show that the algorithm is correct and valid and that it performs better than other outlier detection algorithms.
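To make the idea of distance-based outlier detection with a user-specified number of outliers concrete, here is a minimal sketch. It uses a common textbook formulation (score each point by the distance to its k-th nearest neighbor, then report the n highest-scoring points). It is not the thesis's dissimilarity function or its genetic algorithm, neither of which is given in this abstract; the names `knn_distance_scores` and `top_n_outliers` and the parameter `k` are assumptions made for the example.

```python
import math


def knn_distance_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor.

    A larger score means the point lies farther from its neighbors.
    Requires len(points) > k.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_n_outliers(points, n, k=2):
    """Return the indices of the n points with the largest k-NN distance."""
    scores = knn_distance_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (1, 2), (50, 50)]
    print(top_n_outliers(data, n=1))  # index of the isolated point
```

This brute-force version compares every pair of points, so its cost grows quadratically with the data size. A search procedure such as a genetic algorithm would replace the exhaustive ranking with a guided search over candidate outlier sets.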
Keywords/Search Tags: Data Mining, Clustering Analysis, Outlier Detection, Rough Set Theory, Genetic Algorithm