Research On Data Preprocessing Methods Based On Clustering And Outlier Detection

Posted on: 2013-09-02
Degree: Master
Type: Thesis
Country: China
Candidate: R H Miao
GTID: 2248330395467844
Subject: Computer Science and Technology
Abstract/Summary:
With the continuing growth of networked information systems, many enterprises have accumulated unprecedented volumes of data. How to extract the information they need from this data and apply it to their own development decisions has become an urgent research problem, and data mining technology emerged in response. However, real-world data contains many errors, introduced for example during collection or manual entry, and these errors can significantly affect data mining results. It is therefore important to apply data preprocessing techniques to improve data quality. At the same time, clustering and outlier detection, two active research directions in data mining, have attracted growing attention. This thesis analyses the relationship between clustering and outlier detection methods and the requirements of data preprocessing, and then studies the corresponding data preprocessing methods.

Firstly, the thesis introduces the overall structure of a data-mining-oriented data preprocessing system, which divides data preprocessing into six parts according to the tasks involved in preprocessing and their application and research. It then describes the start-up of the system and the task of each functional module, with particular detail on the data cleaning module and the instance detection module.

Secondly, the thesis analyses how clustering can be used to handle noise within the data cleaning module of the preprocessing system. It introduces the basic concept of clustering, the classification of clustering algorithms, and the requirements placed on them. It also describes the two classic clustering algorithms implemented in the system and illustrates their steps with examples. Building on these two algorithms, an improved algorithm for noise removal is proposed, and the experimental results show that it improves the clustering results.

Finally, the thesis describes the outlier detection function implemented in the instance detection module and studies outlier detection in detail, including its basic concept, the classification of algorithms, and evaluation methods. It also explains the two outlier detection algorithms implemented in the system and improves the one that is based on a simple pruning strategy. The experiments show that the improved algorithm matches the accuracy of the original algorithm while being more efficient.
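
The abstract does not name the outlier detection algorithm that relies on a simple pruning strategy, so the following is only an illustrative sketch of what such a strategy can look like, not the thesis's method. It assumes a distance-based definition in which a point's outlier score is the distance to its k-th nearest neighbour, and it abandons a candidate as soon as that score can no longer reach the weakest score among the current top n. The function name `top_n_outliers` and its parameters are hypothetical.

```python
import heapq
import math


def top_n_outliers(points, k, n):
    """Return the n points with the largest k-th-nearest-neighbour distance.

    Result is a list of (score, index) pairs, highest score first.
    Assumed definition: score = distance to the k-th nearest neighbour.
    """
    if not 0 < k < len(points):
        raise ValueError("k must satisfy 0 < k < len(points)")

    top = []       # min-heap of (score, index): the best n outliers so far
    cutoff = 0.0   # weakest score in `top` once it holds n entries

    for i, p in enumerate(points):
        nearest = []   # negated distances: max-heap of the k smallest so far
        pruned = False
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # The current k-th distance is an upper bound on the final score.
            # If it is already below the cutoff, p cannot enter the top n.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue

        score = -nearest[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]

    return sorted(top, reverse=True)
```

Under these assumptions the pruned search should return the same top-n set as an exhaustive scan (up to ties at the cutoff), because a point is skipped only when its score is provably below the cutoff. The saving comes from the inner loop ending early for points that lie in dense regions.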
Keywords/Search Tags: Clustering, Outlier Detection, Data Preprocessing, Data Mining, Data Cleaning, Noise Processing