Research On Generalized Mahalanobis Distances And Its Application In Data Mining

Posted on: 2013-06-24
Degree: Master
Type: Thesis
Country: China
Candidate: H Chen
Full Text: PDF
GTID: 2248330377456679
Subject: Computer application technology
Abstract/Summary:
Distance, as a measure, is widely used in scientific research and engineering. For example, similarity in clustering, matching in pattern recognition, and filtering criteria in information security can all be measured by distance. Yet the notion of distance has developed slowly since Euclidean geometry was founded around the 3rd century BC. With the rapid growth of e-commerce, data mining is booming again, and research on a new distance that overcomes the shortcomings of earlier ones and is better suited to data mining is of great significance.

This thesis focuses on data mining and develops a new distance that reflects more of the information between data objects. The distance is applied to traditional data mining, to the emerging field of uncertain data mining, and to distributed data mining. Theoretical analysis and simulations demonstrate its advantages. The main contents are as follows:

1. Distance calculation in data mining is summarized. The Euclidean distance, Manhattan distance, Mahalanobis distance and others are frequently used in data mining as similarity measures. This thesis analyzes their advantages and disadvantages to motivate the new distances that follow.

2. The MP Mahalanobis distance is proposed and applied to missing data imputation. The Mahalanobis distance is not affected by the scale of the dimensions and fully accounts for the correlations in the data, but in some situations, such as when the covariance matrix is singular and cannot be inverted, it does not exist. The MP Mahalanobis distance, based on singular value decomposition and the Moore-Penrose inverse, solves this problem. After the multiple correlation coefficient is improved, the MP Mahalanobis distance and entropy are used for missing data imputation. Simulation shows that the MP Mahalanobis distance not only exists for any data set but is also more accurate than the Mahalanobis distance.

3. The WMP Mahalanobis distance is proposed and applied to clustering. Although the MP Mahalanobis distance exists for any data set, the data correlation it embodies is purely objective, which may yield misleading information and even poor data mining results. Based on the spectral decomposition of real symmetric matrices and the weighted Moore-Penrose inverse, we propose the WMP Mahalanobis distance. Simulations with classical clustering algorithms show that the WMP Mahalanobis distance reflects data correlation with greatly improved accuracy.

4. A new framework for uncertain data mining is proposed. In a general data mining process, many factors contribute to the uncertainty of a data set: the inaccuracy of the raw data itself, the uncertainty introduced by data pre-processing, data integration, and so on. However, common data mining methods are designed for deterministic data. We propose a new framework for uncertain data based on probability dimensions. We also construct an instance of it and analyze it in combination with the WMP Mahalanobis distance.

5. A new distributed Bayes prediction method is proposed. Page ranking algorithms have attracted wide attention in e-commerce, and real-time ranking of terabyte-scale data in a distributed environment is a problem that still needs study. We improve the naive Bayes method and propose an offline data filtering method that filters out irrelevant data in advance, thus reducing the time required for real-time ranking. Finally, we analyze the prospects of the WMP Mahalanobis distance in distributed data mining.
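The idea behind the MP Mahalanobis distance in item 2 can be illustrated with a minimal sketch. It is not the thesis's implementation: it only assumes that the covariance inverse in the ordinary Mahalanobis distance is replaced by the Moore-Penrose pseudo-inverse, which NumPy computes through a singular value decomposition. The function name `mp_mahalanobis`, the cutoff `rcond=1e-10`, and the synthetic data are illustrative assumptions.

```python
import numpy as np


def mp_mahalanobis(x, y, data, rcond=1e-10):
    """Distance between x and y under the sample covariance of `data`.

    The covariance inverse is replaced by the Moore-Penrose pseudo-inverse
    (assumption of this sketch), so the value is defined even when the
    covariance matrix is singular.
    """
    cov = np.atleast_2d(np.cov(np.asarray(data, dtype=float), rowvar=False))
    # np.linalg.pinv uses an SVD; singular values below rcond * largest
    # singular value are treated as zero.
    cov_pinv = np.linalg.pinv(cov, rcond=rcond)
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    squared = diff @ cov_pinv @ diff
    # Guard against tiny negative values caused by rounding.
    return float(np.sqrt(max(squared, 0.0)))


# Synthetic data whose third column is the sum of the first two,
# so the 3x3 covariance matrix is singular and has no ordinary inverse.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
data = np.column_stack([base, base[:, 0] + base[:, 1]])

d_singular = mp_mahalanobis(data[0], data[1], data)

# On full-rank data the pseudo-inverse equals the ordinary inverse,
# so the sketch agrees with the classical Mahalanobis distance.
d_full = mp_mahalanobis(base[0], base[1], base)
```

When the covariance matrix is invertible, the pseudo-inverse coincides with the ordinary inverse, so this sketch reduces to the classical Mahalanobis distance. The weighted variant in item 3 is not sketched here because the abstract does not specify the weights.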
Keywords/Search Tags: Mahalanobis Distance, Missing Data Imputation, Clustering, SVD, Uncertain Data, Bayes