
Research On Distance Metrics For Nominal Attributes And Application

Posted on: 2013-02-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Q Li
Full Text: PDF
GTID: 1118330374480455
Subject: Earth Exploration and Information Technology
Abstract/Summary:
Instance-based learning algorithms, including nearest-neighbor learning algorithms, locally weighted learning algorithms, and memory-based reasoning, all depend on a good distance metric to be successful. The distance metric is the key component of distance-related machine learning algorithms. Distance metrics are also used in many other fields, including pattern recognition, neural networks, statistics, and cognitive psychology. Many distance metrics have been presented to measure the difference between two instances, such as the Euclidean, Manhattan, Minkowski, Mahalanobis, and Canberra metrics. However, these distance metrics only work well for numerical attributes and do not appropriately handle nominal attributes.

Compared with numerical attributes, it is a more sophisticated problem to define an appropriate distance metric for nominal attributes. To this end, researchers have presented metrics such as the Overlap Metric, the Value Difference Metric (VDM), the Modified Value Difference Metric (MVDM), the Short and Fukunaga Metric (SFM), the Minimum Risk Metric (MRM), the Entropy-Based Metric (EBM), and the Frequency-Based Metric (FBM).

In real applications, most datasets involve nominal attributes, and the tasks involved include classification, class probability estimation, and clustering, all of which are important tasks in machine learning and data mining. In this thesis, we study distance metrics for nominal attributes. Several primary problems need to be addressed:

1. How to understand the attribute independence assumption?

The Value Difference Metric (VDM) is widely used to measure the difference between nominal attribute values. According to our observation, VDM assumes attribute independence: it defines the distance between two instances as the sum of the value differences across all attributes, treating the attributes as independent (a minimal sketch of this formulation appears below). In fact, most distance metrics, from the simplest metric for nominal attributes, the Overlap Metric, to metrics for numerical attributes such as the Euclidean and Hamming metrics, define the distance between two instances as a sum of per-attribute value differences. Kasif et al. point out that VDM assumes attribute independence, just as naive Bayes does. Although the attribute independence assumption rarely holds in the real world, naive Bayes shows surprisingly good classification performance, and VDM remains one of the most widely used distance metrics for nominal attributes. How can we understand and apply the attribute independence assumption to improve existing distance metrics, or to design new distance metrics that are simple, comprehensible, and easily computed?
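To make the formulation concrete, here is a minimal Python sketch of VDM as it is usually stated in the literature: the difference between two values of an attribute is how differently they distribute over the classes, and the instance distance is the sum over attributes. The exponent q = 2 and the raw-frequency probability estimates are common defaults, not necessarily the exact choices of the thesis.

```python
from collections import Counter, defaultdict

def fit_vdm(X, y, q=2):
    """Precompute the conditional class frequencies P(c | a = v) that
    VDM needs, from a training set of purely nominal instances."""
    classes = sorted(set(y))
    counts = defaultdict(lambda: defaultdict(Counter))  # counts[a][v][c] = N(a=v, class=c)
    for xi, c in zip(X, y):
        for a, v in enumerate(xi):
            counts[a][v][c] += 1

    def value_diff(a, v1, v2):
        # Difference between two values of attribute a: how differently
        # they distribute over the classes.
        n1 = sum(counts[a][v1].values()) or 1
        n2 = sum(counts[a][v2].values()) or 1
        return sum(abs(counts[a][v1][c] / n1 - counts[a][v2][c] / n2) ** q
                   for c in classes)

    def distance(x1, x2):
        # The independence assumption: the total distance is simply the
        # sum of per-attribute value differences.
        return sum(value_diff(a, v1, v2)
                   for a, (v1, v2) in enumerate(zip(x1, x2)))

    return distance

# Usage on a toy dataset:
X = [("red", "round"), ("red", "square"), ("blue", "round")]
y = ["pos", "pos", "neg"]
vdm = fit_vdm(X, y)
print(vdm(("red", "round"), ("blue", "square")))
```

Nothing in the sum couples one attribute to another, which is exactly the independence assumption discussed above.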
2. How to express attribute dependence in distance metrics?

In the real world, attribute independence is unrealistic in most datasets. Nevertheless, most distance metrics assume attribute independence, just as naive Bayes does. Although naive Bayes shows surprisingly good classification performance, its performance suffers to some extent when the attributes are strongly dependent. Many techniques have been presented to improve naive Bayes. One of them is structure extension, whose key idea is to express attribute dependence with a limited number of directed edges; the resulting models are called augmented Bayesian network classifiers. In recent years, many researchers have presented augmented Bayesian network classifiers. These classifiers express attribute dependence through edges among the attributes and thus dispense with naive Bayes's strong independence assumption; compared with naive Bayes, they show better classification performance. Motivated by their success, we try to design more general distance metrics that take the dependence relationships among the attributes into account and therefore perform better in applications with complex attribute dependencies.

3. How to accurately estimate the class membership probabilities in probability-based distance metrics?

Many probability-based distance metrics, such as the Short and Fukunaga Metric (SFM) and the Minimum Risk Metric (MRM), need to estimate class membership probabilities, and accurate estimation is key to their success. Research has shown that full estimation of the class membership probabilities is an NP-hard problem. To reduce the computational complexity, naive Bayes is commonly used as the estimator in these distance metrics; as a result, their performance is harmed by the inaccurate estimates. In fact, researchers have shown experimentally on artificial datasets that SFM and MRM outperform VDM when accurate class membership probabilities are available. However, research has also shown that the class membership probability estimation ability of naive Bayes is poor. To improve it, researchers have presented several improved Bayesian network models. We therefore try to apply class membership probability estimators based on these improved Bayesian network models to probability-based distance metrics, and consequently improve the performance of the metrics.

4. How to overcome the curse of dimensionality?

This thesis studies distance metrics for nominal attributes, and a closely related problem is the curse of dimensionality, which has been noticed by many researchers. When there are many redundant and/or irrelevant attributes, the performance of learning algorithms suffers. For distance metrics the problem means that, when many irrelevant attributes exist and all attributes are used to compute the distance between two instances, the distance is governed by the irrelevant attributes; a distance metric that depends on all attributes will therefore be misleading.

One effective approach to the curse of dimensionality is to weight each attribute differently when measuring the distance between each pair of instances; this approach is widely known as attribute weighting. Another, more drastic approach is to completely eliminate the least relevant attributes from the attribute space; this approach is widely known as attribute selection. In recent years, researchers have contributed much work on attribute weighting and attribute selection, and the sketch below illustrates how both attach to the Overlap Metric.
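The following minimal sketch shows attribute weighting on the Overlap Metric, with attribute selection as the special case of 0/1 weights. The weight values themselves are hypothetical inputs here; in practice they would come from whatever relevance measure one adopts.

```python
def weighted_overlap(x1, x2, weights):
    """Overlap metric with per-attribute weights: each attribute
    contributes its weight when the two values disagree."""
    return sum(w * (v1 != v2) for w, v1, v2 in zip(weights, x1, x2))

# Attribute weighting: down-weight a less relevant second attribute.
print(weighted_overlap(("red", "round"), ("blue", "round"), [0.9, 0.1]))  # 0.9
# Attribute selection is the special case of 0/1 weights: weight 0
# removes an attribute from the metric entirely.
print(weighted_overlap(("red", "round"), ("blue", "square"), [1, 0]))     # 1
```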
In this thesis, we focus on attribute weighting and attribute selection for nominal-attribute distance metrics. For example, the Overlap Metric is widely applied because of its simplicity; we aim to enhance its performance by attribute weighting while keeping the metric simple. As another example, the well-known VDM assumes attribute independence; we aim to find attribute selection methods suited to that assumption.

As mentioned before, many learning algorithms in machine learning, pattern recognition, neural networks, statistics, and cognitive psychology are distance-related, and their performance depends on the distance metric used. For example, the k-nearest neighbor (KNN) algorithm, its variant distance-weighted k-nearest neighbor (KNNDW), and locally weighted naive Bayes (LWNB) are all distance-related algorithms (a minimal KNN sketch appears after item 2 below). We will apply the new distance metrics to improve these distance-related algorithms.

In view of the aforementioned problems, in this thesis we study distance metrics for nominal attributes and improve them from different perspectives. The major work includes:

1. The attribute independence assumption in distance metrics is investigated.

The attribute independence assumption is crucial to the naive Bayes classifier. Although the assumption rarely holds in the real world, naive Bayes shows surprisingly good performance. In fact, many distance metrics also assume attribute independence, such as the well-known VDM. The Short and Fukunaga Metric (SFM) is another widely used metric, which does not assume attribute independence. To improve the performance of SFM, in Chapter 2 of the thesis we propose a Modified Short and Fukunaga Metric (MSFM) based on the attribute independence assumption. MSFM is surprisingly similar to VDM in both its expression form and its computational complexity. Our experiments show that MSFM performs much better than SF2LOG (another improved version of SFM) and SFM, and is competitive with VDM.

2. Attribute dependence relationships are expressed in distance metrics.

Augmented Bayesian network classifiers express attribute dependence relationships through a limited number of directed edges, and consequently perform better than the naive Bayes classifier. VDM, one of the metrics most widely used to measure the distance between pairs of instances with nominal attribute values only, also assumes attribute independence. In Chapter 3 of the thesis, we investigate in detail the performance of the naive Bayes classifier and the augmented Bayesian network classifiers through experiments and theoretical analysis, and derive an improved Value Difference Metric by relaxing the unrealistic attribute independence assumption, called the One Dependence Value Difference Metric (ODVDM). In ODVDM, the structure learning algorithms for Bayesian network classifiers are used to find the dependence relationships among the attributes. Our experimental results on datasets with strong attribute dependence relationships validate the metric's effectiveness.
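As noted earlier, the distance-related learners (KNN, KNNDW, LWNB) treat the metric as a plug-in parameter. A minimal sketch of plain KNN makes this explicit; it reuses the hypothetical `fit_vdm` helper from the earlier sketch, and omits the distance weighting that KNNDW adds.

```python
from collections import Counter

def knn_predict(distance, X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training
    instances under the supplied distance function."""
    nearest = sorted(zip(X_train, y_train),
                     key=lambda pair: distance(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Any of the metrics discussed here plugs in unchanged, e.g. the VDM
# sketched earlier: knn_predict(vdm, X, y, ("blue", "square"), k=1)
```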
3. The estimation of class membership probabilities in probability-based distance metrics is improved.

When we apply the probability-based distance metrics SFM and MRM in distance-related learning algorithms, a key problem is how to accurately estimate the class membership probabilities. For simplicity, existing work uses naive Bayes classifiers to estimate them. However, it has been shown that the class membership probability estimates of naive Bayes classifiers are poor, which prevents SFM and MRM from performing well. In Chapter 4 of the thesis, we study the class probability estimation performance of several augmented Bayesian network classifiers and then apply them to estimate the class membership probabilities in SFM and MRM. Our experimental results on a large number of UCI datasets show that more accurate class probability estimation algorithms improve the performance of SFM and MRM.

4. The technique of attribute weighting is used to improve distance metrics.

Attribute weighting is an attractive way to improve the accuracy of a distance metric. In Chapter 5 of the thesis, we focus on the simplest distance metric for nominal attributes, the Overlap Metric, and use attribute weighting to improve its performance. Among the many distance functions, HEOM is the simplest metric that handles applications with both continuous and nominal attributes. We present an improved distance metric, the Correlation Weighted Heterogeneous Euclidean-Overlap Metric (CWHEOM); for discrete-class and continuous-class problems we apply different correlation functions to estimate the attribute weights (a sketch of weighted HEOM follows item 5 below). The improved metric significantly raises the generalization performance of HEOM while keeping the metric's simplicity and comprehensibility. Experimental results on 36 discrete-class datasets and 36 continuous-class datasets show that our new method achieves higher accuracy than HEOM.

5. The technique of attribute selection is used to improve distance metrics.

In the previous chapters, we apply distance metrics in distance-related learning algorithms for classification tasks. In many real-world applications, however, accurate class probability estimation is required instead of just a classification; probability-based classifiers also produce class probability estimates. In Chapter 6 of the thesis, we focus on the class probability estimation performance of KNN and KNNDW using VDM. We use attribute selection to improve the accuracy of VDM, and consequently the class probability estimation performance of KNN and KNNDW. The key question is what kind of attribute selection method is suitable for VDM. According to our observation, VDM assumes attribute independence just as naive Bayes does, so the attribute selection methods for VDM should select attribute subsets within which the independence assumption holds as far as possible. From this perspective we propose attribute selection methods based on CFS and SBC-CLL to select attribute subsets for VDM. Experimental results show that our attribute selection methods significantly improve the class probability estimation performance of KNN and KNNDW using VDM.
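Item 4 above builds on HEOM, and the sketch below states it under the standard Wilson and Martinez definition: overlap distance on nominal attributes, range-normalized difference on continuous ones. The optional per-attribute weights stand in for the thesis's correlation-derived weights; the exact correlation functions, and where the weight enters the squared term, are assumptions here, not the thesis's formulation.

```python
import math

def heom(x1, x2, is_nominal, ranges, weights=None):
    """Heterogeneous Euclidean-Overlap Metric: overlap (0/1) distance
    on nominal attributes, range-normalized absolute difference on
    continuous ones. Optional per-attribute weights give a
    CWHEOM-style weighted variant."""
    weights = weights if weights is not None else [1.0] * len(x1)
    total = 0.0
    for a, (v1, v2) in enumerate(zip(x1, x2)):
        if is_nominal[a]:
            d = float(v1 != v2)
        else:
            d = abs(v1 - v2) / ranges[a] if ranges[a] else 0.0
        total += weights[a] * d * d
    return math.sqrt(total)

# One nominal and one continuous attribute (range 10 on the latter):
print(heom(("red", 3.0), ("blue", 7.0), [True, False], [None, 10.0]))
```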
6. The distance metrics presented in the thesis are applied in distance-related learning algorithms to solve practical tasks in geophysics and engineering.

In every chapter, we run our experiments on many UCI datasets (http://archive.ics.uci.edu/ml/datasets.html). In addition, we take problems from geophysics and engineering, such as reservoir porosity prediction, gas emission prediction, rockburst prediction, and slope stability prediction, to study the practical value of the new distance metrics presented in the thesis.

In summary, we study the distance metrics for nominal attributes thoroughly and systematically, with particular attention to their connection with Bayesian networks. The Bayesian network is an excellent learning model that at first seems unrelated to the distance metric problem. In this thesis, we use the way attribute dependence relationships are represented in Bayesian networks to study distance metrics, transforming the distance measurement problem into the structure learning problem of Bayesian network classifiers. We also investigate existing class probability estimators in detail and apply them to improve the estimation of class membership probabilities. Some problems remain to be resolved, however, before Bayesian networks and distance metrics can be fully connected.

The main innovations of the thesis are:

1. As is well known, the naive Bayes classifier assumes attribute independence. In this thesis, we investigate the attribute independence assumption in distance metrics in detail and present improved distance metrics.

2. We express attribute dependence relationships in distance metrics, transforming the distance measurement problem into the structure learning problem of Bayesian network classifiers.

3. We investigate the existing work on class membership probability estimators, and apply estimators based on Bayesian networks to improve the probability estimation in distance metrics.
Keywords/Search Tags: distance metrics, nominal attributes, attribute independence assumption, attribute dependence relationships, class probability estimation