Research on Classification over Uncertain Data

Posted on: 2013-01-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H M Chen
GTID: 1118330374959495
Subject: Communication and Information System

Abstract/Summary:
Uncertain data has attracted considerable attention as techniques for gathering and processing it have developed rapidly. Uncertainty poses challenges for data modeling, data management, and data mining. Uncertain data mining is a new research direction and is more challenging than traditional data mining because of the underlying uncertainty in the data. Classification, meanwhile, is an important data mining task with applications in many fields. It is therefore desirable to develop effective and efficient classification methods over uncertain data.

This thesis investigates three basic classification methods (nearest neighbor classification, naive Bayesian classification, and basic decision tree classification) over two levels of uncertain data: uncertain data with exact confidence values/probabilities and uncertain data missing confidence values/probabilities. The results help to enrich the theory and technology of uncertain data mining and to broaden the application fields of classification over uncertain data. The main contributions of this thesis are summarized as follows:

(1) Nearest neighbor classification over uncertain data with exact probabilities is studied. The presented methods reduce the time complexity of nearest neighbor classification over value-uncertain continuous objects and improve its accuracy over value-uncertain discrete objects. ① For value-uncertain continuous objects, the expected distances between objects are defined, the expected squared distances are adopted to evaluate the expected distances, and a formula for efficiently computing the expected squared distances is given in order to reduce the time complexity. Under certain conditions, the expected squared distances achieve the same accuracy as the expected distances at a lower time complexity. ② For value-uncertain discrete objects, in order to improve accuracy, the expected semantic distances between objects are defined from a semantic point of view using orders or concept hierarchy trees, and indexing and pruning strategies are used to compute the expected semantic distances efficiently. The accuracy obtained with the expected semantic distances can be improved if the semantic distances are defined reasonably, at an acceptable time complexity. Nearest neighbor classification over value-uncertain objects can also be used to classify certain objects, because the expected distances, the expected squared distances, and the expected semantic distances apply directly to certain objects.

(2) Naive Bayesian classification over uncertain data missing probabilities is studied. Based on the theory of interval probability, naive Bayesian classification is extended to naive Bayesian classification with interval probability parameters, which can handle both value-uncertain discrete objects and certain discrete objects. ① The interval probabilities of value-uncertain discrete objects are defined from the point of view of probabilistic cardinality, and it is proven that these interval probabilities are F-probabilities in the theory of interval probability. ② Based on the theory of interval probability, the conditional interval probabilities of value-uncertain discrete objects are defined, including both the intuitive concept and the canonical concept, and independence and conditional independence are defined for the intuitive concept. Further, a formula for efficiently computing the intuitive concept is given. ③ Naive Bayesian classification with interval probability parameters over value-uncertain discrete objects is presented, in which the intuitive concept is used as the posterior interval probability and the conditional interval probability, and the canonical concept is used to reconstruct the joint interval probability in order to compute the posterior interval probability. This classifier can handle both value-uncertain discrete objects and certain discrete objects because certain discrete objects are special cases of value-uncertain discrete objects, and the theory of interval probability generalizes the theory of classic probability.

(3) Basic decision tree classification over uncertain data missing probabilities is studied. Based on reachable probability intervals, basic decision tree classification is extended to handle both value-uncertain discrete objects and certain discrete objects, assigning objects to branches with probability intervals. ① The probability intervals and the conditional probability intervals of value-uncertain discrete objects are defined from the interval probabilities and from the intuitive concept of the conditional interval probabilities, and it is proven that the probability intervals and the conditional probability intervals are reachable probability intervals. ② Based on the reachable probability intervals, the entropy intervals and the conditional entropy intervals of value-uncertain discrete objects are defined. The upper bound of the entropy interval is the maximum of the entropies of the reachable probability intervals, and the lower bound is the lower bound of those entropies. A method for computing the upper and lower bounds of the entropy interval is given. ③ Basic decision tree classification over value-uncertain discrete objects is presented, which adopts binary decisions and set-based tests and uses the conditional entropy intervals to select the best attribute. Classic probability is a special case of the reachable probability interval, and certain discrete objects are special cases of value-uncertain discrete objects, so the presented basic decision tree classification can handle both value-uncertain discrete objects and certain discrete objects.
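The abstract does not state the thesis's formula for the expected squared distance in contribution (1), so the following is only an illustrative sketch of why such a formula can reduce time complexity. It assumes two independent value-uncertain objects, each represented by a hypothetical list of equally likely sample points, and uses the standard identity E[||X - Y||^2] = ||mean(X) - mean(Y)||^2 + (sum of per-dimension variances of X and Y). All function and variable names are assumptions made for this example, not taken from the thesis.

```python
from itertools import product


def mean_and_variance(samples):
    """Per-dimension mean and population variance of equally likely samples."""
    n = len(samples)
    dims = len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    variances = [
        sum((s[d] - means[d]) ** 2 for s in samples) / n for d in range(dims)
    ]
    return means, variances


def expected_sq_distance_closed_form(xs, ys):
    """Closed form: one pass over each object, O(m + n) sample visits.

    Assumes the two uncertain objects are independent.
    """
    mean_x, var_x = mean_and_variance(xs)
    mean_y, var_y = mean_and_variance(ys)
    return sum(
        (a - b) ** 2 + va + vb
        for a, b, va, vb in zip(mean_x, mean_y, var_x, var_y)
    )


def expected_sq_distance_brute_force(xs, ys):
    """Reference value: average over every sample pair, O(m * n) visits."""
    total = sum(
        sum((a - b) ** 2 for a, b in zip(x, y)) for x, y in product(xs, ys)
    )
    return total / (len(xs) * len(ys))


# Hypothetical two-dimensional uncertain objects (made-up sample points).
x_samples = [(0.0, 0.0), (2.0, 0.0)]
y_samples = [(1.0, 3.0), (1.0, 5.0), (1.0, 4.0)]

closed = expected_sq_distance_closed_form(x_samples, y_samples)
brute = expected_sq_distance_brute_force(x_samples, y_samples)
```

Both functions return the same value up to floating-point rounding, but the closed form needs only the means and variances, which can be precomputed once per object. No comparably simple identity holds for the expected (non-squared) distance, which is consistent with the abstract's choice of the squared distance as the cheaper surrogate.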
Keywords/Search Tags: uncertain data, classification, expectation, theory of interval probability, reachable probability interval