
Research On The Similarity Measure Of Discrete Data Based On Conditional Probability

Posted on: 2019-01-08
Degree: Master
Type: Thesis
Country: China
Candidate: B R Zheng
Full Text: PDF
GTID: 2428330566986427
Subject: Computational Mathematics
Abstract/Summary:
The distance or similarity between two instances plays an important role in data mining and machine learning. It is widely used in tasks such as classification, clustering, anomaly detection, feature selection, and instance retrieval. Distance measures for continuous data are well established, whereas similarity measures for discrete data remain of considerable research interest. Many data-driven similarity measures use the data set to estimate the distribution of attribute values and construct measure functions from the perspective of frequency, probability, or information entropy. For discrete data with class labels, the class information can guide the training of the learner. In this thesis, the similarity measure function is constructed from the class-conditional probability of the attribute values, and unordered and ordered discrete attributes are treated separately. The main research contents are as follows:

(1) A similarity measure for unordered discrete attributes based on conditional probability is proposed. The measure combines the conditional probability of the attribute values with information entropy theory and takes the ratio of the information common to two instances to the total information of the two instances as their similarity. Experiments on multiple data sets show that the learner achieves a lower error rate under this measure.

(2) A similarity measure for ordered discrete attributes based on conditional probability is proposed. The measure respects the order relationship of the attribute values: values that are close in the order receive a larger similarity, and values that are far apart receive a smaller one. It is combined with the measure proposed in (1) and applied to multiple data sets that mix ordered and unordered discrete attributes. The experimental results show that it performs better.

(3) The similarity measures proposed in this thesis are applied to a microcredit user application qualification data set that contains both ordered and unordered attributes, and are compared with other commonly used similarity measures on that data set. The experimental results show that the proposed measures perform better on various performance evaluation indicators, which indicates that they are effective to a certain degree.
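The abstract does not give the thesis's formulas, so the following is only a minimal sketch of how a similarity of this kind could be built; it should not be read as the author's method. It assumes a Lin-style ratio (information common to two values divided by their total information), Laplace-smoothed estimates of P(value | class), and a class-prior-weighted average over classes. The names `fit_attribute`, `value_similarity`, `alpha`, and `order` are hypothetical. For an ordered attribute, the sketch assumes the "common" event covers every value lying between the two values, so values further apart share less information.

```python
import math
from collections import Counter


def fit_attribute(values, labels, domain=None, alpha=1.0):
    """Estimate class priors and smoothed P(value | class) for one attribute.

    Assumption: Laplace smoothing with pseudo-count `alpha`; the abstract
    does not say how the conditional probabilities are estimated.
    """
    domain = list(domain) if domain is not None else sorted(set(values))
    class_counts = Counter(labels)
    pair_counts = Counter(zip(labels, values))
    n = len(labels)
    priors = {c: k / n for c, k in class_counts.items()}
    probs = {
        c: {v: (pair_counts[(c, v)] + alpha) / (k + alpha * len(domain))
            for v in domain}
        for c, k in class_counts.items()
    }
    return probs, priors


def value_similarity(probs, priors, x, y, order=None):
    """Lin-style similarity of two attribute values (illustrative only).

    order=None treats the attribute as unordered: the shared event is
    "value is x or y". If `order` is a list of values from lowest to
    highest, the shared event is "value lies between x and y inclusive".
    """
    if x == y:
        return 1.0
    sim = 0.0
    for c, p in probs.items():
        if order is None:
            shared = p[x] + p[y]
        else:
            i, j = sorted((order.index(x), order.index(y)))
            shared = sum(p[v] for v in order[i:j + 1])
        # Guard against floating-point sums that drift slightly above 1.
        common = 2.0 * math.log(min(shared, 1.0))
        total = math.log(p[x]) + math.log(p[y])
        # Both terms are <= 0, so the ratio falls in [0, 1].
        sim += priors[c] * common / total
    return sim


# Hypothetical toy data: one attribute with four levels and two classes.
values = ["low", "mid", "high", "top", "low", "low", "mid", "top"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
order = ["low", "mid", "high", "top"]
probs, priors = fit_attribute(values, labels, domain=order)
print(value_similarity(probs, priors, "low", "mid"))          # unordered
print(value_similarity(probs, priors, "low", "top", order))   # ordered
```

An instance-level similarity could then be formed by aggregating these per-attribute scores, for example by averaging them; the aggregation rule is likewise an assumption rather than something the abstract specifies.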
Keywords/Search Tags: Similarity measure, Discrete data, Conditional probability, Information entropy