
Research on Discretization Oriented to the Naïve Bayes Algorithm

Posted on: 2009-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Xie
GTID: 2178360242989485
Subject: Computer application technology
Abstract/Summary:
In data mining, many algorithms can handle only qualitative attributes. When processing quantitative attributes, the Naïve Bayes learning algorithm assumes that the values of each quantitative attribute follow a normal distribution. This assumption is often violated in real-world datasets, which severely restricts the performance of Naïve Bayes. A preprocessing step called discretization is therefore usually required before classification. Discretization is an important part of data mining: it is a data transformation process that converts quantitative data into qualitative data. It can not only improve the accuracy and efficiency of classification, but also allow more data mining algorithms to be applied to datasets that contain quantitative attributes. Discretization therefore has considerable practical significance and research value.
In this thesis, we first classify data. We then review the relevant theory of data mining and classification, and study in depth the Naïve Bayes algorithm and the ways a Naïve Bayes classifier can handle quantitative attributes. Next, we analyze the current state of research on discretization and point out in particular that discretization is effective for Naïve Bayes. After studying the EMD method and the MDL principle, we propose a new MDL-based discretization method named Multi-EMD. Multi-EMD is a multivariate, supervised discretization method. Like EMD, it selects the best cut point by finding the cut point with minimum entropy, but it evaluates each cut point with a multivariate MDL principle that takes the effect of all numeric attributes in the dataset into account, which leads to a more reliable evaluation. We also study in depth another discretization method, the unsupervised method PKI.
We then combine EMD and PKI, using EMD's method to search for the best cut point and PKI's method to determine the number of discrete intervals; this yields the PEMD algorithm. Finally, we compare these methods on the Weka platform. The experimental results show that Multi-EMD performs better than EMD, and that PEMD performs better than both EMD and PKI.
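To make the building blocks named in the abstract concrete, here is a minimal Python sketch. It makes several assumptions that the abstract does not confirm. It assumes EMD refers to the standard single-attribute entropy minimization discretization, with the MDL acceptance test in its usual Fayyad and Irani form. It also assumes PKI refers to proportional k-interval discretization, which targets roughly the square root of n intervals for n training values. The function names (`entropy`, `best_cut_point`, `mdl_accepts`, `pki_interval_count`) are hypothetical. The sketch does not reproduce the multivariate MDL criterion of Multi-EMD or the exact PEMD procedure, since neither is specified here.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut_point(values, labels):
    """Return (cut, weighted_entropy) for the boundary that minimizes the
    class-information entropy of the two-way split, or None if no split exists."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical attribute values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        e = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if best is None or e < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, e)
    return best


def mdl_accepts(values, labels, cut):
    """Single-attribute MDL acceptance test (assumed Fayyad-Irani form):
    accept the cut if the information gain exceeds the MDL cost."""
    left = [lab for v, lab in zip(values, labels) if v <= cut]
    right = [lab for v, lab in zip(values, labels) if v > cut]
    n = len(labels)
    ent_s, ent_l, ent_r = entropy(labels), entropy(left), entropy(right)
    gain = ent_s - (len(left) / n) * ent_l - (len(right) / n) * ent_r
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * ent_s - k1 * ent_l - k2 * ent_r)
    return gain > (math.log2(n - 1) + delta) / n


def pki_interval_count(n):
    """Assumed PKI rule: about sqrt(n) intervals for n training values."""
    return max(1, math.isqrt(n))


values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
cut, weighted = best_cut_point(values, labels)
print(cut, mdl_accepts(values, labels, cut), pki_interval_count(100))
```

On this toy attribute the two classes separate cleanly, so the minimum-entropy boundary falls midway between 3 and 10 and passes the MDL test. A PEMD-style combination would presumably keep searching for minimum-entropy cut points until the interval count suggested by `pki_interval_count` is reached, but that reading is an assumption.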
Keywords/Search Tags: Data Mining, Discretization, Classification, Naïve Bayes, MDL