The decision tree method is a data classification and mining technique grounded in information theory and widely used to solve classification problems. Its basic idea is to build a tree of decision rules from a known training dataset and then use the constructed tree to predict the labels of new data. Many decision tree algorithms have been proposed to analyze known class information and generate a predictive model. Most of these models rely on decision rules such as information gain, the information gain ratio, the Gini coefficient, the median, and the mean, all of which assume that the data features are mutually independent. This thesis therefore proposes a feature selection method that explicitly accounts for interactions between features.

In addition, several decision tree methods cannot handle continuous data directly, and most existing discretization algorithms for continuous data assume that the dataset contains no outliers or noise and that discretization is unaffected by such data. In practice, however, the classification accuracy of decision tree classifiers does suffer from outliers and noise. To mitigate this effect, this thesis uses the normal distribution to discretize continuous data, separating outlier values and producing more informative discrete values, thereby improving classification accuracy. Specifically, the samples are discretized into equal-probability intervals based on the normal distribution, which reduces the impact of noisy samples and provides a sound data foundation for the Jensen-Shannon divergence decision tree algorithm. The data discretized by the proposed method are compared with data obtained from equal-interval discretization, equal-frequency discretization, and binarization, using Naive Bayes, Support Vector Machine (SVM), ID3, and CART classifiers. The results show that the proposed discretization method is robust to noisy data and achieves higher classification accuracy.
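The abstract leaves the discretization procedure at a high level. The following minimal sketch illustrates one plausible reading, in which a normal distribution is fitted to each feature, interior cut points are placed at equal-probability quantiles, and values beyond a fixed number of standard deviations are routed to dedicated outlier bins. The function name normal_discretize, the default bin count, and the 3-sigma outlier cutoff are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np
from scipy.stats import norm

def normal_discretize(x, n_bins=4, z_cut=3.0):
    """Discretize one continuous feature using a fitted normal distribution.

    Interior cut points are the equal-probability quantiles of the fitted
    normal, so each interior bin is equally likely under that distribution.
    Values beyond z_cut standard deviations fall into separate edge bins,
    keeping outliers from distorting the interior intervals.
    """
    mu, sigma = x.mean(), x.std()
    # Equal-probability quantiles of N(mu, sigma) for the interior cut points.
    probs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    cuts = norm.ppf(probs, loc=mu, scale=sigma)
    # Add outlier boundaries at +-z_cut standard deviations.
    edges = np.concatenate(([mu - z_cut * sigma], cuts, [mu + z_cut * sigma]))
    # Bin 0 collects low outliers; the highest bin collects high outliers.
    return np.digitize(x, edges)
```

Applied column-wise to a feature matrix, such a function would yield the discrete inputs on which the downstream classifiers are then compared.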
Decision tree models classify instances by modeling their feature values, and the feature selection strategy is a crucial component of any decision tree algorithm. Previous decision tree algorithms rely on a greedy search strategy that assumes features are independent of one another, which often leads to suboptimal trees. This thesis proposes a new feature selection method, the Jensen-Shannon divergence feature selection method, which generates a decision tree called the Jensen-Shannon divergence decision tree (JSDT). The method uses the entries of a Jensen-Shannon divergence matrix as a new measure for selecting features and determining the optimal split feature set during tree growth (a sketch of one possible form of this measure follows the summary below). This feature selection strategy serves two purposes: (a) by starting from the features of the original dataset, the decision tree avoids spending excessive time searching for a useful classification feature; and (b) the tree generation process finds features useful for classification more quickly and efficiently than a general exhaustive search. JSDT was evaluated experimentally on 13 datasets and compared with traditional decision tree classifiers such as ID3, C4.5, and CART. The results show that JSDT can effectively find interacting features at a lower computational cost than traditional methods.

In summary, this thesis proposes methods that address the interaction between data features and improve the classification accuracy of decision tree models. The proposed discretization method is robust to noisy data, and the JSDT feature selection method is more efficient than traditional methods.
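The abstract likewise does not spell out how the Jensen-Shannon divergence matrix is constructed. As a hedged illustration only, the sketch below computes pairwise divergences between the empirical value distributions of discretized features; the helper names js_divergence and jsd_matrix, and the choice of which distributions are compared, are assumptions made for exposition rather than the thesis's definition.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence in bits.

    JSD(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), where M = (P + Q) / 2.
    """
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def jsd_matrix(X, n_values):
    """Pairwise JSD between empirical value distributions of discrete features.

    X is an (n_samples, n_features) integer array with values in
    {0, ..., n_values - 1}; the result is a symmetric matrix whose (i, j)
    entry is the divergence between the distributions of features i and j.
    """
    n_features = X.shape[1]
    # Empirical distribution of each feature over its discrete values.
    dists = np.array(
        [np.bincount(X[:, j], minlength=n_values) for j in range(n_features)],
        dtype=float,
    )
    D = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(i + 1, n_features):
            D[i, j] = D[j, i] = js_divergence(dists[i], dists[j])
    return D
```

A selection rule could then, for example, favor features whose rows in this matrix diverge strongly from those of already-selected features; whether JSDT uses exactly this rule is not stated in the abstract.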