Research on the Classification Problem Based on Large Amounts of Inconsistent Data

Posted on: 2018-06-07
Degree: Master
Type: Thesis
Country: China
Candidate: H P Wang
GTID: 2348330533969814
Subject: Computer science and technology
Abstract/Summary:
In recent years, as the amount of real-world data has grown, inconsistent data has become more common. The traditional approach is to repair inconsistent data by manual correction. However, as the volume of inconsistent data grows exponentially, manual correction becomes increasingly time-consuming. Moreover, as the amount of data increases, manual modification inevitably introduces human errors, which add erroneous data to the data set. Manual correction is therefore no longer feasible. The core research question of this thesis is how to perform feature selection and classification directly on inconsistent data, without manual modification.

The decision tree is a good classification algorithm: it tolerates erroneous data and outliers well, and the tree structure of the trained model is interpretable, showing directly the subsets into which the data are classified. This thesis therefore chooses the decision tree as the classification algorithm to improve. Mutual information measures the degree of correlation between an individual feature and the target feature, and is computed from their joint probability. This thesis therefore chooses mutual information as the feature selection algorithm to improve.

We first improve the decision tree algorithm so that it can classify inconsistent data directly and obtain better results. The thesis focuses on functional dependencies as the constraint conditions that define inconsistent data. Because the pre-features and post-features of a functional dependency behave differently in the data, the algorithm is designed differently for each, so that the improved algorithm computes pre-features and post-features in different ways. The objective function of the decision tree algorithm is improved by changing how the features that appear in the constraint conditions are split, so that inconsistent data can be partitioned. The thesis measures, under several measures, the effect that features in the constraint conditions have on the classification results, and uses this to adjust the influence factors of those features and make the splits of the decision tree more accurate.

As the amount of inconsistent data increases, the dimensionality of the data also increases. High dimensionality makes building the classification model time-consuming, and features that are only weakly related to the target feature contribute little to the model. The mutual information algorithm is therefore improved so that it can evaluate feature importance on inconsistent data sets and select the features that most influence the target feature for building the classification model. As with the decision tree, the features of the functional dependencies in the constraint conditions are divided into pre-features and post-features, and different improvements are designed according to how each behaves in inconsistent data.

According to the comparison experiments, the improved decision tree and mutual information algorithms perform significantly better than the baseline algorithms they were compared against.
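To make the two building blocks named in the abstract concrete, the following is a minimal sketch in Python. It shows only the standard, unimproved versions: a check for rows that violate a functional dependency, and plain mutual information between a discrete feature and the target. It is not the improved algorithm of the thesis, whose details the abstract does not give. The function names, the column names and the sample rows are all hypothetical, and reading "pre-feature" as the left-hand side of a dependency and "post-feature" as its right-hand side is an assumption.

```python
from collections import Counter, defaultdict
from math import log2


def fd_violations(rows, pre_feature, post_feature):
    """Return pre-feature values that map to more than one post-feature value.

    Assumption: a functional dependency pre_feature -> post_feature is
    violated when one pre-feature value co-occurs with several different
    post-feature values.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[pre_feature]].add(row[post_feature])
    return {value: posts for value, posts in seen.items() if len(posts) > 1}


def mutual_information(xs, ys):
    """Standard mutual information (in bits) of two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_features(rows, features, target):
    """Sort features by mutual information with the target, highest first."""
    ys = [row[target] for row in rows]
    scores = {f: mutual_information([row[f] for row in rows], ys) for f in features}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: the dependency zip -> city is violated for "10001".
rows = [
    {"zip": "10001", "city": "NYC", "label": 1},
    {"zip": "10001", "city": "Boston", "label": 1},
    {"zip": "02101", "city": "Boston", "label": 0},
    {"zip": "02101", "city": "Boston", "label": 0},
]

violations = fd_violations(rows, "zip", "city")
ranking = rank_features(rows, ["zip", "city"], "label")
```

On this toy data, `violations` flags the value `"10001"`, and `ranking` places `zip` above `city`, because `zip` fully determines the label while the inconsistent `city` values weaken that feature's link to the target. An approach along the lines described in the abstract would adjust such scores for features that take part in a dependency, rather than use them unchanged.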
Keywords/Search Tags:inconsistent data, classification, feature selection, decision tree, mutual information