Study On Stable Correlation Among Features In A Dataset

Posted on: 2021-09-05
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Wang
Full Text: PDF
GTID: 2518306050467314
Subject: Computer Science and Technology

Abstract/Summary:
Correlation refers to a relationship between variables in a data set, that is, a regularity among the values of two or more variables; detecting it is a core research goal of data science. Detecting and analyzing such regularities among the variables in a data set is a way to study and understand the data, and then to summarize knowledge from it. In any given field, how well researchers understand and use that field's prior knowledge determines whether they can correctly evaluate a computed correlation.

Much progress has been made in detecting correlations between variables in a data set. From Pearson correlation and mutual information, the two earliest and most common methods, to the maximal information coefficient, the ability to detect correlations in data has improved greatly. However, there has been less research on whether a detected relationship is reliable, whether the result is significant, and whether it is stable. This thesis analyzes how to measure the stability of a correlation, designs a measurement framework for that stability, and implements it. The main contributions are as follows:

1. Based on bootstrap theory and confidence levels, a framework is designed for measuring the stability of the correlation between features in a data set, so that the quality of a detected correlation can be evaluated. For practical cases, a corresponding processing method is given for long-tailed distributions.

2. Based on complex network theory and the maximum spanning tree algorithm, an unsupervised feature selection algorithm is designed using the stability measurement framework. This algorithm can be used for feature selection on data sets, and it also serves to verify the effectiveness of the proposed stability measurement framework.

3. Compared with existing unsupervised feature selection algorithms, the proposed method performs better. On UCI data sets and a meningitis data set from a Grade III hospital, features are selected with both the proposed algorithm and existing algorithms, and decision tree and random forest classifiers are then used to assess the selected features by the final classification results. The experiments show that the proposed method outperforms many traditional unsupervised feature selection algorithms: it improves the final classification accuracy by 1% to 11% on multiple data sets. In addition, a comparison of the feature sequences produced by a variety of filter algorithms shows that, in general, the proposed algorithm selects better feature subsets faster than the others.

To sum up, this thesis studies how to measure the stability of the correlation between features in a data set, and presents a stability measurement framework and an unsupervised feature selection algorithm based on this stability. Experimental results show that the proposed feature selection algorithm improves classification accuracy, which indicates that the proposed stability measurement framework is, to a certain extent, reasonable and effective.
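
The bootstrap idea behind the stability measurement can be illustrated with a minimal sketch. The abstract does not give the thesis's actual stability score, so everything here is an assumption for illustration: Pearson correlation as the measure, a percentile confidence interval, and the hypothetical names `pearson`, `bootstrap_interval`, `is_stable`, and the threshold `max_width`.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def bootstrap_interval(xs, ys, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the Pearson correlation."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        # Resample (x, y) pairs with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    alpha = (1.0 - level) / 2.0
    lo = stats[int(alpha * (n_boot - 1))]
    hi = stats[int((1.0 - alpha) * (n_boot - 1))]
    return lo, hi


def is_stable(xs, ys, max_width=0.3, **kwargs):
    """Assumed rule (not from the thesis): stable if the interval is narrow and excludes zero."""
    lo, hi = bootstrap_interval(xs, ys, **kwargs)
    return (hi - lo) <= max_width and (lo > 0 or hi < 0)
```

Under these assumptions, a correlation whose value barely changes across resamples gives a narrow interval and counts as stable, while one driven by a few samples gives a wide interval or one that straddles zero. Any other correlation measure, such as mutual information, could be substituted for `pearson` without changing the resampling loop.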
Keywords/Search Tags: Correlation, Stability, Feature Selection, Complex Network, Data Mining