Research On Several Key Issues In Big Data Correlation Mining

Posted on:2019-04-18

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X C Tang

Full Text:PDF

GTID:1318330569487456

Subject:Computer application technology

Abstract/Summary:

With the growth of computing power and storage capacity in information systems, big data is generated continuously. In many scientific and business applications, various kinds of big data have been collected. The collected data have tremendous value and have attracted great attention worldwide. Our nation has a vast territory and a large population, and huge amounts of data are continuously generated. These data have become an important national strategic resource. Big data usually has high dimensionality and a massive number of data points, which brings both great opportunities and great challenges for analysis. On the one hand, the massive number of data points lets analysis methods work on the entire data set, which significantly improves the accuracy of probability estimation. On the other hand, traditional machine learning algorithms are incapable of analyzing big data with high dimensionality, and causation analysis becomes very difficult. In contrast, correlation analysis has received more and more attention because of its better interpretability and efficiency.

The research topic of this thesis is big data correlation analysis. The focus is on the correlation between features and the target variable, and between feature interactions and the target variable. In machine learning and data mining, feature selection is widely used to analyze the correlation between features and the target variable, since it can identify the important features related to the target variable. Therefore, this thesis employs feature selection methods to analyze feature-target and interaction-target correlations. Four key issues in big data correlation analysis are studied: feature interaction mining, factor ranking and significance analysis, the efficiency of big data correlation mining methods, and the application of big data correlation mining methods. The content of this thesis is divided into the following four parts.

(1) To address the issue of interaction mining in big data correlation mining, several information-theoretic feature selection methods were proposed. The research covered three aspects. First, a theoretical framework for mutual information based feature selection was proposed, serving as a theoretical foundation for feature selection. In this framework, the feature selection problem was decomposed into a sum of interaction terms. Theoretical analysis showed that many existing feature selection methods are special cases of the framework. Second, an interaction-based feature selection algorithm, referred to as Max-Interaction, was proposed and implemented. Max-Interaction considered higher-order interactions through interaction information. Third, a feature selection algorithm based on joint mutual information was proposed, referred to as Four-way Joint Mutual Information (FJMI). FJMI employed four-way joint mutual information to capture two-way through four-way interactions. Extensive experiments showed that Max-Interaction and FJMI could effectively identify significant interactions and could also improve the performance of feature selection.

(2) To address the issue of factor ranking and significance analysis in big data correlation mining, several feature selection methods based on Design of Experiments (DOE) were proposed. The research covered four aspects. First, a feature selection algorithm based on factorial design, referred to as Factorial Design based Feature Selection (FDFS), was proposed and implemented. FDFS was capable of selecting important features and interactions simultaneously, and it used p-values to show the statistical significance of each feature and interaction. FDFS successfully discovered an important interaction in the PM2.5 data set, namely the interaction between wind speed and wind direction. Second, fractional factorial design was employed to reduce the number of factor-level combinations in the factorial design, which allowed more features to be analyzed at a time. To improve the efficiency of FDFS, a fast divide-and-conquer algorithm was proposed to find the largest factorial design for the input data set. Third, an automatic parameter tuning method based on the Taguchi method was proposed. The Taguchi method was capable of identifying important parameters and also output the statistically optimal parameter values. Fourth, a method for applying experimental design to classification problems was proposed. The target variable was first converted to several binary variables. Each binary variable was then converted to a continuous variable through the logit function. Finally, the original classification result was obtained by merging the binary classification results, i.e., by conducting a linear regression on all the significant features and interactions selected for the binary responses.

(3) To address the low efficiency of traditional big data correlation mining methods, several quantum acceleration methods for feature selection were proposed. The research covered two aspects. First, a quantum acceleration for filter feature selection algorithms based on mutual information was proposed. The quantum counting algorithm was employed to accelerate histogram-based probability estimation. The quantum minimum algorithm was used to accelerate the search for the variable ranges and for the maximum value of the objective function. Thus, the filter feature selection algorithms achieved a quadratic speedup. Second, a quantum acceleration for embedded feature selection algorithms was proposed. The HHL algorithm (Harrow, Hassidim and Lloyd algorithm) was employed to accelerate matrix inversion, and the quantum inner product algorithm was employed to accelerate matrix multiplication. Thus, the embedded feature selection algorithms were accelerated.

(4) To address the practical application of big data correlation mining methods, the proposed information-theoretic feature selection methods were applied to analyze the feature correlations of text big data. Due to the rapid development of the World Wide Web and social networks, a large amount of text data is collected and processed. Text big data sets usually have high dimensionality and a large number of samples, since their features are many words or phrases. The proposed feature selection algorithms Max-Interaction and FJMI achieved better performance than existing feature selection algorithms for text categorization. They could also select significant features and interactions, where an interaction of words could be interpreted as a phrase.
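
The role of interaction terms in mutual information based feature selection can be illustrated with a small example. The Python sketch below is not the thesis's Max-Interaction or FJMI algorithm; it is a minimal illustration under assumed conditions (a toy XOR data set and simple count-based estimates), and the helper names `entropy` and `mutual_information` are hypothetical. It shows a case where each feature alone carries no information about the target, while the pair of features determines it completely.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a sequence of hashable values."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """Empirical I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from counts."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# Toy data (an assumption for illustration): the target is the XOR of two
# binary features, and each feature combination is observed exactly once.
rows = list(product([0, 1], repeat=2))
x1 = [a for a, _ in rows]
x2 = [b for _, b in rows]
y = [a ^ b for a, b in rows]

mi_x1 = mutual_information(x1, y)                    # relevance of x1 alone
mi_x2 = mutual_information(x2, y)                    # relevance of x2 alone
mi_joint = mutual_information(list(zip(x1, x2)), y)  # relevance of the pair

# Gain of the joint term over the individual terms. This is one common sign
# convention for interaction information; other conventions flip the sign.
interaction_gain = mi_joint - mi_x1 - mi_x2

print(f"I(x1;y)={mi_x1:.3f}  I(x2;y)={mi_x2:.3f}  I(x1,x2;y)={mi_joint:.3f}")
print(f"interaction gain={interaction_gain:.3f}")
```

A filter that ranks features only by their individual mutual information with the target would discard both features here, which is the kind of case interaction-aware criteria are meant to handle.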

Keywords/Search Tags:

big data, feature selection, mutual information, design of experiments(DOE), quantum computing

Related items

1	Research On Mutual Information Based Feature Selection Method For High Dimensional Small Sample Data
2	Research On Feature Selection Algorithm Based On Mutual Information
3	Research On Dynamic Feature Selection Algorithm Based On Mutual Information
4	Improvement On Mutual Information In Feature Selection Based On Composite Ratio
5	Research On Mutual Information Based Feature Selection Algorithm
6	Study Of Feature Selection Method Based On Mutual Information
7	The Feature Selection Based On Mutual Information And Decision Tree
8	Feature Selection Algorithms For High-throughput Data
9	Research On Feature Selection Algorithm Based On Lasso And Mutual Information
10	Two Feature Selection Algorithms Based On Mutual Information And Bayesian Optimization