Integrative Analysis On High-dimensional Network-based Heterogeneous Datasets

Posted on:2022-11-30

Degree:Master

Type:Thesis

Country:China

Candidate:Q W Zhang

GTID:2568306326974229

Subject:Applied Statistics

Abstract/Summary:

As data collection expands worldwide, new kinds of data problems pose serious challenges to traditional data mining techniques. The first is the diversity of data sources. For example, electronic medical records for the same disease may come from different hospitals, and real estate transaction data may come from different communities. Although such data are collected for the same or similar tasks, differences in measurement environment, measurement standards, or statistical definitions mean that heterogeneity among datasets cannot be ignored. The second is the high dimensionality and sparsity of the data. The value density of big data is low, because any small piece of information that might help explain the outcome is collected. The information therefore needs to be filtered and denoised during modeling.

To address the limitations of existing research on modeling complex multi-source high-dimensional datasets, this thesis builds on earlier methods and proposes a new approach. For the heterogeneity problem, the related information between data sources is used to reduce the complexity of the integration task. For the high-dimensionality and sparsity problem, variable selection methods are combined with integrative analysis methods. On this basis, the thesis proposes the snMCP (sparse network Minimax Concave Penalty)-Logistic model for integrating multi-source high-dimensional data. The model uses the K-nearest neighbor method to build a network structure between data sources and imposes a network MCP penalty on the model coefficients of datasets that are connected in the network, so that homogeneous and heterogeneous datasets are identified automatically. An MCP penalty is used to select the important variables of each dataset. Model estimation and clustering of the datasets are thus carried out simultaneously. Because the objective function is large and complex, the thesis derives a corresponding ADMM (Alternating Direction Method of Multipliers) algorithm for the optimization problem.

To evaluate the strengths and weaknesses of the snMCP-Logistic model, the thesis designs three representative numerical experiments. The simulations show that the proposed method performs well in feature selection, parameter estimation, and classification accuracy under different simulation settings. For the empirical analysis, heterogeneous real estate rental evaluation data from 397 sources are used, with latitude and longitude information serving to construct the network structure between data sources. The empirical results show that the proposed method makes effective use of the heterogeneity generated by regional factors and improves the classification ability of the model. It attains the highest AUC among all models compared, and it also yields a set of important variables for each regional location. This improves the interpretability of the model, helps target future data collection, and demonstrates the good performance of the proposed method in practical applications.
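The abstract does not give the objective function, so the following is only a minimal sketch of the kind of objective it describes, under stated assumptions: a per-source logistic loss, an elementwise MCP penalty on each source's coefficients, and an MCP penalty on the Euclidean norm of coefficient differences across K-nearest-neighbor edges. The function names (`mcp`, `knn_edges`, `objective`), the tuning parameters (`lam1`, `lam2`, `gamma`), and the use of a norm in the network term are all illustrative choices, not taken from the thesis, and the ADMM solver is not shown.

```python
import numpy as np

def mcp(t, lam, gamma):
    """Minimax concave penalty, applied elementwise to |t|."""
    a = np.abs(t)
    return np.where(a <= gamma * lam,
                    lam * a - a ** 2 / (2.0 * gamma),  # concave region
                    0.5 * gamma * lam ** 2)            # flat region

def knn_edges(coords, k):
    """Undirected K-nearest-neighbor edges between data sources.

    coords has one row per source (for example latitude, longitude).
    """
    coords = np.asarray(coords, dtype=float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a source is not its own neighbor
    edges = set()
    for m in range(len(coords)):
        for j in np.argsort(d[m])[:k]:
            edges.add((min(m, int(j)), max(m, int(j))))
    return sorted(edges)

def objective(betas, Xs, ys, edges, lam1, lam2, gamma=3.0):
    """Assumed form: logistic loss + sparsity MCP + network MCP."""
    loss = 0.0
    for beta, X, y in zip(betas, Xs, ys):
        z = X @ beta
        # Mean negative log-likelihood of a logistic model, labels in {0, 1}.
        loss += np.mean(np.logaddexp(0.0, z) - y * z)
    # MCP on each coefficient: selects variables within each source.
    sparsity = sum(float(mcp(beta, lam1, gamma).sum()) for beta in betas)
    # MCP on coefficient differences of connected sources (assumed L2 norm):
    # a zero difference means the two sources share one model.
    fusion = sum(float(mcp(np.linalg.norm(betas[a] - betas[b]), lam2, gamma))
                 for a, b in edges)
    return loss + sparsity + fusion
```

Because MCP is flat beyond `gamma * lam`, large coefficients and large between-source differences are not penalized further, which is what allows clearly heterogeneous sources to stay separate while similar neighbors are fused.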

Keywords/Search Tags:

Multi-source High-dimensional Data, Integrative Analysis, Network Structure, Logistic Model

Related items

1	Construction And Application Of LASSO Logistic Model For Fat (High) Big Data
2	Analysis And Design Of High Dimensional Data Visualization Structure Model
3	The Research And Development Of Multi-Physics Coupling Algorithm For Three-Dimensional High-Density Integrated System
4	High-dimensional Covariance Learning
5	Multi-source Sensor Data Fusion And Its Applications In The Target Detection
6	The Research On Web Structure Mining And High Dimensional Data Mining
7	Inference of Low-Dimensional Latent Structure in High-Dimensional Data
8	Correlation Analysis And Network Construction For High Dimensional Data
9	The Construction Of Breast Cancer Integrative Data Analysis Platform And The Identification Of Molecular Markers
10	Design And Implementation Of Organization Structure Reasoning System Based On Multi Source Data