| With the data wave sweeping the world, emerging data problems pose great challenges to traditional data mining techniques. The first is the diversity of data sources. For example, electronic medical record data for the same disease may come from different hospitals, and real estate transaction data may come from different communities. Although these data are collected for the same or similar tasks, differences in measurement environment, measurement standards, or statistical definitions mean that heterogeneity among datasets cannot be ignored. The second is the high dimensionality and sparsity of the data. The value density of big data is low, because any piece of information that might help explanation is collected. The information therefore needs to be purified and denoised during modeling. To address the limitations of existing research on modeling complex multi-source high-dimensional datasets, this paper builds on previous methods and proposes a new idea. For the problem of heterogeneity, we use the relevant information between data sources to reduce the complexity of the integration task; at the same time, to handle high-dimensional and sparse data, we combine variable selection methods with integrative analysis methods. Based on this, this paper proposes an snMCP (sparse network Minimax Concave Penalty)-Logistic model that integrates multi-source high-dimensional data, uses the K-nearest neighbor method to build a network structure between data sources, and imposes a network MCP penalty on the model coefficients of datasets with network connections. To automatically identify homogeneous and heterogeneous data, an MCP penalty is used to select the important variables of each dataset, which realizes model estimation and clustering of the datasets simultaneously. Because the objective function is large and complex, this paper derives the corresponding ADMM (Alternating Direction Method of Multipliers) algorithm for the optimization problem. To evaluate the strengths and weaknesses of the snMCP-Logistic model, this paper designs three typical numerical experiments. The simulation experiments show that the proposed method performs well in feature selection, parameter estimation, and classification prediction accuracy under different simulation settings. For the empirical analysis, we use heterogeneous real estate rental evaluation data from 397 sources, where latitude and longitude information is used to construct the network structure between data sources. The empirical results show that the proposed method can effectively exploit the heterogeneity generated by regional factors and improve the classification ability of the model. It obtains the highest AUC among all models compared and also provides the set of important variables for each regional location. This improves the interpretability of the model, makes future data collection more targeted, and demonstrates the good performance of the proposed method in practical applications. |
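
The two building blocks named in the abstract, the K-nearest neighbor network between data sources and the MCP penalty, can be illustrated with a minimal sketch. The MCP formula below is the standard one; everything else is an assumption made for illustration and is not taken from the paper: the function names `mcp_penalty`, `knn_edges`, and `network_mcp` are hypothetical, plain Euclidean distance on the coordinates stands in for whatever distance the paper uses, and the network term is assumed to penalize the Euclidean norm of the coefficient difference between connected sources.

```python
import math


def mcp_penalty(t, lam, gamma):
    """Standard minimax concave penalty of a scalar t (lam > 0, gamma > 1)."""
    a = abs(t)
    if a <= gamma * lam:
        # Concave region: lasso-like near zero, with shrinkage that tapers off.
        return lam * a - a * a / (2.0 * gamma)
    # Flat region: large values receive a constant penalty.
    return 0.5 * gamma * lam * lam


def knn_edges(coords, k):
    """Undirected edges linking each source to its k nearest neighbours.

    Assumption: Euclidean distance on the raw coordinates.
    """
    edges = set()
    for i, p in enumerate(coords):
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(coords) if j != i
        )
        for _, j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


def network_mcp(betas, edges, lam, gamma):
    """Assumed network term: MCP of the coefficient-difference norm per edge."""
    return sum(
        mcp_penalty(math.dist(betas[i], betas[j]), lam, gamma)
        for i, j in edges
    )


# Toy example: four sources forming two spatial clusters.
coords = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
betas = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.5]]
edges = knn_edges(coords, k=1)
penalty = network_mcp(betas, edges, lam=1.0, gamma=3.0)
```

In this toy setting, connected sources with identical coefficients add nothing to the network term, which is the mechanism by which a penalty of this kind can group homogeneous datasets while leaving heterogeneous ones separate.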