Soybean is one of the world's important crops, and its quality and origin matter to consumers and the trade market. It is therefore necessary to establish an accurate and reliable method for identifying soybean origin. How to find biomarkers, and how to screen and analyze multiple data sources, are the pressing problems in soybean origin identification. In this study, we collected 108 samples from the main soybean-producing areas of Heilongjiang and Liaoning provinces and obtained metabolomic and transcriptomic data using non-targeted liquid chromatography-mass spectrometry (LC-MS) and Illumina sequencing, respectively. To tackle the curse of dimensionality in metabolomics data analysis, we combined several feature selection methods with multi-omics integration analysis methods to screen out the optimal feature subset as stable and accurate biomarkers for soybean origin identification. Based on linear logistic regression and four feature selection algorithms, L1-regularized logistic regression (L1-LR), Recursive Feature Elimination (RFE), Incremental Feature Selection (IFS), and Sequential Backward Selection (SBS), we established a model to select the optimal feature subset. We also combined the Correlation Bias Reduction (CBR) strategy with the LR-RFE method to optimize the selected model, which enhanced the reliability of these features as biomarkers by reducing bias and better accounting for the correlation between them. By comparing the classification accuracy and the number of features of the models, we aimed to provide reliable feature selection methods and a basis for biomarker selection in soybean origin identification. This will help improve the accuracy and reliability of soybean quality identification and meet the needs of consumers and the trade market. The main conclusions of the study are as follows: (1) The
feature selection method based on linear logistic regression, together with the multi-omics integration analysis strategy, can be applied to identifying the origin of soybeans from Heilongjiang and Liaoning provinces. Compared with single-omics data, the multi-omics integrated analysis strategy achieves higher accuracy, and the classification performance of the interim integration analysis method is slightly higher than that of the preliminary integration analysis method. (2) The LR-RFE + CBR algorithm effectively reduces the model's vulnerability to possible correlations between features during analysis. After optimization with LR-RFE + CBR, classification performance improved significantly both for models built on single-omics data and for models using the multi-omics integration strategy; the model based on the interim integration analysis method achieved the highest accuracy, 99.83%. (3) The linear combined feature selection method based on logistic regression shows high model performance compared with the filter feature selection method. The combined feature selection method reached at least 0.97% in model performance, whether on single-omics data or on integrated multi-omics data. Compared with the wrapper feature selection method, the combined feature selection method significantly reduces the number of features selected by the L1-LR feature selection method. This helps to improve model performance and shows that backward feature selection methods can effectively remove redundant features from soybean omics data. Therefore, linear combined feature selection methods based on logistic regression have great potential in the analysis of soybean omics data. (4) Pathway analysis was used to verify the optimal feature subset selected by model optimization combined with the interim integration analysis algorithm. This optimal subset contains 33 transcriptomic features and 12 metabolic features. Through the
pathway analysis approach, we can conclude that these features are clearly associated with each other. Therefore, these features can be used as biomarkers to distinguish between soybeans from Heilongjiang and Liaoning provinces. This work has important scientific significance and practical value for the study of soybean biomarkers of different origins.
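To make the LR-RFE idea concrete, the following is a minimal sketch of recursive feature elimination driven by a logistic regression model. It is not the pipeline used in the study: the data are synthetic, the function names (`make_toy_data`, `fit_logreg`, `lr_rfe`) are hypothetical, plain gradient descent stands in for a library solver, features are assumed to be on comparable scales, and the CBR step is not implemented.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_data(n_per_class=30, n_features=8, seed=0):
    """Synthetic two-class data: only columns 0 and 1 carry class signal (assumption)."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        shift = 3.0 if label == 1 else -3.0
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            row[0] += shift
            row[1] += shift
            X.append(row)
            y.append(label)
    return X, y


def fit_logreg(X, y, cols, lr=0.1, epochs=200, l2=0.01):
    """Fit L2-penalized logistic regression on the chosen columns by batch gradient descent."""
    w = [0.0] * len(cols)
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(cols)
        grad_b = 0.0
        for row, target in zip(X, y):
            z = b + sum(wj * row[c] for wj, c in zip(w, cols))
            err = sigmoid(z) - target
            grad_b += err
            for j, c in enumerate(cols):
                grad_w[j] += err * row[c]
        w = [wj - lr * (g / n + l2 * wj) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def lr_rfe(X, y, n_keep):
    """Refit the model and drop the feature with the smallest |weight| until n_keep remain."""
    cols = list(range(len(X[0])))
    while len(cols) > n_keep:
        w, _ = fit_logreg(X, y, cols)
        weakest = min(range(len(cols)), key=lambda j: abs(w[j]))
        cols.pop(weakest)
    return cols


X, y = make_toy_data()
selected = lr_rfe(X, y, n_keep=2)
print("retained feature indices:", selected)
```

On this toy data the two shifted columns are the ones expected to survive elimination. In practice a library implementation (for example scikit-learn's `RFE` wrapped around `LogisticRegression`) would normally replace the hand-written solver, and the number of retained features would be chosen by cross-validated accuracy rather than fixed in advance.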