| Geogenic contaminated groundwaters (GCGs), such as groundwater high in arsenic, fluoride, iodine, or iron, pose a global public health threat. Rapidly identifying the spatial distribution of GCGs and resolving their genesis are prerequisites for developing effective policies to eliminate these health threats. The inherent invisibility of groundwater restricts the acquisition of data at high spatial and temporal resolution, which, together with data protection policies, makes groundwater quality data scarce. Existing spatial prediction methods, such as geostatistical interpolation and reactive transport modeling, cannot accurately portray the spatial distribution of GCGs from sparse groundwater quality data. The accumulation of harmful components in groundwater results from geological, meteorological, hydrochemical, and microbiological environmental factors acting individually (main effects) and jointly (interaction effects). Statistical methods used in traditional hydrogeological and hydrochemical studies, such as bivariate correlation, cannot handle the global, non-linear, and non-additive relationships between environmental factors and hazardous components that these main and interaction effects involve. The rapid growth of machine learning (ML) offers opportunities for predicting the spatial distribution of GCGs and for gaining deeper insight into their genesis. This thesis therefore first evaluates how well ML models of GCGs predict in regions of the study area with no or few samples. Based on the evaluation results, two GCG prediction methods applicable to no/few-sample conditions were established: (1) a model-level method, Siamese network transfer learning (SNTL), and (2) a sample synthesis method, the local distribution dependent synthesis algorithm (LDS). SNTL fuses a Siamese neural network with transfer learning. The two-branch structure of the Siamese neural network expands the number of
groundwater samples and eliminates class imbalance. Transfer learning lets GCG predictive models take full advantage of pre-trained models, greatly reducing both the number of groundwater samples required for modeling and the difficulty of training. Building on this modeling strategy, a neural-network-based interaction detection method was proposed that identifies, separates, and visualizes main effects and interaction effects. The method consists of four steps: (1) building a predictive model using all environmental factors to identify the important ones; (2) building a neural network model for interaction mining using the important factors; (3) identifying potential interaction terms with a neural interaction detection algorithm; and (4) visualizing interaction effects using two-dimensional accumulated local effects (ALE) plots. The main findings are as follows. (1) When the study area includes multiple hydrogeological units, the GCG predictive model cannot accurately predict the GCG distribution in units with no or few samples. Treating test data sampled from the known data as unknown data greatly exaggerates the model's predictive power in unknown units. (2) When only a small amount of groundwater quality data is available, SNTL predicts significantly better than the commonly used random forest model. The similarity in the genesis of different types of GCGs allows SNTL to be used to predict different types of GCGs. (3) Sample synthesis can effectively improve the performance of GCG prediction models. Random oversampling, Adaptive Synthetic Sampling (ADASYN), and the LDS algorithm, first used in this thesis, can improve the model G-Mean by more than 10%, with LDS performing best. (4) Interaction effects among environmental factors significantly affect the formation of GCGs. For the local system, i.e., the Datong Basin, interaction effects contribute 7.5% to the performance of the
high-arsenic groundwater model; at the global scale, interaction effects contribute 9.7% to the performance of the high-arsenic groundwater model. Regardless of spatial scale, true interaction terms account for only 5% of all possible interactions. Second-order interactions are the dominant interaction level, and all higher-order interactions are associated with lower-order terms. These interactions are consistent with the principles of interaction sparsity, interaction hierarchy, and interaction heritability, which supports the reliability of the neural interaction detection framework. In summary, the innovations of this study are (1) the proposed SNTL and LDS methods for accurately predicting GCGs at different scales and sample sizes, and (2) the proposed neural network method, grounded in the GCG prediction model and hydrogeochemical mechanisms, for identifying the main effects and interaction effects that control the formation of GCGs. The machine learning framework proposed in this thesis provides a data-driven perspective for analyzing critical groundwater quality issues and is expected to be a powerful tool for supporting public health management decisions and groundwater research. |
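
The abstract reports gains in G-Mean from sample synthesis. As a minimal sketch, assuming binary labels (1 = contaminated sample, 0 = uncontaminated) and the common definition of G-Mean as the geometric mean of sensitivity and specificity, the following illustrates why this metric suits imbalanced groundwater data and what plain random oversampling does. The function names `g_mean` and `random_oversample` are illustrative, and the sketch does not reproduce ADASYN or the LDS algorithm.

```python
import math
import random


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    minority, minority_label = (pos, 1) if len(pos) < len(neg) else (neg, 0)
    deficit = abs(len(pos) - len(neg))
    if deficit == 0 or not minority:
        # Already balanced, or nothing to duplicate: return unchanged copies.
        return list(samples), list(labels)
    extra = [rng.choice(minority) for _ in range(deficit)]
    return list(samples) + extra, list(labels) + [minority_label] * deficit


# Hypothetical toy data: 2 contaminated wells out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_clean = [0] * 10  # a model that never predicts contamination

accuracy = sum(t == p for t, p in zip(y_true, always_clean)) / len(y_true)
print(accuracy)                      # high accuracy despite missing every contaminated well
print(g_mean(y_true, always_clean))  # G-Mean exposes the failure

X_balanced, y_balanced = random_oversample(list(range(10)), y_true)
print(y_balanced.count(1), y_balanced.count(0))
```

In this toy case a model that labels every well as uncontaminated scores well on accuracy but gets a G-Mean of zero, which is why a class-balanced metric is the more informative target when contaminated samples are rare.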