Application Of Big Data And Artificial Intelligence Approaches For Quantitative Nanostructure-Activity Relationship Modeling

Posted on: 2021-08-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X L Yan
GTID: 1521306020960669
Subject: Analytical Chemistry

Abstract/Summary:
Nanomaterials are widely used in fields such as biomedicine, environmental science, chemical engineering, energy, and transportation because of the unique physical and chemical properties that arise from their size effects. According to statistics, the global nanomaterials-related market had reached 39.2 billion U.S. dollars by 2016 and was expected to reach 90.5 billion U.S. dollars by 2021. The use of nanomaterials in everyday consumer goods is also increasing: thousands of consumer nano products, such as cosmetics and personal care products, have entered daily life. The nanoparticles contained in these products may be released into the atmosphere or water, accumulate in the environment, and become more likely to enter the human body. The behavior and biological activities of typical nanomaterials have therefore attracted growing research attention. Nanotoxicology in particular has received increasing attention since 2005: as of March 2020, a Web of Science search with the keywords "nano*" and "toxic*" returned 99,571 publications. Studies of the factors that affect nanomaterial toxicity are likewise numerous. However, because of limitations in data collection methods and big data analysis technology, researchers cannot effectively extract useful information from the large body of data on nano-biological activities.

In recent years, artificial intelligence (AI), typified by machine learning and deep learning, has achieved fruitful results in materials discovery, drug design, and medical diagnosis. Classical machine learning methods (e.g., k-nearest neighbor, random forest, and support vector machine) and popular deep learning methods (e.g., convolutional neural network, recurrent neural network, and generative adversarial network) give AI systems a powerful learning ability, which enables them to mine useful information from data quickly and effectively to accomplish one or multiple tasks. For example, in the
medical field, AI has shown great potential in lesion detection and classification, image segmentation, motion tracking, radiation dose prediction, and radiotherapy outcome prediction. Because the power of AI comes from training on massive data, it also has broad application prospects in nano research, especially in the in-depth interpretation and intelligent prediction of nano-biological activities.

However, current AI technology faces two major bottlenecks when predicting nano-biological activities by constructing a quantitative nanostructure-activity relationship (QNAR) model:

(1) Lack of suitable descriptors to characterize nanostructures. Current nanodescriptors can be divided into experimental, structural, and empirical descriptors, each of which has deficiencies and limitations when applied to QNAR modeling. Experimental descriptors, such as nanomaterial size, Zeta potential, and protein corona fingerprints, have poor reproducibility, and different laboratories often obtain different results. More importantly, experimental descriptors cannot be acquired for nanomaterials that have not yet been synthesized, so they cannot be used for virtual screening or the design of new nanomaterials. Structural descriptors are usually calculated from only the surface ligands or the core components, which are converted into SMILES (for example, the zinc oxide nanomaterial is converted to O=[Zn]). However, surface ligands or core components alone cannot represent the diversity of nanostructures. Empirical descriptors, which usually involve quantum chemistry or molecular simulation, require substantial computational resources and are therefore unsuitable for nanomaterials containing tens of thousands of atoms.

(2) Lack of suitable nanomaterial databases that can be used directly for machine learning. The nanomaterial datasets currently used for machine learning
usually contain only a few biological activity data points (fewer than 40) for one material type. QNAR models based on such data usually have poor predictive power for new materials. In addition, the few available nanomaterial databases only extract text information (e.g., physicochemical properties and biological activities) directly from publications and ignore nanostructure digitalization. The electronic files that store nanostructure information are the "bridge" connecting experimental data and computational modeling. With these files, we can perform visual analysis, descriptor calculation, molecular simulation, and quantum chemistry calculation.

To address these two bottlenecks in applying AI to QNAR modeling, this research comprises three parts:

Ⅰ In silico profiling nanoparticles: predictive nanomodeling using universal nanodescriptors and various machine learning approaches

To address the shortcomings of existing nanodescriptors, we developed a novel nanodescriptor based on Delaunay tessellation and atomic electronegativity. Delaunay tessellation is an important region reconstruction technique characterized by "closest to regular" triangulation and "uniqueness". It can be used to divide complex three-dimensional nanostructures into simple tetrahedral fragments. Electronegativity reflects many important properties of atoms, such as polarity, energy, and the ability to form hydrogen bonds. In this study, we first constructed a virtual nanostructure, that is, we stored the nanostructure information in a PDB file. The atoms in the virtual nanostructure were then divided into six categories: C (carbon), N (nitrogen), O (oxygen), S (sulfur), M (metal atoms), and X (phosphorus and halogen atoms). Next, based on the Delaunay tessellation, every set of four nearest neighboring atoms (e.g., CCCC and CCOO) that can form a tetrahedron was identified from the virtual nanostructure. In previous studies, a cutoff of
8 Å was employed in this calculation based on the long-range electrostatic and van der Waals interactions in molecular dynamics simulations. A tetrahedron was excluded when the distance between any two of its atoms exceeded 8 Å. Based on our definition of atom types, there are a total of 126 nanodescriptors when atom order is not taken into account. The value of each descriptor for a given nanomaterial is the product of the number of tetrahedra of the corresponding type in that nanomaterial and the sum of the electronegativities of the four atoms in the tetrahedron. The resulting nanodescriptors not only characterize the overall nanostructure (e.g., material type, size, ligand position, and ligand density) but can also be calculated quickly and in high throughput. For a 5 nm gold nanoparticle with more than 26,000 atoms, the calculation can be completed in 10 seconds on a personal desktop, and descriptors for more than 1,000 nanomaterials can be processed in batches.

To test the effectiveness of the novel nanodescriptors, we collected seven datasets containing 191 gold nanoparticles, covering two physicochemical properties (logP and Zeta potential) and three nano-biological activities (cellular uptake, oxidative stress level, and AChE enzyme activity). Random forest (RF) and k-nearest neighbor (kNN) machine learning methods were used to build QNAR models to predict these physicochemical properties and nano-biological activities. In both fivefold cross-validation and external validation, all models achieved high prediction accuracy (R² > 0.68). In addition, the nanodescriptors have clear physical meanings, which can support mechanism analysis alongside experimental results. For example, we found that CCCC nanodescriptors, which consist of four carbon atoms, are the most frequently used in the models for logP and cellular uptake prediction. That is, the carbon skeleton structures represented by CCCC nanodescriptors play the most important role in the two
properties.

Ⅱ Construction of a web-based nanomaterial database by big data curation and modeling-friendly nanostructure annotations

To make up for the shortcomings of existing nanomaterial databases, we built the world's first nanomaterial database based on the digitalization of nanostructures. The database currently contains over 700 nanostructure electronic files (PDB files) covering 12 material types (e.g., gold nanoparticles, silver nanoparticles, metal oxide nanoparticles, and carbon nanotubes), as well as more than 1,300 physicochemical data points and more than 2,300 nano-biological activity data points. The experimental data include in-house data and external data collected from previously published papers. The in-house data were mainly generated in our laboratory through combinatorial chemistry methods over the past ten years, and the external data were obtained by screening nearly 1,000 publications. All experimental data and PDB files are stored on the web portal (http://www.pubvinas.com/), where researchers can register and log in to download and use the nanomaterial data. The database also allows researchers to upload new data, which keeps it dynamically updated.

The nanostructure digitalization, that is, the construction of virtual nanomaterials, is similar to that in Part Ⅰ, but more material types were added in this study. Each PDB file consists of three main parts: one stores basic information about the nanostructure (e.g., material type, nanomaterial size, and number of surface ligands), one stores atom types and atomic coordinates, and the last stores atom connectivity. With these PDB files, we can directly visualize the nanostructures, calculate nanodescriptors, and perform molecular dynamics simulations and even quantum chemistry calculations. In this study, different nanomaterials are shown at a uniform scale, so that we can intuitively visualize the nanostructure
differences in terms of material type, size, shape, and surface ligands. In addition, different surface chemistries of the nanostructures are rendered in different colours. For example, the nanoparticle PdNP12 (logP = 2.52), with hydrophobic surface ligands, is shown in cyan, while the nanoparticle PtNP8 (logP = -1.47), with hydrophilic surface ligands, is rendered in purple. Other structural details can also be observed; for example, the long surface ligand chains on GNP164 appear as tentacles. These detailed 3D plots of the nanomaterials in the database, which are generated from the PDB files, give a direct impression of the relevant surface chemistry and physicochemical properties.

In addition, the database contains more than ten different nanomaterial endpoints, all of which span a wide range of values. These rich data on physicochemical properties and biological activities lay the foundation for machine learning. In this study, we further improved the nanodescriptors from Part Ⅰ to make them suitable for nanomaterials other than gold nanoparticles. The improvements include converting the number of tetrahedra to percentages, which eliminates the large differences in descriptor values caused by nanomaterial size, and adding more elemental properties besides electronegativity, which enables the descriptors to distinguish different nanomaterials. With these improvements, we can distinguish almost all nanostructures in the database. By calculating the Euclidean distance, we can quantitatively analyze the similarity between all nanostructures. Using classical machine learning methods and deep learning approaches, we accurately predicted the physicochemical properties and biological activities of various nanomaterials.

Ⅲ A preliminary study on the mechanism analysis of PM2.5 cytotoxicity by virtual carbon nanoparticle library and machine learning

Exposure to PM2.5 air pollution causes critical adverse health outcomes, including ischemic
heart disease, stroke, chronic obstructive pulmonary disease, respiratory infections, and even lung cancer. According to a Global Burden of Disease (GBD) study in 2016, exposure to PM2.5 was the sixth leading contributor to early death globally and resulted in approximately 4.1 million deaths worldwide in 2016. There is an urgent need to identify the key toxic components of PM2.5 and understand the associated toxicity mechanisms. However, PM2.5 is a complex mixture of different components, including hundreds of organic, inorganic, and biological pollutants. Furthermore, the composition of PM2.5 is both time- and region-dependent, which makes experimental mechanism studies extremely difficult. In a recent study, we reported a reductionist approach in which we synthesized a model PM2.5 library of 20 carbon nanoparticles that adsorbed all possible combinations of representative toxic pollutants, including Cr₂O₇²⁻, Pb²⁺, As₂O₃, and BaP. The model PM2.5 library was then tested in different assays for inflammatory effects. However, given the complex composition of PM2.5, it is impossible to synthesize every combination one by one and test its biological toxicity. Therefore, in this part, we use the novel nanodescriptors and nanostructure digitalization method developed in Chapters 2 and 3, together with machine learning, to assist experiments in predicting the cytotoxicity caused by PM2.5 and to discuss the potential toxicity mechanism.

In this part, we constructed five datasets. Four of them describe four different inflammatory responses in the 16HBE cell line and were collected from our previously published paper; the fifth is PM2.5 cytotoxicity, measured as the EC50 in the 16HBE cell line. The nanomaterials used in this study are 20 carbon nanoparticles (i.e., model PM2.5 nanoparticles) with a particle size of 40 nm that adsorb all possible combinations of the toxic pollutants. The amount of adsorbed pollutants is of the same order of magnitude as that of actual
PM2.5 (PM2.5-JN) collected from Jinan, Shandong Province. Physicochemical properties such as Zeta potential and hydrodynamic diameter are also similar to those of PM2.5-JN, which indicates that the model PM2.5 nanoparticles can simulate actual PM2.5 samples. Based on the experimental data for the carbon nanoparticles, and using the nanostructure digitalization method described in Chapters 2 and 3, we constructed 20 corresponding virtual carbon nanoparticles. From these virtual carbon nanostructures, we calculated 126 nanodescriptors that quantitatively characterize the carbon nanostructures, and we found that these nanodescriptors can distinguish all the model PM2.5 nanoparticles. The machine learning models built with the random forest and k-nearest neighbor methods make accurate predictions for the four inflammatory responses and the EC50 values: the fitting coefficients between predicted and experimental values are above 0.65. More importantly, by analyzing the model results, we explored the possible toxicity mechanism of PM2.5 and found that Pb²⁺ is the key factor causing the inflammatory response and Cr(Ⅵ) is an important factor causing cell death. This dataset has also been added to the database described in Chapter 3.

In summary, through the three parts of this research, we constructed a novel nanodescriptor suitable for a variety of nanomaterials and a nanomaterial database containing electronic nanostructure files that can be used directly for machine learning. Furthermore, the novel nanodescriptors and nanostructure digitalization method were applied to PM2.5 toxicity prediction and potential mechanism analysis. This work can help artificial intelligence be better applied in nano research to guide the safety assessment and design of nanomaterials.
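
As an illustration of the descriptor definition in Part Ⅰ, the following minimal Python sketch enumerates the 126 descriptor types (all unordered four-atom combinations of the six atom classes) and computes each descriptor as the tetrahedron count multiplied by the sum of the four electronegativities. It is a sketch under stated assumptions, not the dissertation's implementation. The function name `nanodescriptors`, the input layout, and the single electronegativity value per class (in particular the placeholder values for M and X) are hypothetical. The tetrahedra are assumed to come from an existing Delaunay routine, for example the `simplices` attribute of `scipy.spatial.Delaunay`.

```python
from collections import Counter
from itertools import combinations, combinations_with_replacement
from math import dist

# The six atom classes named in the abstract.
ATOM_CLASSES = ("C", "N", "O", "S", "M", "X")

# Assumption: one Pauling-scale value per class. The M and X entries are
# placeholders (gold and chlorine); the abstract does not state how mixed
# classes are assigned a value.
ELECTRONEGATIVITY = {"C": 2.55, "N": 3.04, "O": 3.44, "S": 2.58, "M": 2.54, "X": 3.16}

CUTOFF = 8.0  # distance cutoff in angstroms, as stated in the abstract

# Every unordered combination of four atom classes, e.g. "CCCC", "CCOO".
DESCRIPTOR_TYPES = ["".join(t) for t in combinations_with_replacement(ATOM_CLASSES, 4)]


def nanodescriptors(coords, classes, tetrahedra, cutoff=CUTOFF):
    """Return {descriptor type: tetrahedron count * sum of electronegativities}.

    coords     : list of (x, y, z) tuples in angstroms
    classes    : atom class per atom, each one of ATOM_CLASSES
    tetrahedra : iterable of 4-tuples of atom indices from a Delaunay tessellation
    """
    counts = Counter()
    for tet in tetrahedra:
        # Skip tetrahedra in which any pair of atoms is farther apart than the cutoff.
        if any(dist(coords[i], coords[j]) > cutoff for i, j in combinations(tet, 2)):
            continue
        # Canonical, order-independent key for the four atom classes.
        key = "".join(sorted((classes[i] for i in tet), key=ATOM_CLASSES.index))
        counts[key] += 1
    return {
        t: counts[t] * sum(ELECTRONEGATIVITY[c] for c in t)
        for t in DESCRIPTOR_TYPES
    }


# Toy input: four nearby atoms plus one metal atom 20 angstroms away.
coords = [(0, 0, 0), (1.5, 0, 0), (0, 1.5, 0), (0, 0, 1.5), (20, 0, 0)]
classes = ["C", "O", "C", "O", "M"]
tetrahedra = [(0, 1, 2, 3), (0, 1, 2, 4)]  # the second one violates the cutoff
desc = nanodescriptors(coords, classes, tetrahedra)
print(len(desc), round(desc["CCOO"], 2))
```

In the toy input, only the first tetrahedron survives the cutoff, so a single CCOO tetrahedron contributes to the descriptor vector and every other entry is zero. Converting the counts to percentages, as described in Part Ⅱ, would be a small change to the final dictionary comprehension.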
Keywords/Search Tags:Artificial intelligence, Big data, Machine learning, Deep learning, Quantitative nanostructure activity relationship