| The drug development pipeline is a lengthy, complex, and costly process with numerous confounding factors. Predicting molecular properties has always been a fundamental and challenging task in the early stages of drug discovery, and efficient, accurate prediction is highly valuable for rational compound design in the chemical and pharmaceutical industries. Machine learning and deep learning have shown great potential in chemistry, with accuracy increasing over time, and offer significant opportunities for drug discovery research and development. In this thesis, we propose BCSA, a fully data-driven deep learning architecture for predicting physical properties of molecules, namely water solubility (logS) and the oil-water partition coefficients (logP/logD). First, we exploit the SMILES sequence representation of molecules to build a deep learning model based on a bidirectional long short-term memory (BiLSTM) network that predicts molecular solubility. Channel Attention and Spatial Attention modules, which focus on the most relevant parts of the input by exploring the global and local properties of molecular feature vectors from the perspective of intramolecular space, are introduced to optimize the model. The training results show that introducing the two attention modules improves the prediction accuracy (R²) by 5% on both the validation set and the test set. Bayesian optimization is used during training to select the model hyperparameters that give the best performance. In addition, we enhance the generalization ability of the model through SMILES augmentation. The results clearly show that accuracy and generalizability improve, and overfitting is reduced, as the number of enumerated SMILES strings increases: the higher the augmentation factor, the better the performance. The fitted model achieves 88% accuracy on the test set after a 40-fold
augmentation. Second, we explored how graph neural network models, which have attracted much attention in the scientific community, perform in predicting water solubility. Three prediction architectures, Graph Convolutional Neural Network (GCN), Message Passing Neural Network (MPNN), and Attentive Fingerprint (AttentiveFP), were constructed on the molecular graph representation. Among these three models, which rely on the structural information of the original molecular graph, the GCN model performs best, close to the BCSA model. This shows that GCN models can learn almost any information embedded in molecular features using only relatively few atomic properties. The experimental results also show that the molecules whose water solubility is easy or hard to predict are largely the same across the different models. We then extended the predictions to other relevant molecular properties, namely the oil-water partition coefficients logP and logD (pH = 7.4), to verify the generality of the BCSA model. Encouragingly, the BCSA model appears reliable, reaching R² values of 99% and 93%, respectively, without relying on auxiliary knowledge. In comparative experiments with the three graph models (GCN, MPNN, AttentiveFP), the BCSA model still shows the best performance, followed by AttentiveFP, which demonstrates the strong generalization ability and robustness of our model. Finally, we built a prediction visualization platform so that chemists and other researchers can use the BCSA model quickly and easily; it is freely accessible at http://cadd.siat.ac.cn/molpre/. |
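
The accuracy figures quoted above appear to be coefficients of determination (R²); that reading is an assumption, since the abstract does not define the metric. Under that assumption, the following minimal sketch shows how such a score could be computed from measured and predicted property values. The function name `r_squared` and the sample logS values are hypothetical and are not taken from the thesis.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the measured values around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0.0:
        raise ValueError("R^2 is undefined when all measured values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical logS values, made up purely for illustration.
measured = [-2.0, -1.0, 0.0, 1.0]
predicted = [-1.5, -1.0, 0.0, 0.5]
score = r_squared(measured, predicted)
```

With these made-up values the residual sum of squares is 0.5 and the total sum of squares is 5.0, so the score is 0.9, which would be reported as 90% in the style used above. A perfect predictor gives exactly 1.0.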