| In recent years, advances in machine learning techniques have facilitated the virtual screening of organic solar cell (OSC) materials. Currently, property descriptors of donor and acceptor (D/A) molecules are used to predict power conversion efficiency (PCE). Although relatively good results can be obtained, acquiring and screening property descriptors inevitably increases workload and cost. To improve machine learning performance, a D/A pairwise molecular embedding representation is applied to predict the PCE of OSCs. In this study, two kinds of structural descriptors, molecular fingerprints and SMILES, are considered. The models that take molecular fingerprints as the original inputs consider the fingerprints of D/A molecular pairs in tandem and parallel modes, obtain a substructure-level embedding representation of each molecular pair, and construct regression prediction models based on a convolutional neural network (CNN). The models that take molecular SMILES as the original inputs extend the molecular embedding method based on paired molecular fingerprints to a pre-training task. Pre-training is carried out on three related datasets to obtain word lists containing frequent substructures; the pairwise molecular embedding representation is then generated and the regression prediction models are constructed. In addition, this study also covers dataset partitioning, property descriptor screening, and other related work. Experimental results show that the correlation coefficient (r) of the best regression prediction model based on the substructure-level embedding representation of molecular pairs generated in this study reaches 0.89, which is significantly better than the baseline models in all aspects and is competitive with the strong models reported in recent related studies. On this basis, the comparison results show that the embedding can represent molecules better than property
descriptors, since it is more comprehensive than manually designed descriptions. Our experiments also demonstrate that D/A pairwise molecular embeddings are more effective inputs than the embedding of a single D or A molecule, probably because interaction information is extracted from the molecular pairs. The soundness of the molecular embedding is further supported by visualization. The t-SNE visualization shows that the dimension-reduced embedding representation clusters the target PCE values very well, and that in the embedding space substructure similarity agrees well with fingerprint embedding distance. This study provides a new idea for virtual screening, demonstrates that it is feasible to construct the models using only the structural descriptors of material molecules, and shows the advantages of the proposed model and the pairwise embedding generation algorithm. The approach can shorten the virtual screening cycle in practical applications and facilitate the screening of high-performance OSC materials. |