| Small molecules are the material basis of most inorganic and organic substances in the natural world; they are both widely distributed and structurally diverse. The study of small-molecule compounds matters not only for industrial manufacturing and daily life, but also for understanding the onset and progression of many metabolism-related diseases and for drug development. The PubChem database contains more than 1 billion small-molecule records. This structural diversity and sheer volume underlie the functional diversity of small-molecule compounds, but they also hinder the study of their structure and function: storing, querying, and classifying small molecules all become difficult. A better representation method is the basis for solving and optimizing these tasks. Inspired by natural language processing (NLP), this thesis uses “fragment vectors” and “molecular vectors” to represent fragments and molecules, respectively. Because molecules have multiply branched structures, the relations between pairs of fragments are hard to represent as a linear sequence; this thesis presents two methods to address this problem: Tandem Fragment and Parallel Fragment. It also systematically compares different methods and hyperparameters for training fragment vectors. To classify small molecules and to evaluate the representational ability of molecular vectors in a downstream task, this thesis selects 9 molecular descriptors (MDs) and uses a multiple regression model based on a multilayer perceptron (MLP) to predict the value of each MD. It then compares three methods of generating molecular vectors under the same basic parameters: Mol2vec, Tandem Fragment, and Parallel Fragment. Tandem Fragment with subword embedding gives the best performance in predicting MD values on classes of non-ring molecules that do not appear in the training set (it achieves the highest accuracy on 10 of 12 classes). Finally, because some subclasses still contain too many small molecules after MD-based classification, this thesis proposes a new workflow to address this problem. As an example, a class that does not appear in the training set is selected. First, the molecular vectors of this class are reduced to two dimensions by t-SNE; the resulting 2D vectors are then clustered by DBSCAN. A small number of compounds in each cluster can then be labeled using expert knowledge to determine the true class. This workflow provides a new strategy for obtaining smaller clusters based on molecular vectors. To summarize, this thesis makes 4 key contributions: (1) it improves the tree decomposition published by Wengong Jin et al., yielding more fragments and faster computation; (2) it presents a new method, Tandem Fragment, for arranging fragments into a linear sequence to form a molecular sentence, which is then used to train fragment vectors with subword embedding and which performs better on the non-ring molecule classification task; (3) to the best of the author's knowledge, it is the first to show analogical similarity among fragment vectors, a similarity that may also hold at the molecular level; (4) it proposes a new structure-based workflow for obtaining smaller clusters and further classifying large subclasses. |
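
The clustering step of the workflow described above (2D molecular vectors grouped by DBSCAN) can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the 2D points are hypothetical stand-ins for t-SNE output, the values of `eps` and `min_samples` are chosen only for the toy data, and the functions `region_query` and `dbscan` are illustrative names. In practice a library implementation such as scikit-learn's `DBSCAN` would likely be used instead.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)  # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:  # j is also a core point
                queue.extend(j_neighbors)
    return labels


# Hypothetical 2D coordinates standing in for t-SNE-reduced molecular vectors:
# two dense groups and one isolated point.
points_2d = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (10.0, 0.0),
]
labels = dbscan(points_2d, eps=0.3, min_samples=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]

# Group point indices by cluster so that a few members of each cluster
# can be handed to an expert for labeling.
clusters = {}
for idx, label in enumerate(labels):
    clusters.setdefault(label, []).append(idx)
```

Each key of `clusters` other than `-1` corresponds to one candidate subclass; the expert-labeling step then only needs to inspect a few compounds per key rather than the whole class.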