Objective: To clarify the relationship between a formula and its efficacy, and to further explore the internal rules of formula combination, we propose an improved deep learning model to predict formula efficacy. We also construct a multi-label classification model combined with ensemble learning to study the formula-efficacy relationship. This work provides a basis for syndrome differentiation, treatment, and prescription in clinical traditional Chinese medicine (TCM), and offers a new way to analyze the basic theory of TCM.

Methods: First, natural language processing (NLP) is used to learn a quantitative representation of different TCM herbs. Herb properties and herb efficacy are selected to encode the herbs, yielding herb-vectors and formula-vectors that carry the characteristic information of TCM herbs and provide uniformly coded data for deep learning models. Then, an improved deep learning model, Text BLCNN, which consists of a bidirectional Long Short-Term Memory (Bi-LSTM) neural network and a Convolutional Neural Network (CNN), is proposed to classify TCM formulae. To address the class imbalance in the formula data, the over-sampling method SMOTE is applied. Finally, Binary Relevance (BR), Label Powerset (LP), and Classifier Chain (CC) are each used to combine multiple binary classifiers, such as Random Forest and AdaBoost, into the final multi-label classification model for analyzing the formula-efficacy relationship.

Results: The experimental results show that the formula-vector composed of herb efficacy gives the best classification performance. The Text BLCNN model achieves an accuracy of 0.858 and an F1-score of 0.762, both higher than those of the Logistic Regression, SVM, LSTM, and Text CNN models. Applying SMOTE to tackle data imbalance improves the F1-score substantially, by an average of 47.1%. These results show the superiority of the proposed model. Among the multi-label classification models, LP performs better overall than BR and CC. Ensemble learners used as the binary classifiers of a multi-label model outperform classic binary classifiers such as Decision Tree, Linear Discriminant Analysis, K-Nearest Neighbor, Naive Bayes, and Multi-Layer Perceptron.

Conclusion: The formula feature representation, combined with the Text BLCNN model and the multi-label classification model, can improve the accuracy of formula efficacy classification and provides a new research approach for the study of TCM formula compatibility. In future work, the drug dosage in the prescription will be considered and the classification model will be further optimized. In addition, bioinformatics techniques can be used for formula analysis to uncover the modern scientific basis of formula compatibility.
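
To make the difference between the three multi-label strategies named in the Methods concrete, the following minimal sketch shows how Binary Relevance, Label Powerset, and Classifier Chain each reshape the same label matrix before any binary or multi-class classifier is trained. It is an illustration only, not the authors' implementation: the efficacy label names, the toy formula-vectors, and all function names are hypothetical, and no classifier (Random Forest, AdaBoost, or otherwise) is actually fitted here.

```python
# Hypothetical toy data (assumption, not from the study): each row of Y is the
# efficacy label set of one formula as a binary indicator vector over three
# made-up efficacy labels; each row of X is a toy two-dimensional formula-vector.
LABELS = ["clear_heat", "tonify_qi", "resolve_phlegm"]
X = [[0.2, 0.1], [0.9, 0.4], [0.3, 0.8], [0.7, 0.5]]
Y = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
]


def binary_relevance_targets(Y):
    """BR: one independent binary target column per label."""
    n_labels = len(Y[0])
    return [[row[j] for row in Y] for j in range(n_labels)]


def label_powerset_targets(Y):
    """LP: each distinct label combination becomes one multi-class id."""
    class_of = {}
    targets = []
    for row in Y:
        key = tuple(row)
        if key not in class_of:
            class_of[key] = len(class_of)  # ids assigned in order of first appearance
        targets.append(class_of[key])
    return targets, class_of


def classifier_chain_inputs(X, Y):
    """CC: the j-th classifier sees the features plus labels 0..j-1.

    True labels are appended here, as is usual at training time; at
    prediction time the earlier classifiers' predictions take their place.
    """
    n_labels = len(Y[0])
    chain = []
    for j in range(n_labels):
        X_j = [x + y[:j] for x, y in zip(X, Y)]
        y_j = [y[j] for y in Y]
        chain.append((X_j, y_j))
    return chain


br = binary_relevance_targets(Y)
lp_targets, lp_classes = label_powerset_targets(Y)
cc = classifier_chain_inputs(X, Y)
```

In this sketch, BR yields three separate binary problems and ignores label co-occurrence, LP yields a single multi-class problem whose classes are the observed label combinations, and CC yields three binary problems whose inputs grow by one label at each link of the chain. Any binary classifier, including the ensemble learners mentioned above, could then be trained on each transformed problem.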