The establishment of structured databases of traditional Chinese medicine (TCM) symptoms, syndromes, and clinical efficacy is essential for the modernization of TCM. Randomized controlled trials (RCTs) carry the latest clinical diagnosis and treatment evidence but remain unstructured. Modern TCM RCTs are widely published and cover a broad range of topics, reflecting the combination of TCM with the current scientific research paradigm. They are first-hand information in the TCM evidence system, have essential value for clinical decision-making and for evaluating the effect of treatment plans, and are a necessary part of big health data research in TCM. In-depth mining and intelligent analysis of TCM RCT literature is a crucial technology for decoding TCM and a critical path to its modernization. TCM RCT literature is generally written in natural language, and most of it is in Chinese. Manual reading, information extraction, induction, and summarization remain the main research methods. Natural language processing of Chinese sources is more complex, and there are fewer mature algorithms, especially for TCM. RCT research, however, follows a standard research paradigm and experimental design requirements, so the published RCT-related scientific literature has robust, uniform features suitable for automatic structured extraction. Structured RCT literature in turn supports further statistical analysis and evidence mining. Therefore, this study aims to develop an automated, artificial-intelligence-based mode of deep extraction for Chinese TCM RCT scientific literature and to achieve intelligent structured extraction of key information from it.

The main research contents include: (1) From the perspective of applying information extraction methods, the structure and language characteristics of
TCM RCT literature were elaborated through literature research, induction, and deduction; (2) a regular-expression-based rule extraction method was used to build a knowledge base for extracting RCT scientific literature in the field of TCM; (3) building on this, we annotated a corpus for 15 objects with poor extraction performance, combining rule extraction with machine learning, and trained BERT-CRF, BiLSTM-CRF, and BERT-LSTM-CRF models on the corpus to improve the extraction of some objects, providing a reference paradigm for comprehensive and objective extraction of object information in the field; (4) the extraction method was applied to explore application scenarios for automatically structured RCT literature and to attempt the literature screening process of systematic reviews from an automation perspective.

The results of this study are as follows: (1) With reference to existing RCT trial design specifications and 1,000 published Chinese TCM RCT papers, we summarized the structural and linguistic features of RCT scientific literature across six aspects, including title and abstract, general data, results, and conclusion, and analyzed the key features and general patterns relevant to extraction. (2) Building on these language and structural characteristics, we performed in-depth extraction of literature content. A total of 48,523 RCT articles covering 11 common disease types served as the extraction scope, and 40 extraction elements were defined at the number, phrase, sentence, and paragraph levels. We established 1,050 general rules and 3,000 lines of working code. (3) BERT-LSTM-CRF, which gave the best training result (F1 score = 0.61), was selected as the optimal model, and extraction was repeated in combination with the rule base from the preceding step. In addition, we sampled 4,849 articles for examination, and the results showed
that the average accuracy was 0.92, better than rule extraction alone. During extraction, we also found that more than 80% of the information related to study design was missing from the TCM RCT literature, which indicates that the research quality of the included TCM RCT literature was poor and that training and professional guidance for clinical researchers in trial design, data collection, analysis, and reporting need to be strengthened. (4) Exploration of application scenarios for the automatic extraction method: the automatic screening procedure was applied to the literature inclusion step of published systematic reviews and compared with manual screening. The results showed that applying the above methods to the literature screening step of systematic reviews yielded more accurate inclusion results than manual screening, indicating that automatic literature screening can include more RCT studies that meet the standards in a shorter time, reduce manual error, and improve screening efficiency.

The above research shows that, first, a rule base for TCM RCT research built on a summary of the literature's structure and language features can effectively support structural classification and fine-grained extraction of TCM RCT scientific literature. Second, combining the rule base with machine learning can significantly improve the accuracy of information extraction. Third, it is feasible to use automatic screening for systematic reviews in TCM. In future work, we will further improve the precision of TCM literature information extraction, extend the approach to deep extraction across the full range of TCM clinical scientific literature, build a structured evidence repository of TCM clinical literature, and pursue deeper mining and application of the extracted information in areas such as evolution analysis, research quality evaluation, and other in-depth exploration and discovery.
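
As an illustration of the regular-expression rule extraction described above, the following minimal Python sketch applies three rules to one sentence of the kind found in a Chinese TCM RCT abstract. The rule names, patterns, sample sentence, and the `extract_fields` helper are all hypothetical and are not taken from the study's rule base; a real rule base would need many more patterns per element.

```python
import re

# Hypothetical rules for illustration only; not the study's actual rule base.
# Each rule has exactly one capture group holding the value to extract.
RULES = {
    # Number-level element: first "<digits>例" (number of cases).
    "sample_size": re.compile(r"(\d+)\s*例"),
    # Phrase-level element: a few common randomization method names.
    "randomization_method": re.compile(r"(随机数字表法|计算机随机|抽签法)"),
    # Phrase-level element: course of treatment, e.g. "疗程8周".
    "treatment_duration": re.compile(r"疗程\s*(\d+\s*(?:天|周|个月))"),
}


def extract_fields(text):
    """Apply every rule to the text; a field with no match is reported as None."""
    record = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    return record


if __name__ == "__main__":
    # Assumed sample sentence, written for this sketch.
    sample = "将120例患者按随机数字表法分为治疗组和对照组,各60例。治疗组给予中药汤剂,疗程8周。"
    print(extract_fields(sample))
```

Reporting unmatched fields as `None` rather than dropping them keeps missing study-design information visible in the structured output, which is how gaps in reporting could be counted.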