Study On Topic Model And Its Application To TCM Clinical Diagnosis And Treatment

Posted on: 2012-03-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X P Zhang
GTID: 1114330335451327
Subject: Computer application technology
Abstract/Summary:
Topic models extract the topics hidden in a collection of documents (or other discrete corpora), where each topic is a multinomial distribution over semantically related words. Their main purpose is to uncover the statistical regularities hidden in discrete corpora and to express this information directly as topics, which can then be used for information retrieval, classification, clustering, abstract extraction, similarity and relatedness estimation, and so on. Topic models have recently become a research focus in text mining, information retrieval and related fields.

Traditional Chinese Medicine (TCM), an important component of the traditional life sciences, has significant clinical efficacy in the diagnosis and treatment of diseases. A large amount of clinical data, containing much knowledge and many rules consistent with TCM theory, has accumulated over thousands of years of TCM practice. With the trend toward TCM informatics, it is very important to use modern techniques to mine the rules of TCM diagnosis and treatment hidden in clinical data. Although many methods, such as cluster analysis, association rules, regression analysis and discriminant analysis, have been used to study TCM theory, and some progress has been made, it is still difficult to reflect the defining characteristics of TCM: the semantic complexity and the systematic nature of its diagnosis and treatment.

In this dissertation, we introduce topic models, for the first time, to the study of the rules of TCM clinical diagnosis and treatment.
The motivation is twofold: topic models can capture the semantic characteristics hidden in TCM clinical data, and the inference and generative processes of topic models follow a route broadly consistent with the TCM process of "syndrome differentiation and treatment", which the classic Treatise on Exogenous Febrile Diseases describes as "inspect the pulse and symptoms, infer the disease, and then treat it". Both routes run from observable variables to latent variables and back to observable variables. We apply topic models to clinical data on type 2 diabetes mellitus (T2DM), clinical data on coronary heart disease, and the TCM literature. Experiments indicate that topic models can extract clinically meaningful regularities of diagnosis and treatment. They provide a research method for TCM clinical study and an objective foundation for TCM clinical diagnosis and treatment. The main contributions of this dissertation are as follows:

(1) Topic models, represented by Latent Dirichlet Allocation (LDA), have recently become a research focus in text mining and information retrieval. This dissertation systematically summarizes the background and development of topic models, the general inference methods for LDA, and some typical topic models. This material forms the basis of the research in this thesis and a reference for future researchers.

(2) We propose a feature weighting mechanism for the LDA model. When learning TCM clinical symptom topics with the original LDA model, we found that the word distributions of the topics are biased toward high-frequency words. That is, the feature words that characterize each topic are swamped by a few high-frequency words, which weakens the interpretability and discriminability of the topics and distorts the allocation of the other words across topics.
Therefore, we weight the feature words using the IDF method for standard text data and, for TCM clinical data, propose a novel feature word weighting method based on a Gaussian function. The experiments indicate that the weighted LDA model improves the interpretability and discriminability of topics, speeds up modeling, improves Support Vector Machine (SVM) classification accuracy on the Newsgroups dataset, and reduces perplexity under appropriate conditions.

(3) To address the problem that the number of topics cannot be determined automatically in the LDA model, we propose a latent topic model that combines the similarity between words with the Chinese Restaurant Process (CRP). In addition, to address the difficulty of setting the two Dirichlet hyperparameters sensibly during Gibbs sampling of topic models, we put forward a novel method for setting them. Experiments indicate that the proposed model can adaptively update topic contents and determine a reasonable number of topics, and that the hyperparameter setting method adapts conveniently to different datasets and yields low perplexity.

(4) By analyzing the relationship between topic models and TCM "syndrome differentiation and treatment", we propose the Symptom-Herb-Diagnosis Topic (SHDT) model, based on the LDA model and the Author-Topic model, to automatically extract the topic structure between symptoms and herb combinations and to explore the common relationships among clinically meaningful entities. On the clinical data of type 2 diabetes mellitus (T2DM), the SHDT model captures some meaningful diagnosis and treatment topics (clusters), which clinically indicate some important medical groups corresponding to comorbid diseases (e.g. diabetic kidney disease and diabetic peripheral neuropathy).
The experiments demonstrate that a class of symptoms, or a combination of symptoms, only provides a basis or evidence for classifying populations or diseases; it does not imply a distinct corresponding syndrome or diagnosis, and individualised TCM therapies exist. At the same time, common TCM diagnosis and treatment rules also exist. The results therefore show that this method helps reveal the distribution characteristics of disease symptoms and the rules of TCM diagnosis and treatment.

(5) Complex diseases such as T2DM have many kinds of comorbid diseases, so there are hierarchical relationships between the main symptoms and the concomitant symptoms of a disease. Likewise, there is a hierarchical structure among the herbs used to treat such diseases, which reflects the modification of prescriptions according to symptoms. To reveal the hierarchical latent topic structures of both symptoms and their corresponding herbs in TCM clinical data, we propose the Hierarchical Symptom-Herb Topic (HSHT) model, which combines the Hierarchical Latent Dirichlet Allocation (HLDA) model and the Link Latent Dirichlet Allocation (LinkLDA) model. Applying the HSHT model to clinical T2DM data, we obtain a meaningful hierarchical topic structure of symptoms and corresponding herbs. This offers a novel statistical method for studying the TCM clinical rules of prescription modification according to symptoms.
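
Contribution (3) relies on the Chinese Restaurant Process to avoid fixing the number of topics in advance. As a rough illustration only, the sketch below samples from the plain CRP prior: each new item joins an existing group with probability proportional to that group's size, or opens a new group with probability proportional to a concentration parameter. It is not the dissertation's model, which also uses word similarity; the function name `crp_assignments` and the parameter `alpha` are hypothetical names chosen for this example.

```python
import random


def crp_assignments(n_customers, alpha, seed=0):
    """Seat customers one at a time under a plain Chinese Restaurant Process.

    Hypothetical helper for illustration; alpha > 0 is the concentration
    parameter. Returns (assignments, table_sizes), where assignments[i] is
    the table index of customer i.
    """
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = number of customers at table k
    assignments = []
    for _ in range(n_customers):
        # Existing table k has weight table_sizes[k]; a new table has weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)      # open a new table (a new topic)
        else:
            table_sizes[k] += 1        # join an existing table
        assignments.append(k)
    return assignments, table_sizes


if __name__ == "__main__":
    assignments, sizes = crp_assignments(n_customers=200, alpha=1.0)
    print("number of tables:", len(sizes))
    print("table sizes:", sizes)
```

The number of occupied tables is an outcome of the sampling rather than an input, which is the property that makes the CRP attractive as a prior over the number of topics. Larger values of `alpha` tend to produce more tables.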
Keywords/Search Tags: Text Mining, Topic Model, Latent Dirichlet Allocation, Chinese Restaurant Process, Traditional Chinese Medicine, Symptom-Herb-Diagnosis Topic Model, Hierarchical Symptom-Herb Topic Model