
Issues In TCM Text Mining

Posted on: 2005-12-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Z Zhou
GTID: 1118360125969787
Subject: Computer applications
Abstract/Summary:
Text Mining is a new interdisciplinary field that combines the disciplines of Artificial Intelligence, Machine Learning, Data Mining and automatic text processing techniques (e.g. Information Extraction, Information Retrieval and Text Classification). It has been studied intensively, and it is often described as the natural extension of traditional KDD to unstructured text data. However, Text Mining is still in its infancy, and much work remains to be done on its approaches and applications. Traditional Chinese Medicine (TCM) is an important component of traditional medicine in life science, with distinctive Chinese characteristics. TCM has played a significant role in the healthcare of the Chinese people. It has clinical effectiveness and distinctive characteristics in disease diagnosis and treatment, and in Chinese Medical Formula and drug therapies. An immense amount of highly valuable medical data has been accumulated over several thousand years of practice. This large store of data provides the foundation for KDD and pushes it toward significant practical use. Because text mining studies in the TCM field are rare, this thesis presents several techniques and applications for TCM text mining. The studies are as follows:
Character-based Chinese text classification. A systematic comparative experiment has been conducted on character-based Chinese text classification, and the results show that the character is an efficient and effective feature for Chinese text classification. Furthermore, a novel feature generation method named Distributional Character Clustering (DCC) is proposed and achieves state-of-the-art performance. Its advantages include very low and almost fixed dimensionality (e.g. 102 features), no need for word segmentation, and high performance (DCC-based NB achieves performance similar to word-based SVM). It is a promising new feature representation method for Chinese text classification.
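As a rough illustration of why character features need no word segmentation, the sketch below trains a small multinomial Naive Bayes classifier (assuming that is what NB abbreviates here) directly on single characters. It is not Distributional Character Clustering, whose details the abstract does not give; the class name `CharNaiveBayes`, the helper `char_features`, and the toy training sentences are all hypothetical and chosen only for the example.

```python
import math
from collections import Counter, defaultdict


def char_features(text):
    """Treat every non-whitespace character as one feature (no word segmentation)."""
    return [ch for ch in text if not ch.isspace()]


class CharNaiveBayes:
    """Multinomial Naive Bayes over single characters with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.char_counts = defaultdict(Counter)   # character counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            chars = char_features(doc)
            self.char_counts[label].update(chars)
            self.vocab.update(chars)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_chars = sum(self.char_counts[label].values())
            for ch in char_features(doc):
                count = self.char_counts[label][ch]
                score += math.log((count + 1) / (total_chars + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, invented training data: two short texts per class.
train_docs = ["感冒发热咳嗽", "咳嗽发热头痛", "股票市场上涨", "市场股票下跌"]
train_labels = ["medicine", "medicine", "finance", "finance"]

model = CharNaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("头痛咳嗽")
```

A method such as DCC would presumably replace the raw character vocabulary with a much smaller set of character clusters before this classification step, which is how a low, nearly fixed dimensionality could be obtained.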
Terminology extraction by bootstrapping. Because TCM terms such as Chinese Medical Formula and disease names need to be extracted from TCM bibliographic literature, this thesis also studies bootstrapping methods for terminology extraction. A new bootstrapping method called Bubble-bootstrapping and ATP is proposed. It is a scalable and almost unsupervised information extraction method that needs neither shallow Chinese NLP techniques nor a labeled training corpus. Experiments on 400,000 bibliographic records show that the ATP-based Bubble-bootstrapping method achieves very high performance (about 99% precision and 65% F1 score). Furthermore, it achieves about 80% F1 score when applied to automatic subject indexing as a subheading extraction method.
Drug component frequent itemset discovery in clinical Chinese Medical Formula (CMF) from literature. This thesis proposes the concept of CMF drug plant family composition and presents a knowledge discovery study on it. A prototype system named MeDisco, built on text mining techniques, is proposed; it aims to implement drug component frequent itemset discovery on clinical Chinese Medical Formula from TCM bibliographic literature. The experiments show that CMF drug knowledge discovery using text mining is practically useful and valuable. Drug plant family compositional rules exist in CMF use, and they can be mined automatically from data.
Literature-based discovery across TCM and biomedical literature. Another text mining system called MeDisco/3S has been developed to uncover hidden knowledge across TCM literature and modern biomedical literature (Medline); it provides an approach to finding the functional relationships between TCM Symptom Complex and genes. MeDisco/3S offers a promising intelligent knowledge discovery platform to facilitate interdisciplinary research in life science. It is the first example effort of biomedical literature discovery and information integration in life science.
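The drug component frequent itemset discovery described above can be pictured with a small sketch. It assumes each formula has already been extracted from the literature as a list of component names, and it counts itemsets by exhaustive enumeration rather than with Apriori-style pruning; it is not the MeDisco implementation, which the abstract does not describe. The function `frequent_itemsets`, its parameters, and the placeholder component names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: count} for itemsets found in at least min_support transactions.

    Every combination of up to max_size items is counted directly, which is
    fine for short formulas but does not scale like a pruned algorithm would.
    """
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))  # ignore duplicates, fix a canonical order
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


# Toy, invented formulas with placeholder component names.
formulas = [
    ["herb_a", "herb_b", "herb_c"],
    ["herb_a", "herb_b"],
    ["herb_a", "herb_c", "herb_d"],
    ["herb_b", "herb_c"],
]

result = frequent_itemsets(formulas, min_support=2)
```

Replacing each component name with its plant family before counting would give family-level itemsets, which is one plausible reading of the plant family composition idea in the abstract.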
Keywords/Search Tags: Text Mining, Information Extraction, Knowledge Discovery in TCM, Chinese Text Classification, Distributional Character Clustering, Bubble-bootstrapping, ATP, Frequent Itemsets, Chinese Medical Formula, Relationship of Symptom Complex and Gene.