| Traditional Chinese Medicine (TCM) is an important component of traditional medicine with distinctive Chinese characteristics. Over 2500 years of practice, it has demonstrated clinical effectiveness and distinctive approaches to disease diagnosis and treatment. The Internet contains a large and rapidly growing body of medical information, so extracting medical knowledge from this mass of data is of great significance for TCM informatization and for clinical diagnosis and treatment. Web Mining is an efficient way to address this problem: it applies the basic theory of data mining to discover potential, valuable knowledge from large numbers of semi-structured Web pages, and it has been an important research direction in recent years. In this paper, using Web page classification and information extraction techniques, we design a TCM Knowledge Discovery System to support TCM informatization and clinical diagnosis and treatment. The main research contents of this paper are as follows: (1) Web page classification. Chinese text categorization has gradually become popular in Web Mining; its key technologies include text representation, weight calculation, feature extraction, and classification algorithms. This paper uses a Maximum Entropy classifier with a feature extraction method based on Chinese characters to identify medical Web pages. (2) Named entity recognition, which is of particular significance for information retrieval, machine translation, and the automatic indexing of documents. This paper introduces three statistical named entity recognition methods and describes the main characteristics of the conditional random field (CRF) model in comparison with other models used for sequence labeling. We use CRFs to extract disease names from Web pages. (3) We have implemented the key modules of the TCM Knowledge Discovery System, including the Web page collection module, the Web page preprocessing module, the Web page classification module, the named entity recognition module, and the relation building module. |
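
To illustrate the character-based feature extraction mentioned in item (1), the following is a minimal sketch only. The abstract does not specify the feature set, so the use of character unigrams and bigrams, the function name `char_ngram_features`, and the sample sentence are all assumptions made for illustration and not the system's actual implementation.

```python
from collections import Counter


def char_ngram_features(text, max_n=2):
    """Count character n-grams (n = 1..max_n) in a string.

    Hypothetical helper: working on characters avoids the need for a
    Chinese word segmenter before feature extraction.
    """
    # Drop whitespace so n-grams do not span spaces or line breaks.
    chars = [c for c in text if not c.isspace()]
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(chars) - n + 1):
            feats["".join(chars[i:i + n])] += 1
    return feats


# Illustrative sentence (assumed, not from the paper's corpus).
features = char_ngram_features("中医治疗感冒")
```

Counts of this kind could then be weighted and passed to a classifier such as a Maximum Entropy model; the weighting scheme and training procedure are not described in the abstract and are left out here.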