Research On Text Mining In Biomedical Literature

Posted on: 2009-08-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z H Yang
GTID: 1118360242984638
Subject: Computer application technology
Abstract/Summary:
It is well understood that the biomedical literature is growing at an astounding pace, and these vast collections of publications offer an excellent opportunity to discover hidden biomedical knowledge by applying text mining technologies. Text mining helps biomedical researchers efficiently find both the information they need and hidden knowledge in this huge body of literature, mainly by means of natural language processing and machine learning.

This dissertation first introduces text mining technologies and their applications in the biomedical field, and then presents the author's work in this field.

A dictionary-based bio-entity name recognition approach using an improved edit distance algorithm is presented. The approach expands the dictionary with an algorithm that identifies abbreviation definitions and improves recall through the improved edit distance algorithm. Performance is then further improved by language knowledge-based methods, including POS (part-of-speech) expansion and the exploitation of contextual cues, and by rule-based methods, including First-keywords and Post-keywords expansion and the merging of adjacent entity names. Experimental results on JNLPBA2004 show that this method achieves a much better F-score (68.48%) than the exact matching baseline (47.7%).

Machine learning techniques are currently the most popular methods, but their performance still leaves much room for improvement. This dissertation presents a conditional random field-based bio-entity name recognition approach and studies how to improve its performance by exploiting contextual cues, including bracket pairs, heuristic syntax structure, and interaction word cues.
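To make the idea of contextual cues concrete, the following is a minimal sketch of how bracket-pair and interaction-word cues might be encoded as per-token features for a sequence labeler. It is an illustration only: the abstract does not specify the dissertation's actual feature set, and the names `token_features` and `INTERACTION_WORDS`, the word list, and the window size of three tokens are all assumptions.

```python
# Hypothetical list of interaction words; the dissertation's real list is not given.
INTERACTION_WORDS = {"binds", "interacts", "activates", "inhibits", "phosphorylates"}


def token_features(tokens):
    """Return one feature dict per token, including two contextual cues.

    in_brackets: the token lies between "(" and ")" (bracket-pair cue).
    near_interaction_word: an interaction word occurs within 3 tokens.
    """
    features = []
    depth = 0  # current bracket nesting depth
    for i, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
        window = tokens[max(0, i - 3): i + 4]
        features.append({
            "word": tok.lower(),
            "has_digit": any(ch.isdigit() for ch in tok),
            "has_upper": any(ch.isupper() for ch in tok),
            "in_brackets": depth > 0 and tok not in {"(", ")"},
            "near_interaction_word": any(
                w.lower() in INTERACTION_WORDS for w in window
            ),
        })
        if tok == ")" and depth > 0:
            depth -= 1
    return features


tokens = "Interleukin 2 ( IL-2 ) binds IL-2R".split()
feats = token_features(tokens)
```

Feature dictionaries of this kind could be passed to any conditional random field toolkit that accepts per-token features; which toolkit and which label scheme the dissertation used is not stated in the abstract.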
Experimental results on both the JNLPBA2004 and BioCreative2004 task 1A datasets show that these methods can improve conditional random field-based recognition performance by about 3 percentage points in F-score.

Automatically extracting protein-protein interaction information from the biomedical literature can help to build protein relation networks, predict protein function, and design new drugs. Extraction methods based on natural language processing usually achieve relatively good precision. This dissertation presents a Link Grammar-based protein-protein interaction extraction approach. The approach applies a conditional random field model to tag protein names in biomedical text, then uses a Link Grammar parser to identify the syntactic roles in sentences, and finally extracts complete interactions by analyzing the matching contents of syntactic roles and their linguistically significant combinations. Experimental comparisons with two other state-of-the-art extraction systems indicate that this approach achieves better performance.

Machine learning and statistical methods usually achieve higher recall. This dissertation also presents an SVM-based protein-protein interaction extraction approach. The approach uses four kinds of features: word features, keyword features, an entity distance feature, and a link path feature. In addition, a Link Grammar extraction result feature is introduced to improve precision; it improves precision considerably with little loss of recall. Experimental comparisons with other systems indicate that this approach achieves much better recall, and its F-score is also higher than theirs.

Vast collections of biomedical publications offer an excellent opportunity for the automatic discovery of hidden knowledge.
This dissertation describes the content and development of research on hidden knowledge discovery in biomedical literature and presents a biomedical hidden knowledge discovery approach. The approach extracts related biomedical concepts from both MeSH (Medical Subject Headings) and free text (title and abstract), and extracts them more effectively than when only one of these sources is used. In addition, by means of UMLS biomedical resources, the approach performs query expansion and thereby improves the recall of related records. The approach also greatly reduces the search space through a semantic filter. An experiment on Fish Oils and Raynaud's disease shows the effectiveness of this approach.
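The discovery step with a semantic filter can be pictured with a small sketch. It assumes a co-occurrence model in which a starting concept is linked to intermediate concepts, which are in turn linked to candidate target concepts; the abstract does not spell out the dissertation's actual model, so this is a stand-in. The records, the semantic type labels, and the function names `cooccurring` and `discover` are hypothetical toy values, not data from the dissertation.

```python
from collections import Counter

# Hypothetical toy records: each is the set of concepts found in one record's
# MeSH headings plus its title and abstract.
RECORDS = [
    {"fish oils", "blood viscosity"},
    {"fish oils", "platelet aggregation"},
    {"blood viscosity", "raynaud disease"},
    {"platelet aggregation", "raynaud disease"},
    {"blood viscosity", "hospitals"},
]

# Hypothetical semantic types, standing in for a UMLS-style lookup.
SEMANTIC_TYPE = {
    "fish oils": "substance",
    "blood viscosity": "physiologic function",
    "platelet aggregation": "physiologic function",
    "raynaud disease": "disease",
    "hospitals": "organization",
}


def cooccurring(concept, records):
    """Count the concepts that appear in the same record as `concept`."""
    counts = Counter()
    for rec in records:
        if concept in rec:
            counts.update(rec - {concept})
    return counts


def discover(start, records, link_types, target_types):
    """Rank concepts linked to `start` only through intermediate concepts."""
    direct = cooccurring(start, records)
    # Semantic filter on the intermediate concepts shrinks the search space.
    links = {b for b in direct if SEMANTIC_TYPE.get(b) in link_types}
    known = set(direct) | {start}
    scores = Counter()
    for b in links:
        for c in cooccurring(b, records):
            # Keep only targets of the wanted type that are not already
            # directly connected to the starting concept.
            if c not in known and SEMANTIC_TYPE.get(c) in target_types:
                scores[c] += 1
    return scores.most_common()


result = discover("fish oils", RECORDS, {"physiologic function"}, {"disease"})
```

In this toy setup the filter discards "hospitals" because its semantic type is not a wanted target type, which is the sense in which a semantic filter reduces the number of candidates to inspect.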
Keywords/Search Tags: Natural Language Processing, Text Mining, Entity Name Recognition, Entity Relation Extraction