Research On Model And Algorithms For Mining Disease-centric Relationships In Biomedicine Literatures

Posted on: 2014-04-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Yang
GTID: 1268330422962328
Subject: Computer application technology
Abstract/Summary:
The rapidly increasing amount of literature in the biomedical domain promotes the application of text mining. As one of the hot topics in this area, biomedical text mining can extract useful knowledge from a large body of literature rapidly and efficiently. Biomedical text mining techniques include information retrieval, text classification, named entity recognition, relationship extraction and hypothesis generation. With the rapid development of gene techniques, understanding pathogenic mechanisms at the molecular level has become very important. Mining disease relationships and building a disease-centric network from the biomedical literature can provide evidence for hypothesis generation by scientists. Mining hidden information about diseases is valuable for disease prevention and the development of new drugs. After good performance is achieved on biomedical named entity recognition, ontology annotation is carried out on the result of a classification of the literature. Subsequently, relationships between diseases and other entities are predicted.

Most methods for biomedical named entity recognition are single-phase; that is, they treat term boundary detection and semantic labeling as one task. The semi-Markov conditional random fields model (semi-CRFs) assigns a label to a segment rather than to a single word, which is more natural than other machine learning methods. We present a two-phase approach based on semi-CRFs and explore novel feature sets for classifying the entities in text into 5 types: protein, DNA, RNA, cell_line and cell_type. Our approach divides the biomedical named entity recognition (NER) task into two sub-tasks: term boundary detection and semantic labeling. In the first phase, the term boundary detection sub-task detects the boundaries of the entities and classifies them all into one type, C. In the second phase, the semantic labeling sub-task assigns the correct entity type to the entities detected in the first phase.
We explore novel feature sets in both phases to improve performance. Our experiments based on semi-CRFs, without deep domain knowledge or post-processing algorithms, achieve an F-score of 74.64% on the JNLPBA 2004 corpus, which outperforms most state-of-the-art systems.

Up to now, biomedical text mining for diseases has been limited to the recognition of disease names. Little work focuses on the types of diseases and the relations between diseases. Recognizing biomedical concepts in the literature is not enough on its own; annotating and normalizing the concepts with a normalized Metathesaurus is even more important. We propose a system to annotate the literature with a normalized Metathesaurus. First, a two-phase semi-Markov conditional random fields (semi-CRFs) model is used to recognize the disease mentions, including the location and identification. Then, we adapt the Disease Ontology (DO) to annotate the recognized diseases for normalization by computing the similarity between disease mentions and concepts. According to the similarities, the disease mentions are marked as either disease concepts or instances. The experiments carried out on the Arizona Disease Corpus show that our system performs well and outperforms other works.

There is a lot of knowledge hidden in the biomedical literature. With the ever-increasing amount of biomedical literature, mining relations automatically is urgently needed. The relations between diseases and gene functions remain to be mined. We propose a method to mine relations between diseases with common gene functions in the literature with a normalized Metathesaurus. First, a two-phase semi-CRFs model is used to recognize the disease mentions and gene function mentions, including the location and identification. Then, we adapt the Disease Ontology (DO) and the Gene Ontology (GO) to annotate the recognized diseases and gene functions for normalization by computing the similarity between mentions and concepts.
According to the similarities, the mentions are marked as either concepts or instances. Third, we build a network and measure relations between diseases by computing similarities between common sub-graphs. The experiments are carried out on a corpus randomly selected with GoPubMed, covering diseases and the three domains in GO. The results reveal many hidden relations between diseases and provide an explanation for them.

Finally, hypothesis generation for diseases is addressed. We build semantic networks among diseases, gene functions and drug entities, extract semantic sub-networks about diseases, and obtain semantic relationships among diseases and other entities from text. We apply semantic extension to entities using a topic model. The documents are classified into four topics: diseases, diseases and gene functions, drugs and gene functions, and diseases and drugs. We mine hidden relationships among diseases according to co-occurrence in sentences and the semantic association of entities.

Hence, the disease network built by the above methods has good application value. It can generate hypotheses about diseases, drugs and gene functions, and then provide evidence for researchers to test.
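
The idea of relating diseases through common gene functions and sentence-level co-occurrence can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the dissertation's method: the abstract does not say which similarity measure is used, so plain Jaccard overlap stands in for the sub-graph similarity, and the input format, the function names (`build_disease_profiles`, `jaccard`, `related_diseases`) and the toy annotations are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_disease_profiles(sentences):
    """Map each disease to the gene functions it co-occurs with in a sentence.

    `sentences` is a hypothetical input format: one dict per sentence with
    the already-normalized disease and gene function concepts found in it.
    """
    profiles = defaultdict(set)
    for sent in sentences:
        for disease in sent["diseases"]:
            profiles[disease].update(sent["gene_functions"])
    return dict(profiles)


def jaccard(a, b):
    """Jaccard overlap of two sets (an assumed stand-in similarity)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def related_diseases(profiles):
    """Return disease pairs that share at least one gene function.

    Each result is (disease_1, disease_2, score, shared_functions),
    sorted by descending score.
    """
    links = []
    for d1, d2 in combinations(sorted(profiles), 2):
        shared = profiles[d1] & profiles[d2]
        if shared:
            score = jaccard(profiles[d1], profiles[d2])
            links.append((d1, d2, score, sorted(shared)))
    return sorted(links, key=lambda link: -link[2])


# Toy annotations invented for illustration; they carry no biological claim.
toy_sentences = [
    {"diseases": {"disease A", "disease B"}, "gene_functions": {"function X"}},
    {"diseases": {"disease A"}, "gene_functions": {"function Y"}},
    {"diseases": {"disease C"}, "gene_functions": {"function Z"}},
]

for link in related_diseases(build_disease_profiles(toy_sentences)):
    print(link)
```

In this toy input, only disease A and disease B are linked, because they are the only pair sharing a gene function. The shared functions are returned alongside the score so that each proposed link comes with the evidence behind it.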
Keywords/Search Tags: Text Mining in Biomedicine Literatures, Semi-Markov Conditional Random Fields, Two-Phase, Ontology Annotation, Semantic Mining, Disease Relationship