
Research On Model And Algorithms For Mining Disease-Centric Relationships In Biomedicine Literatures

Posted on: 2014-02-10  Degree: Doctor  Type: Dissertation
Country: China  Candidate: L Yang  Full Text: PDF
GTID: 1228330398986736  Subject: Computer application technology
Abstract/Summary:
The rapidly growing volume of literature in the biomedical domain has driven the application of text mining. As an active research topic, biomedical text mining can extract useful knowledge from large bodies of literature quickly and efficiently. Its techniques include information retrieval, text classification, named entity recognition, relationship extraction, and hypothesis generation. With the rapid development of genetic techniques, understanding pathogenic mechanisms at the molecular level has become very important. Mining disease relationships and building disease-centric networks from the biomedical literature can provide scientists with evidence for hypothesis generation. Mining hidden information about diseases is valuable for disease prevention and for the development of new drugs. Once biomedical named entity recognition performs well, ontology annotation is carried out on the results of literature classification. Subsequently, relationships between diseases and other entities are predicted.

Most methods for biomedical named entity recognition are single-phase; that is, they treat term boundary detection and semantic labeling as one task. The semi-Markov conditional random fields model (semi-CRFs) assigns a label to a segment rather than to a single word, which is more natural than other machine learning methods. We present a two-phase approach based on semi-CRFs and explore novel feature sets for classifying the entities in text into 5 types: protein, DNA, RNA, cell_line, and cell_type. Our approach divides the biomedical named entity recognition (NER) task into two sub-tasks: term boundary detection and semantic labeling. In the first phase, the term boundary detection sub-task detects the boundaries of the entities and assigns all of them a single generic type C. In the second phase, the semantic labeling sub-task assigns the correct entity type to each entity detected in the first phase.
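As a rough illustration of the two-phase decomposition described above, the following Python sketch separates boundary detection from semantic labeling. It does not implement semi-CRFs: both phases are replaced by toy lexicon lookups, and the names `TOY_LEXICON`, `MAX_SEGMENT_LEN`, `detect_boundaries`, and `label_segments` are hypothetical, introduced only for this example.

```python
from typing import List, Tuple

# Hypothetical toy lexicon standing in for a trained model (not from the dissertation).
TOY_LEXICON = {
    ("IL-2",): "protein",
    ("IL-2", "gene"): "DNA",
    ("T", "cells"): "cell_type",
}
MAX_SEGMENT_LEN = 2  # assumed upper bound on segment length for this toy example


def detect_boundaries(tokens: List[str]) -> List[Tuple[int, int, str]]:
    """Phase 1: find entity segments and give each the generic type 'C'."""
    segments = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Prefer the longest segment that starts at position i and fits in the sentence.
        for length in range(MAX_SEGMENT_LEN, 0, -1):
            if i + length <= len(tokens) and tuple(tokens[i:i + length]) in TOY_LEXICON:
                match_len = length
                break
        if match_len:
            segments.append((i, i + match_len, "C"))
            i += match_len
        else:
            i += 1
    return segments


def label_segments(
    tokens: List[str], segments: List[Tuple[int, int, str]]
) -> List[Tuple[int, int, str]]:
    """Phase 2: replace the generic type 'C' with a specific entity type."""
    return [(s, e, TOY_LEXICON[tuple(tokens[s:e])]) for s, e, _ in segments]


tokens = "The IL-2 gene is expressed in activated T cells".split()
phase1 = detect_boundaries(tokens)       # segments typed only as 'C'
phase2 = label_segments(tokens, phase1)  # same segments with specific types
```

In a real system each phase would be a trained sequence model with its own feature set; the point of the sketch is only that the second phase consumes the segment boundaries produced by the first.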
We explore novel feature sets in both phases to improve performance. Our experiments, based on semi-CRFs without deep domain knowledge or post-processing algorithms, achieve an F-score of 74.64% on the JNLPBA 2004 corpus, which outperforms most state-of-the-art systems.

Up to now, biomedical text mining for diseases has been limited to the recognition of disease names. Little work focuses on disease types or on relations between diseases. Recognizing biomedical concepts in the literature is not enough on its own; annotating and normalizing those concepts against a normalized Metathesaurus is even more important. We propose a system that annotates the literature with a normalized Metathesaurus. First, a two-phase semi-Markov conditional random fields (semi-CRFs) model is used to recognize disease mentions, covering both their location and their identification. Then, we adapt the Disease Ontology (DO) to annotate the recognized diseases for normalization by computing the similarity between disease mentions and concepts. According to the similarities, the disease mentions are classified as either disease concepts or instances. Experiments carried out on the Arizona Disease Corpus show that our system performs well and outperforms other work.

A great deal of knowledge is hidden in the biomedical literature. With the ever-increasing volume of that literature, mining relations automatically has become urgent. The relations between diseases and gene functions remain to be mined. We propose a method to mine relations between diseases that share common gene functions, using literature annotated with a normalized Metathesaurus. First, a two-phase semi-CRFs model is used to recognize disease mentions and gene function mentions, covering both their location and their identification. Then, we adapt the Disease Ontology (DO) and the Gene Ontology (GO) to annotate the recognized diseases and gene functions for normalization by computing the similarity between mentions and concepts.
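The normalization step, mapping a recognized mention to an ontology concept by similarity, might be sketched as follows. The abstract does not say which similarity measure or decision rule is used, so everything here is an assumption: token-level Jaccard similarity, a threshold of 0.5, and the rule that an exact match is a concept while a partial match is an instance. The ontology entries and their IDs (`X1`, `X2`, `X3`) are placeholders, not real Disease Ontology identifiers.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mini-ontology; the IDs are placeholders, not real DO identifiers.
TOY_ONTOLOGY: Dict[str, str] = {
    "X1": "breast cancer",
    "X2": "lung cancer",
    "X3": "diabetes mellitus",
}


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed measure, chosen for simplicity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def annotate(mention: str, threshold: float = 0.5) -> Optional[Tuple[str, str, float]]:
    """Map a mention to its most similar concept in the toy ontology.

    Assumed rule: similarity 1.0 -> 'concept'; at least `threshold` -> 'instance';
    anything lower is left unannotated (None).
    """
    best_id, best_sim = max(
        ((cid, jaccard(mention, name)) for cid, name in TOY_ONTOLOGY.items()),
        key=lambda pair: pair[1],
    )
    if best_sim < threshold:
        return None
    kind = "concept" if best_sim == 1.0 else "instance"
    return best_id, kind, best_sim
```

For example, `annotate("Breast cancer")` matches `X1` exactly and is treated as a concept, while `annotate("metastatic breast cancer")` matches `X1` only partially and is treated as an instance under the assumed rule.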
According to the similarities, the mentions are classified as either concepts or instances. Third, we build a network and measure relations between diseases by computing similarities between common sub-graphs. The experiments are carried out on a corpus randomly selected via GoPubMed, covering diseases and the three domains of GO. The results reveal many hidden relations between diseases and offer an explanation for them.

Finally, we address hypothesis generation for diseases. We build semantic networks among diseases, gene functions, and drug entities, extract disease-related sub-networks, and obtain semantic relationships between diseases and other entities from text. We perform semantic extension of entities using a topic model. The documents are classified into four topics: diseases; diseases and gene functions; drugs and gene functions; and diseases and drugs. We mine hidden relationships among diseases based on sentence-level co-occurrence and the semantic association of entities.

Hence, the disease network built by the above methods has practical application value. It can generate hypotheses about diseases, drugs, and gene functions, and thereby provide evidence for researchers to test.
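The idea of relating diseases through shared gene functions and sentence-level co-occurrence can be illustrated with a small Python sketch. It is a simplified stand-in, not the method of the dissertation: the sub-graph similarity is replaced by a plain Jaccard overlap of gene function sets, and the annotated sentences, entity names, and the helpers `build_profiles` and `disease_similarity` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

# Hypothetical pre-annotated sentences: (diseases, gene functions) found in each sentence.
sentences = [
    ({"disease_A"}, {"apoptosis", "cell proliferation"}),
    ({"disease_B"}, {"apoptosis"}),
    ({"disease_B"}, {"cell proliferation"}),
    ({"disease_C"}, {"lipid transport"}),
]


def build_profiles(annotated: Iterable[Tuple[Set[str], Set[str]]]) -> Dict[str, Set[str]]:
    """Collect, per disease, the gene functions it co-occurs with in a sentence."""
    profiles: Dict[str, Set[str]] = defaultdict(set)
    for diseases, functions in annotated:
        for disease in diseases:
            profiles[disease] |= functions
    return profiles


def disease_similarity(profiles: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Score disease pairs by Jaccard overlap of their gene function sets.

    Pairs with no shared gene function are omitted.
    """
    scores = {}
    for d1, d2 in combinations(sorted(profiles), 2):
        shared = profiles[d1] & profiles[d2]
        if shared:
            scores[(d1, d2)] = len(shared) / len(profiles[d1] | profiles[d2])
    return scores


profiles = build_profiles(sentences)
scores = disease_similarity(profiles)
```

In this toy data `disease_A` and `disease_B` never appear in the same sentence, yet they are linked through the gene functions they share, which is the kind of hidden relationship the abstract describes.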
Keywords/Search Tags: Text Mining in Biomedicine Literatures, Semi-Markov Conditional Random Fields, Two-Phase, Ontology Annotation, Semantic Mining, Disease Relationship