
Extraction Of Disease-Related Genes From The Literature

Posted on: 2017-08-18
Degree: Master
Type: Thesis
Country: China
Candidate: D Xu
Full Text: PDF
GTID: 2370330590991532
Subject: Computer Science and Technology
Abstract/Summary:
Along with the rapid growth of the biomedical literature, it becomes increasingly difficult to find useful knowledge in it. In this work, we propose a semi-automatic framework, based on text mining and natural language processing, to extract gene-disease associations from a large body of biomedical literature. Our association mining framework consists of three phases: named entity recognition, association detection, and ranking.

In the named entity recognition (NER) phase, we extended the existing gene and disease databases. We then designed a MEDRA-based longest-match strategy to recognize gene and disease terms in MEDLINE abstracts and article titles. In addition, a number of heuristic rules are applied to filter out erroneous terms recognized by the dictionaries. This hybrid technique achieves an F1-score of 0.84 in recognizing genes and diseases in MEDLINE abstracts.

In the association detection phase, all recognized gene-disease pairs that co-occur within the same sentence are considered candidate evidence. A binary SVM classifier is used to determine the plausibility of each candidate pair. Two types of features are extracted for this classifier. The local lexical features are the words surrounding the gene or disease terms in the original text. The global syntactic features are unigrams, bigrams, and trigrams drawn from 1) the shortest path between the gene and disease terms in the dependency tree, and 2) the path between the least common ancestor of the two terms and the root of the dependency tree. Ten-fold cross-validation of the model with 1000 positive and 1000 negative samples shows an F1-score of 0.934.

In the ranking phase, the positive pairs can be ranked by three methods. The basic method ranks by co-occurrence frequency. The second method weights each co-occurrence by the PageRank of the paper from which the evidence was extracted, computed on a paper citation network constructed from PubMed. The last and most advanced method additionally accounts for duplicated evidence published by the same author and suppresses the contribution of such evidence. Our evaluation on 10 diseases shows that the MRR scores of these three rankings are 0.249, 0.281, and 0.293, respectively. In addition, if we treat disease-to-gene association as an information retrieval problem, the F1-score for the top 50 genes associated with a disease by the third ranking method reaches 0.259, which is significantly higher than that of existing systems on similar tasks.
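As an illustration of the basic ranking method and the MRR evaluation described above, here is a minimal Python sketch. It is not the thesis implementation. The function names, the input shapes, and the convention of scoring each disease by the reciprocal rank of its first known gene are all assumptions made for this example, and the gene and paper identifiers are placeholders.

```python
from collections import Counter


def rank_genes_by_cooccurrence(evidence):
    """Rank genes for one disease by how often they co-occur with it.

    `evidence` is an iterable of (gene, paper_id) pairs that a classifier
    has already judged positive (hypothetical input format).
    """
    counts = Counter(gene for gene, _ in evidence)
    # Descending count; ties broken by gene name for a deterministic order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gene for gene, _ in ordered]


def mean_reciprocal_rank(rankings, gold):
    """Average, over diseases, of 1 / rank of the first known gene.

    `rankings` maps disease -> ranked gene list; `gold` maps
    disease -> set of known associated genes. A disease with no known
    gene in its ranking contributes 0 (an assumed convention).
    """
    if not rankings:
        return 0.0
    total = 0.0
    for disease, ranked in rankings.items():
        known = gold.get(disease, set())
        for position, gene in enumerate(ranked, start=1):
            if gene in known:
                total += 1.0 / position
                break
    return total / len(rankings)
```

The PageRank-weighted and author-aware variants would replace the unit count per co-occurrence with a per-paper weight; the abstract does not give those formulas, so they are left out of this sketch.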
Keywords/Search Tags: Gene-Disease Association, Text Mining, Named Entity Recognition, Relation Extraction