
Information extraction to enable faceted search over large text document collections

Posted on: 2011-07-02
Degree: Ph.D
Type: Dissertation
University: Arizona State University
Candidate: Ahmed, Syed Toufeeq
Full Text: PDF
GTID: 1468390011971504
Subject: Computer Science
Abstract/Summary:
Recent advances in computational and biological methods have dramatically changed the scale of biomedical research, bringing unprecedented growth over the last two decades in both the production of biomedical data and the amount of published literature discussing it. Complete genomes can now be sequenced within months or even weeks; computational methods can expedite the identification of tens of thousands of genes and large-scale experimental methods. The data generated by these experiments is highly interconnected; the results from sequence analysis and microarrays depend on functional information and signal transduction pathways cited in peer-reviewed publications for evidence.

Imagine a biologist researching a cure for a disease such as leukemia. She currently has to read all the published research that deals with this disease and find all the proteins, genes, and other information, such as drugs and chemicals, that will help her better understand the molecular connections (pathways) between these substances and the disease. Even though many systems aid in accessing and browsing these document collections, the sheer volume and depth of the information can be overwhelming. An automated extraction system coupled with a cognitive search and navigation service over these document collections would not only save time and effort, but also pave the way to discovering hitherto unknown information implicitly conveyed in the texts.

This dissertation discusses practical information extraction systems that can also populate faceted search and navigation systems to enable discovery of important semantic relationships between entities such as genes, diseases, drugs, and cell lines. It presents an automated system to extract biomolecular events from biomedical text. The system first semantically classifies each sentence into the class of the event mentioned in the sentence and then, using class-specific rules, extracts the participants of that event. An integrative framework that fuses faceted search with information extraction is also proposed, to provide a search service that addresses the user's desideratum of "completeness" of query results, not just the top-ranked ones. To demonstrate the utility of this framework, the dissertation also details a prototype enterprise-quality search and discovery service that helps life sciences researchers with guided, step-by-step query refinement by suggesting concepts enriched in intermediate results, thereby facilitating the "discover more as you search" paradigm powered by information extraction.
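The two-stage extraction described in the abstract (classify each sentence by event type, then apply rules specific to that type) could be sketched roughly as follows. This is only an illustration, not the dissertation's implementation: the keyword-based classifier stands in for the semantic classifier, and the trigger words, regular expressions, role names (`agent`, `theme`), and function names are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical trigger words per event class. A keyword lookup is a
# stand-in for the semantic sentence classifier, which the abstract
# does not describe in detail.
TRIGGERS = {
    "phosphorylation": ("phosphorylates", "phosphorylated"),
    "binding": ("binds", "interacts with"),
}

# Hypothetical class-specific rules: one regex per event class, with
# named groups for the event participants.
RULES = {
    "phosphorylation": re.compile(
        r"(?P<agent>\w+) phosphorylates (?P<theme>\w+)"
    ),
    "binding": re.compile(
        r"(?P<theme1>\w+) (?:binds|interacts with) (?P<theme2>\w+)"
    ),
}


def classify(sentence: str) -> Optional[str]:
    """Stage 1: assign the sentence to an event class, or None."""
    lowered = sentence.lower()
    for event_class, words in TRIGGERS.items():
        if any(word in lowered for word in words):
            return event_class
    return None


def extract_event(sentence: str) -> Optional[dict]:
    """Stage 2: apply only the rule belonging to the predicted class."""
    event_class = classify(sentence)
    if event_class is None:
        return None
    match = RULES[event_class].search(sentence)
    if match is None:
        return None
    return {"type": event_class, **match.groupdict()}


if __name__ == "__main__":
    print(extract_event("MAPK phosphorylates ELK1 in vitro."))
    print(extract_event("RAD51 binds BRCA2."))
```

Running the classifier first means each sentence is tested against only one rule set, so rules for one event type cannot fire on sentences that describe another.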
Keywords/Search Tags: Search, Information extraction