Research And Implementation Of Text Mining For Transcription Regulatory Information

Posted on: 2010-03-17
Degree: Master
Type: Thesis
Country: China
Candidate: Q Yang
GTID: 2178360275491628
Subject: Computer software and theory
Abstract/Summary:
Biological data on the regulatory mechanisms of eukaryotic organisms is growing day by day. A transcription factor is a special kind of protein that regulates gene expression by binding cis-regulatory elements, which are usually located upstream of the gene. A large amount of information about transcription factors and cis-regulatory elements is now stored in documents, and mining or extracting this information is a major challenge. Experts usually extract it by slow manual reading rather than with computer assistance. To help biology experts, this thesis proposes and implements two main algorithms. The first algorithm mines sentences in biology documents that describe cis-regulatory elements. It extends the vector space model of traditional information retrieval systems by adding two-word phrase dimensions and part-of-speech information. From the training text, the system learns a model that captures the sentence context of cis-regulatory element descriptions. Given a sentence, the first algorithm converts it into an extended vector space representation and compares it with the trained system model. A sentence similarity function scores the sentence against the model, and the sentence is treated as a target when the score exceeds a given threshold. The second algorithm extracts more specific information, including the text segments that name transcription factors and binding sites. From the given training data, the algorithm constructs a context-free grammar and uses the Earley algorithm to analyse sentence structure. After extracting the noun phrases and verb phrases, it builds a knowledge base. Each sentence to be analysed is split into noun phrases and verb phrases, which are compared with the knowledge base. Only noun phrases that match the knowledge base are regarded as candidates. Both algorithms are implemented in Java. Recall and precision are above 60% in the corresponding experiments.
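
The first algorithm's idea (a vector space model extended with two-word phrase dimensions, scored against a trained model and compared with a threshold) can be illustrated with a minimal Java sketch. Everything specific here is an assumption made for illustration, not the thesis's actual design: the class and method names, the use of raw counts as weights, cosine similarity as the sentence similarity function, summing training vectors as the "system model", and the threshold value. The part-of-speech dimension is omitted because it would require a tagger.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names, weighting, and similarity function are assumptions.
public class ExtendedVsmSketch {

    // Turn a sentence into a sparse vector of word counts ("w:") and
    // two-word phrase counts ("p:"). Part-of-speech features are left out.
    static Map<String, Double> vectorize(String sentence) {
        Map<String, Double> vec = new HashMap<>();
        String prev = null;
        for (String tok : sentence.toLowerCase().split("[^a-z0-9]+")) {
            if (tok.isEmpty()) continue;
            vec.merge("w:" + tok, 1.0, Double::sum);
            if (prev != null) {
                vec.merge("p:" + prev + " " + tok, 1.0, Double::sum);
            }
            prev = tok;
        }
        return vec;
    }

    // Assumed "system model": the sum of the vectors of all training sentences.
    static Map<String, Double> train(String[] sentences) {
        Map<String, Double> model = new HashMap<>();
        for (String s : sentences) {
            for (Map.Entry<String, Double> e : vectorize(s).entrySet()) {
                model.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return model;
    }

    // Cosine similarity between two sparse vectors; 0.0 if either is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) normB += v * v;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // A sentence is treated as a target when its score exceeds the threshold.
    static boolean isTarget(Map<String, Double> model, String sentence, double threshold) {
        return cosine(model, vectorize(sentence)) > threshold;
    }

    public static void main(String[] args) {
        // Invented training sentences, for illustration only.
        String[] training = {
            "The TATA box is a cis-regulatory element bound by TBP",
            "GATA1 binds the cis-regulatory element upstream of the gene"
        };
        Map<String, Double> model = train(training);
        double threshold = 0.3; // assumed value
        System.out.println(isTarget(model,
            "Sp1 binds a cis-regulatory element upstream of the promoter", threshold));
        System.out.println(isTarget(model,
            "Samples were stored at low temperature overnight", threshold));
    }
}
```

The two-word phrase features let a sentence score higher when it shares word order with the training sentences (for example "regulatory element"), not just isolated words, which is the motivation the abstract gives for extending the plain vector space model.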
Keywords/Search Tags: Transcription factor, Cis-regulatory elements, Text mining, Data mining, Bioinformatics