
Optimization And Application Of DNA Sequence Word Segmentation Method

Posted on: 2014-07-24
Degree: Master
Type: Thesis
Country: China
Candidate: L N Zhang
Full Text: PDF
GTID: 2250330422451942
Subject: Computer technology
Abstract/Summary:
There is a close relationship between DNA sequences and natural language. The DNA sequence contains all the genetic information that controls heredity and evolution. Some researchers even regard the DNA sequence as a kind of "language of life" and try to analyze its structure and function from a linguistic point of view.

To read the DNA sequence, we need to analyze the entire sequence in terms of words. This paper attempts to apply natural language word segmentation methods to the DNA sequence and, through word segmentation, to analyze the structure and composition of the DNA sequence.

First, this paper trains a CRFs model and uses it to segment the DNA sequence into words. Because too little is known about the functional words of DNA sequences, the segmentation result is relatively poor.

Second, this paper proposes a crossover study method between natural language and DNA sequences. On the one hand, it uses information about natural language words to train the CRFs model and then segments the DNA sequence. On the other hand, it cross-codes the natural language and the DNA sequence and then uses the encoded sequence for word segmentation. The crossover study method achieves a recall rate of 75%, and the crossover coding method increases the recall rate by 5% compared with the crossover study method.

Third, under the assumption that different functional areas of the DNA sequence have different functional words, this paper proposes a word segmentation method based on different local areas of the DNA sequence and applies it to several regions: the promoter region, the extended area of transcription factor binding sites, the non-coding region and the HOX genome region. The results for these regions achieve the same recall rate as the crossover study method.
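The recall figures above can be made concrete with a small sketch. It assumes a fully segmented reference, a per-character B/M/E/S tagging scheme of the kind commonly used to train CRF segmenters, and a span-based definition of recall; none of these details are stated in the abstract, and the function names and toy sequences are hypothetical.

```python
def to_tags(words):
    """Label each character as B (begin), M (middle), E (end) or S (single).

    This is a common way to turn a segmentation into a sequence-labelling
    target for a CRF; the thesis may use a different tag set.
    """
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def recall(reference_words, predicted_words):
    """Fraction of reference words whose exact span appears in the prediction."""
    ref = spans(reference_words)
    if not ref:
        return 0.0
    return len(ref & spans(predicted_words)) / len(ref)


# Toy example (made-up words, not data from the thesis).
reference = ["TATA", "AAGG", "CC"]
predicted = ["TATA", "AA", "GGCC"]
tags = to_tags(reference)
score = recall(reference, predicted)
```

Here only the span of `TATA` is recovered exactly, so one of the three reference words counts as a hit.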
Finally, this paper proposes an algorithm that optimally merges the different word segmentation results, based on the idea of maximum probability word segmentation. The algorithm achieves a recall rate of 82%, increases the word length, and gives good boundary matching with known functional words.

This paper also presents two applications of the DNA sequence word segmentation results: constructing characteristic dictionaries and finding co-occurrence words. The dictionary construction covers species characteristic dictionaries, dictionaries of features common between species, and species-specific dictionaries. These dictionaries provide a convenient basis for future DNA sequence analysis. Finding co-occurrence words plays a very important role in analyzing multiple co-regulated functional elements in DNA sequences.
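As background for the merging step, the following is a minimal sketch of generic maximum probability word segmentation, assuming a unigram word-probability table and a dynamic program over character positions. It is not the thesis's merging algorithm, whose details the abstract does not give; the function name, the `max_len` and `unk_prob` parameters, and the toy probabilities are hypothetical.

```python
import math


def max_prob_segment(seq, word_prob, max_len=8, unk_prob=1e-6):
    """Return the segmentation of seq with the highest product of word probabilities.

    word_prob maps known words to probabilities (assumed input).
    Unknown single characters fall back to unk_prob, so a path always exists.
    """
    n = len(seq)
    best = [-math.inf] * (n + 1)  # best log-probability of seq[:i]
    back = [0] * (n + 1)          # start index of the last word in the best path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = seq[start:end]
            if word in word_prob:
                p = word_prob[word]
            elif len(word) == 1:
                p = unk_prob
            else:
                continue  # unknown multi-character candidates are skipped
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the sequence to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(seq[start:end])
        end = start
    return words[::-1]


# Toy probabilities (made up for illustration, not taken from the thesis).
toy_prob = {"TATA": 0.2, "TA": 0.1, "AAGG": 0.15, "AA": 0.05, "GG": 0.05}
result = max_prob_segment("TATAAAGG", toy_prob)
```

With these toy values the longer candidates `TATA` and `AAGG` win over chains of shorter words, which illustrates how a maximum probability criterion can favor longer words when their probabilities are high enough.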
Keywords/Search Tags: DNA sequence word segmentation, CRFs, cross-language word segmentation, local specific area word segmentation, maximum probability word segmentation