
Design And Implementation Of DNA Word Segmentation And Semantic Analysis System

Posted on: 2015-03-24
Degree: Master
Type: Thesis
Country: China
Candidate: J W Zhang
GTID: 2298330422492273
Subject: Computer technology
Abstract/Summary:
With the continuing development of bioinformatics, a growing number of researchers have found, through a variety of statistical methods, that DNA sequences, which carry human genetic information, exhibit many linguistic features and share numerous similarities with human natural languages. Studying DNA sequences by traditional biological means raises well-known problems: the work is time-consuming, costly, and procedurally complicated. Natural Language Processing techniques can address these problems and also offer a new direction for a deeper understanding of the semantic information in DNA.

This paper builds a DNA word segmentation and semantic analysis system that attempts to process DNA sequences with Natural Language Processing techniques. Its functional structure is divided into two parts: word segmentation and text similarity calculation.

For word segmentation, this paper proposes a voting-based method that applies multiple segmentation models with multiple features. The approach gives segmentation decisions a stronger basis by using a variety of features, including boundary entropy, distance entropy, normalized clustering measure, and the z-score in a Markov model. The method implements several segmentation models: Conditional Random Fields, Support Vector Machine, and the Maximum Weight Path model proposed in this paper. During machine learning, the approach optimizes the corpus by processing only the DNA sequence around Transcription Factor Binding Sites, which is expected to exclude much of the noise in DNA sequences and to improve the learning results.
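As an illustration of one of the features named above, the following is a minimal sketch of a boundary entropy computation. The abstract does not define the feature, so this assumes a common formulation: for each k-mer, the Shannon entropy of the base that immediately follows it in the sequence. The function name `right_boundary_entropy` and the parameter `k` are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter, defaultdict


def right_boundary_entropy(sequence, k):
    """Map each k-mer to the Shannon entropy (in bits) of the base that follows it.

    Assumed formulation, not the thesis's definition: a high value means many
    different bases follow the k-mer, which hints at a possible word boundary.
    """
    followers = defaultdict(Counter)
    # Count which base follows each occurrence of every k-mer.
    for i in range(len(sequence) - k):
        followers[sequence[i:i + k]][sequence[i + k]] += 1

    entropy = {}
    for kmer, counts in followers.items():
        total = sum(counts.values())
        entropy[kmer] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


# Toy usage: "AC" is always followed by "A", "CA" by "C" or "G" equally often.
scores = right_boundary_entropy("ACACAG", 2)
```

In this toy sequence, "CA" has a more varied right context than "AC", so under this assumed formulation a boundary after "CA" would be more plausible than one after "AC".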
Finally, this method combines the results of the different models and obtains an optimal segmentation result through a multiple-voting strategy, raising the recall rate of DNA word segmentation to 82.7%, which is better than the CRF segmentation method alone.

For the semantic analysis of DNA, this paper first builds a corpus that includes several groups of genes with similar functions and several groups of random DNA sequences. It then segments the sequences and obtains the Word Orders that occur more frequently in functional genes than in non-functional DNA sequences. Finally, the system can effectively calculate the semantic text similarity of DNA sequences with different functions, thereby meeting the basic needs of semantic analysis of different DNA sequences.

In summary, this paper designs and implements a DNA segmentation and semantic analysis system based on a detailed analysis of DNA sequences, and evaluates the system's results in detail. The experimental results show that the system performs well in word segmentation and semantic analysis and that it could be an effective tool to help researchers understand the genetic information of DNA sequences more deeply.
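The idea of combining several segmentation models by voting can be sketched as follows. The abstract does not spell out the voting rule, so this assumes the simplest variant: each model proposes a set of cut positions, and a cut is kept when a strict majority of models propose it. The names `vote_boundaries` and `split_at`, and the three example boundary sets, are hypothetical and for illustration only.

```python
from collections import Counter


def vote_boundaries(model_boundaries, min_votes=None):
    """Keep the cut positions proposed by at least min_votes models.

    model_boundaries: one collection of cut positions per model.
    Assumption: strict majority by default; the thesis may use another rule.
    """
    if min_votes is None:
        min_votes = len(model_boundaries) // 2 + 1
    votes = Counter(pos for cuts in model_boundaries for pos in set(cuts))
    return sorted(pos for pos, n in votes.items() if n >= min_votes)


def split_at(sequence, boundaries):
    """Split a sequence into words at the given sorted cut positions."""
    words, start = [], 0
    for pos in boundaries:
        words.append(sequence[start:pos])
        start = pos
    words.append(sequence[start:])
    return words


# Hypothetical outputs of three models (for example CRF, SVM, Maximum Weight Path).
crf_cuts, svm_cuts, mwp_cuts = {3, 5}, {3, 6}, {3, 5, 7}
agreed = vote_boundaries([crf_cuts, svm_cuts, mwp_cuts])
words = split_at("ACGTACGT", agreed)
```

Here positions 3 and 5 receive at least two of three votes and are kept, while 6 and 7 are dropped. Lowering `min_votes` would tend to favor recall over precision, since more proposed boundaries survive.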
Keywords/Search Tags: DNA word segmentation, voting strategy, text similarity, word order factor