Font Size: a A A

Reasearch On Biological Sequential Pattern Mining And Recognition

Posted on:2011-07-13Degree:DoctorType:Dissertation
Country:ChinaCandidate:W LiuFull Text:PDF
GTID:1118330338995795Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Bioinformatics is a new comprehensive cross discipline involving biology, mathematics, physics, informatics and computer science. It plays an important role in the development of the life science and becomes the frontier of life science research. The core issue in bioinformatics is genome informatics which includes the obtaining, processing, storing, assigning and explaining of the genome information. Using computers and network as tools, based on the mathematical theory, methods and technology, genome informatics studies the biopolymers include the sequences, structures and functions of DNA and protein. The key issue in genome informatics is to understand the meaning of the order in nucleotide sequences, namely, to understand the exact locations of the genes in the chromosome and the functions of the DNA segments. These are very important in the research of disease gene of human being, the expression and function of gene and protein and the designing of pharmacy. To achieve those goals, pattern mining and recognition in biological data are two key techniques.Aimed at the key issues about the research on pattern mining and recognition in biological data, we mainly study on single and multiple biosequence mining, embedded frequent subtree mining on biological data, gene regulator binding sites identification and discrimination of CpG islands location. The achievements of this thesis are summarized as follows:(1) Present fast algorithms for single and multiple biosequence mining. Traditional mining frequent pattern algorithms frequently construct projected databases and generate lots of patterns with short length in the process of mining which cause low efficiency of mining. In order to overcome the shortcomings of the traditional algorithms, we respectively present algorithms SSPM and MSPM in this thesis. We use longer patterns for mining so as to avoid producing lots of patterns with short length. Meanwhile by the use of prefix tree of primary frequent patterns, we extend primary patterns which can avoid plenty of irrelevant patterns. The experimental results show that our algorithms not only improve the performance but also achieve effective mining results.(2) Present an efficient algorithm for mining frequent embedded subtrees in biological data. In addition, we apply this algorithm to mine frequent topological patterns of RNA molecule. The key factors which influence the performance of biological data mining approaches are the large-scale of biological data and the high similarities among patterns mined. In this thesis, we put forward an efficient algorithm named IRTM for mining frequent embedded subtrees in biological data. The algorithm advances a string encoding method for representing trees, and a scope-list for extending all substrings for frequency test. The IRTM algorithm adopts vertically mining approach, and uses some pruning techniques to further reduce the computational time and space cost. Experimental results show that IRTM algorithm can achieve significantly performance improvement over the classical works such as algorithms Patternmatcher and TreeMiner.(3) Present a gene regulatorary elements recognition algorithm based on ant-colony optimization. Most of current regulatorary elements recognition algorithms are easily to converge into a local optimum, and have high time complexity. Therefore, we propose a novel optimization algorithm named ACRR(ant-colony-regulatory-recognition) for regulatory elements recognition. Based on powerful optimization ability of ant-colony algorithm, the algorithm ACRR can solve the problem in a very high speed. Experimental results show that compared to other traditional algorithms, ACRR can have not only higher quality of solutions but also less computational complexity.(4) Present a novel method for CpG islands location identification based on the model of conditional random fields. In order to overcome the shortcomings of existing models such as the strong independence assumptions which generative model must have, the label-bias problem exhibited by maximum entropy markov model and other non-generative models, the method transforms the problem of CpG islands location identification into sequential data labeling. Based on the characters of CpG islands location, we design the corresponding feature functions and obtain the weights from the joint distribution over the label sequence given observation through a learning procedure on training data. Then according to the distribution model obtained, we can determine the labeled sequence with maximum probability and thereby identify the location of CpG islands. Experimental results on benchmark data sets show that our algorithm is more practicable and efficient than the method of HMM.
Keywords/Search Tags:data mining, bioinformatics, pattern mining, pattern recognition
PDF Full Text Request
Related items