The Research On Biological Sequential Pattern Mining And Clustering

Posted on: 2008-01-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Xiong
Full Text: PDF
GTID: 1118360242973001
Subject: Computer software and theory
Abstract/Summary:
Bioinformatics is a multidisciplinary research field that extracts knowledge and principles from biological data by applying mathematics, computer science, and life science. It is a topic of active current research. Data mining is an important technique for discovering common rules in large volumes of data, and it is one of the most powerful data analysis techniques in computer science. It has also become the main data analysis technique adopted in bioinformatics. Sequence data is a critical kind of biological data, and research on it is an active area of bioinformatics. The key problem in biological sequence data mining is how to design effective mining algorithms, and the difficulty has two aspects. On the one hand, it is difficult to give appropriate biological explanations for the mining results, because existing algorithms do not take the relevant domain knowledge into account. On the other hand, current sequence data mining algorithms cannot work efficiently on large-scale biological sequence data directly, because of the specific characteristics of such data.

The main objective of biological sequence data mining is to identify functional elements, investigate relationships among biological sequences, and so on. Biological sequential pattern mining and clustering are two important research topics in biological sequence data mining. Sequential pattern mining is a key technique for identifying genes and functional elements and thereby understanding sequence functions. Sequential patterns can be used to describe the features of a sequence, on which a similarity measure between sequences can be built. Sequential pattern mining is also the basis of association analysis. Biological sequence clustering is a primary method for investigating relationships among sequences and thereby interpreting evolutionary relationships. Its results are clusters of sequences that share common characteristics.
Furthermore, the precision of sequential pattern mining can be improved when it is applied to such clusters. Biological sequence clustering is also a preprocessing step for biological sequence classification and outlier analysis. Sequential pattern mining and clustering therefore play important roles in biological sequence data mining.

To improve existing biological sequential pattern mining and clustering techniques, this thesis mainly studies their effectiveness and efficiency. To address existing problems, we present several effective measures and algorithms that meet various demands. In addition, we discuss how to improve efficiency when handling biological sequence data from the perspectives of representation and storage, and we give a novel biological sequence data model. Finally, we apply the above methods to implement a transcription regulation sequence data mining system. The achievements of this thesis are summarized as follows:

(1) Propose a multi-supports measure for biological sequential pattern mining, and design a corresponding sequential pattern mining algorithm

Current sequential pattern mining algorithms define support as the number (or percentage) of sequences that contain a pattern, without considering how often the pattern occurs within each sequence. As a result, some biologically significant results cannot be mined. In this thesis, we investigate the problem of biological sequential pattern mining and present a multi-supports measure that includes distribution, location, and global support. On this basis, we design a mining algorithm, BioPM. BioPM mines sequential patterns according to various combinations of the above supports, so that the results meet various application demands, including conserved sequential pattern mining, repetitive sequential pattern mining, and combination sequential pattern mining.
Experimental results on real data sets demonstrate that BioPM runs in much less time than previous algorithms and that the resulting sequential patterns are more acceptable to biologists.

(2) Present a similarity measure function for protein sequences, and design a corresponding clustering algorithm based on it

Biological sequential patterns can be used to describe the features of sequences and can serve as the foundation for designing a similarity measure. However, clustering quality may suffer because existing methods do not consider both the whole and the partial characteristics of the sequences. In this thesis, we study the problem of protein sequence clustering and present a similarity measure function for protein sequences, Bio_Sim(), on which we base a clustering algorithm, ProFaM. ProFaM uses the multi-supports-based sequential pattern mining algorithm to extract patterns that capture the characteristics of a protein sequence (whole and partial), and then constructs the similarity measure function Bio_Sim(). The clustering process adopts the shared nearest neighbor method. Unlike traditional measures, which assume that homologous segments are conserved adjacently, ProFaM based on Bio_Sim() can express genetic recombination and gives a better explanation of the characteristics of a protein family. The experimental results show that ProFaM can be applied well to protein family analysis.

(3) Present a similarity measure function for gene sequences, and design a corresponding clustering algorithm based on it

Because gene sequences and protein sequences have different features, their clustering demands may differ. Recent research shows that sequences may have different functions even when they are similar. Existing clustering methods that use sequence information alone are therefore likely to be invalid.
In this thesis, we study the problem of co-expressed gene clustering and present a similarity measure for gene sequences called 'N-Same Dimensional Tendency Similar', based on the co-expression characteristics among gene sequences. We also design a clustering algorithm, Gen-Cluster, to obtain N-same dimensional tendency clusters, i.e., clusters of co-expressed gene sequences. Compared with other gene sequence clustering methods that use sequence information alone, N-same dimensional tendency clusters give a better explanation of gene sequence functions. Experiments show that Gen-Cluster improves performance and produces satisfactory results.

(4) Present a novel biological sequence data model, BioSeg

The way biological sequence data is represented and stored is critical for accessing and processing it. The existing practice of storing sequences as text is one of the major reasons for the low efficiency of biological sequence processing. In this thesis, we study the problem of managing and querying biological sequence data. We present a novel biological sequence data model and give the corresponding operation algebra. Querying on BioSeg is more efficient and feasible than on the previous text-based storage.

(5) Design and implement a transcription regulation sequence data mining system, TBMiner

The study of transcription regulation is a topic of current research in post-genomics. Sequential pattern mining and clustering are important for predicting cis-regulatory elements (transcription factor binding sites). In this thesis, we apply the above sequential pattern mining and clustering algorithms to the prediction of cis-regulatory elements. We design and implement a transcription regulation sequence data mining system, TBMiner, which gives biologists a useful bioinformatics tool for studying the rules of transcription regulation.
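
The limitation noted under contribution (1), that conventional support counts only how many sequences contain a pattern and ignores how often it occurs within each sequence, can be illustrated with a minimal Python sketch. This is not BioPM and does not reproduce the thesis's definitions of distribution, location, or global support. The function names, the dictionary keys, the restriction to contiguous patterns, and the toy sequences are all assumptions made for illustration.

```python
from typing import Dict, List


def count_occurrences(sequence: str, pattern: str) -> int:
    """Count occurrences of a contiguous pattern, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = sequence.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are also counted.
        start = sequence.find(pattern, start + 1)
    return count


def support_summary(sequences: List[str], pattern: str) -> Dict[str, object]:
    """Contrast sequence-level support with within-sequence frequencies.

    All keys are hypothetical names chosen for this sketch:
    - sequence_support: number of sequences containing the pattern
      (the conventional definition of support)
    - sequence_support_ratio: the same value as a fraction of all sequences
    - per_sequence_counts: occurrences of the pattern in each sequence
    - total_occurrences: occurrences summed over all sequences
    """
    per_sequence = [count_occurrences(s, pattern) for s in sequences]
    containing = sum(1 for c in per_sequence if c > 0)
    ratio = containing / len(sequences) if sequences else 0.0
    return {
        "sequence_support": containing,
        "sequence_support_ratio": ratio,
        "per_sequence_counts": per_sequence,
        "total_occurrences": sum(per_sequence),
    }


if __name__ == "__main__":
    # Toy DNA strings, not real data.
    toy_sequences = ["ATATAT", "GGATCC", "CCCGGG"]
    summary = support_summary(toy_sequences, "AT")
    # "AT" is contained in 2 of the 3 sequences, but occurs 3 times in the
    # first sequence and once in the second, which sequence-level support hides.
    print(summary)
```

In this toy input, a repetitive pattern and a pattern that occurs once per sequence receive the same conventional support. That is the kind of distinction a frequency-aware measure is meant to recover.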
Keywords/Search Tags: data mining, sequential pattern, clustering, biological sequence, bioinformatics