
Study On Some Data Mining Methods For Biological Information And Their Application

Posted on: 2006-08-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Song
Full Text: PDF
GTID: 1118360152485500
Subject: Operational Research and Cybernetics
Abstract/Summary:
The sequencing of several genomes, including the human genome, has produced a vast amount of data that must be exploited. Bioinformatics is essentially the science of undertaking this task. In bioinformatics, researchers study how to capture, manage, store, retrieve, and analyze biological information so as to enable the discovery of comprehensive biological knowledge. Data mining technology, which extracts latent and useful information from databases, is playing an increasingly important role in bioinformatics research and has yielded fruitful results. This paper investigates some data mining methods for bioinformatics and their application. The main work is summarized as follows:

1. Both support vector machine (SVM) and FDOD methods are applied to the classification of homo-oligomeric proteins. Garian R used a decision tree method to discriminate between homodimers and non-homodimers from the primary structure and showed that the protein primary sequence contains quaternary structure information. In the present work, SVM and FDOD methods are applied to discriminating between homodimers and non-homodimers, with the subsequence distributions of the training and test protein primary sequences serving as input vectors. On the same data set, the classification results of the two methods are much better than those of the previous method. The two methods are also applied to discriminating among homodimers, homotrimers, homotetramers and homohexamers from the protein primary structure, and the results are also good.

2. A new ν-SVM classifier based on linear programming is proposed. Compared to the regular SVM, the ν-SVM classifier proposed by Scholkopf B has the advantage of controlling the numbers of support vectors and errors. However, its formulation is more complicated, which limits its applications. We present a new and simpler ν-SVM classifier based on linear programming, in which the parameter ν also implicitly controls the numbers of support vectors and errors. Furthermore, efficient linear programming solvers that are already available can be used. Numerical tests show that our ν-SVM based on linear programming is much faster than the original ν-SVM and performs comparably in accuracy.

3. A Newton method for the parameterless robust linear programming support vector machine is presented. The parameterless robust linear programming support vector machine for classification, recently proposed by Mangasarian O L, resolves the issue of determining the size of the regularization parameter. We discuss the least 2-norm solution of the parameterless linear programming problem and then present a fast Newton method. The algorithm requires only a linear equation solver. The theory, numerical tests and an application to gene expression data for cancer classification demonstrate that it is simple, fast and easily accessible.

4. FDOD is applied to the analysis of similarities of DNA sequences. Sequence comparison, one of the most common means of study in bioinformatics, aims at analyzing the similarity and dissimilarity of DNA sequences. It depends mainly on sequence alignment, which has some shortcomings, so researchers have tried to develop new methods. Here FDOD is used to analyze the similarities of DNA sequences from the primary structure. The effect of residue order along the sequence is taken into account to some extent. For different subsequence lengths, the approach is illustrated by examining the similarities among the coding sequences of the first exon of the β-globin gene of 11 different species. The results demonstrate that the FDOD method is effective.

5. A novel 2-D graphical representation of DNA sequences and a corresponding numerical characterization approach are proposed and then applied to examining the similarities of DNA sequences. Graphical representations of DNA sequences allow visual inspection of data and can facilitate the analysis, comparison and identification of such sequences. This paper considers a novel 2-D graphical representation of DNA sequences according to homomorphism in Algebra and chemical structure classification of...
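
The subsequence distributions used as input vectors in item 1 (and, for varying subsequence lengths, in item 4) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function name `subsequence_distribution`, the choice of overlapping windows, the alphabet constants, and the handling of unknown symbols are all assumptions made for illustration. It only shows how a sequence might be turned into a fixed-length frequency vector; the FDOD measure and the SVM training step are not reproduced here.

```python
from collections import Counter
from itertools import product

# Assumed alphabets for illustration: the 20 standard amino acids and the 4 DNA bases.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DNA_BASES = "ACGT"


def subsequence_distribution(sequence, k=2, alphabet=AMINO_ACIDS):
    """Return the relative frequencies of all length-k subsequences.

    The result is a list with len(alphabet) ** k entries, ordered
    lexicographically with respect to `alphabet`. Overlapping windows
    are counted; windows containing a symbol outside `alphabet` are
    skipped (an assumption, not something the abstract specifies).
    """
    sequence = sequence.upper()
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    valid = [w for w in windows if all(ch in alphabet for ch in w)]
    counts = Counter(valid)
    keys = ["".join(p) for p in product(alphabet, repeat=k)]
    total = len(valid)
    if total == 0:
        # Sequence shorter than k, or no valid window: return an all-zero vector.
        return [0.0] * len(keys)
    return [counts[key] / total for key in keys]


if __name__ == "__main__":
    # Toy DNA fragment, purely illustrative (not data from the dissertation).
    vec = subsequence_distribution("ACGTAC", k=2, alphabet=DNA_BASES)
    print(len(vec), vec[:4])
```

Because the vector length depends only on the alphabet size and `k`, sequences of different lengths map to vectors of the same dimension, which is what makes such distributions usable as classifier inputs or as the basis for an alignment-free comparison.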
Keywords/Search Tags: Bioinformatics, data mining, support vector machine, FDOD, protein, DNA, graphical representation