Frequency Distribution Of Biological Subsequence And Classification Model For Tumor Subtype

Posted on:2008-08-08

Degree:Doctor

Type:Dissertation

Country:China

Candidate:S L Wang

Full Text:PDF

GTID:1118360242499344

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

The exponential growth of the cumulative biological data has attracted a number of scientists to be engaged on the study of bioinformatics which has become the focus of world's attention. The two basic problems on bioinformatics are investigated in this dissertation. One is the occurrence frequency distribution of k-mers in whole chromosome, and another is the molecular diagnosis of tumor based on gene expression profiles. The satisfactory experiment results have been achieved from the study on the two problems, which has a profound implication theoretically and realistically and which is helpful to the development of bioinformatics. The visual and fractal representation of DNA sequences, the occurrence frequency distribution of k-mers in DNA sequences and the counting algorithm for k-mers in DNA sequences are deeply investigated to the first basic problem. The feature extraction and informative gene selection for tumor classification are studied on the second problem.The visual representation of DNA sequences plays a profound role in DNA structure and its function, which is especially helpful to the recognition of repeatitive subsequences, the recognition of intron and exon, and the evolution of DNA sequences, etc. Firstly, the two methods including Hao's method and the classic chaos game representation method that can generate the fractal image of DNA sequences are introduced, and further the comparison and contrast between the two methods are provided in details. On the basis of these work, the palindromes are discussed in forbidden k-mers. Secondly, after the mathematical principle of the iterated function system generating fractal attractor is introduced, the chaos automata is defined according to the Moore finite state machine together with the iterated function system; and the method of chaos automata generating fractal image driven by DNA sequences is investigated deeply and further. At last, the method of generating fractal image driven by codon sequences in DNA sequences is proposed and investigated, and further many unsolved problem are presented.Furthermore, on the basis of Hao's method which allows the depiction of the occurrence frequency of k-mers in the form of fractal image, a novel method that can generate 3D occurrence frequency distribution map (OFDM) of k-mers in genome is proposed, and its advantage is that the difference in occurrence frequency of k-mers is obviously exhibited for biologists. Then the partition criterion for occurrence frequency segment is proposed according to the 1D logarithm histogram into which is transformed from 3D OFDM. The 1D histogram can show the local feature of the occurrence frequency distribution (OFD) of k-mers, i.e. the OFD of k-mers in ultrahigh frequency segment appears discontinuous in integer. The palindromes in forbidden k-mers are further studied in forbidden segment. The phenomena of n-order zero interval (n-OZI) in ultrahigh frequency segment is deeply investigated. Moreover, it is proposed that the distribution of n-OZI is the mark in the process of genome evolving and many features of the logarithm histogram of occurrence frequency are successfully explained from the view of biology. On the basis of many experiments, it is discovered and validated that the OFD of k-mers is subjected to non-central F distribution. Adopting several non-central F distributions can fit the density distribution of the occurrence frequency of k-mers in genome which has the same number of peaks. On the basis of experiments, the comparison between non-central F distribution and Gamma distribution which was proposed to fit genome distribution by Hsieh and Luo is studied through experiments. Due to the complement of the two distributions in fitting the density distribution of genome, a new probability distribution which combines non-central F distribution with Gamma distribution is presented, and experiments show that the new distribution is better than any single of the two distributions in fitting genome density distribution. After the relationship between the maximal frequency of k-mers in genome and the length of k-mers and the relationship between the number of different k-mers which occur only once in genome and the length of k-mers are deeply investigated, and it is discovered that the two relationships among many species are consistent each other, which are the evidences on the neutral evolution theory of genome.The problem that all k-mers in whole genome are counted simultaneously is researched, which is not an easy task because the size of the natural genome is too great. Therefore, the internal and external counting algorithms which count all k-mers in genome are designed and implemented. These algorithms firstly translate the problem of counting all k-mers into the problem of counting integer keys by the help of a hash function which maps a k-mer into an integer, so we can apply the classic B-tree algorithm to solve the problem of counting all k-mers in genome. At last, three measures are proposed to further improve the efficiency of the algorithm according to the features of DNA sequences.The tumor diagnosis method based on gene expression profiles will be developed into the fast and effective method in clinical domain in the near future. Although DNA microarray experiments provide us with huge amount of gene expression data, only a few of genes relate to tumor among the gene expression profiles. Moreover, it is a challenging task to extract feature or select informative genes related to tumor from gene expression profiles because of its characteristics such as the high dimensionality, the small sample set and many noises and redundancy in gene expression profiles. Therefore, the molecular diagnosis of tumor has been broadly and deeply investigated and a large number of relevant papers are published. At first we detailedly introduce the techniques and methods in tumor classification process model, and then the classification method for the tumor classification process model is proposed. At last the research results in tumor classification are summarized in the past several years, and the problem such as how to rationally assess the results of tumor classification and the further research on tumor classification are presented.After analyzing those tumor classification methods, we have designed six feature extraction approaches to extract the classification feature of tumor based on gene expression profiles. The six approaches are principal component analysis (PCA), factor analysis (FA), independent component analysis (ICA), wavelet package decomposition (WPD), PCA method based on discrete cosine transform (DCT) and PCA based on discrete Fourier transform (DFT) which are assessed by experiments on two well-known datasets which are the colon dataset and the leukemia dataset. The experiments with support vector machines (SVM) show that the six approaches not only can obtain good classification performance but also have their own characteristics. The proposed six approaches can extract a small quantity of components which are validated classification features related to tumor when retaining higher recognition rate; and the results of classification can be visualized in the form of 2D or 3D scatter plot. Among the six approaches, the PCA method based on DCT is the best method on reducing dimensionality and can achieve the highest classification accuracy; the cross-validation accuracy of 96.77% has been achieved for colon dataset using SVM classifier and 100% for leukemia dataset also. To FA and ICA, experiment results indicate that only several components can obtain higher classification performance. Therefore, we can deduce that only several genes (i.e. 3 or 4) in gene expression profiles can achieve higher classification accuracy, which is the basis of our further designing informative gene selection algorithm.The informative gene selection is need despite that the proposed feature extraction methods are effective in tumor classification. However, the accurate classification of tumor by selecting the tumor-related genes from thousands of genes is a difficulty task due to the large number of redundant genes, and usually it is impossible to apply an exhaustive algorithm to search informative gene subset in such large gene space. Therefore, two kinds of approximate algorithm which can search informative gene subsets are proposed and implemented.One is applying attribute reduction method based on classical rough set model and neighborhood rough set model to searching informative gene subsets. The classification accuracy rate of informative gene subsets obtained by using attribute reduction method based on classical rough set model is not high because there is information loss caused by discretization of gene expression profiles. To avoid this problem we design eleven gene selection methods based on the FARNeM (forward attribute reduction based on neighborhood model) algorithm to classify tumor sample set, which are proved effectively and quickly in searching informative gene subset. To improve the classification accuracy of the imbalanced tumor dataset, a weighted neighborhood classifier based on the neighborhood classifier is proposed, which is proved more effectively to tumor dataset with complex boundary between every two subtypes, and we introduce feature algorithm Simba (iterative search margin based algorithm) to the tumor classification domain based on gene selection, which also adapts to multi-class tumor dataset. The probability neural network ensemble based on the neighborhood rough set model is proposed to classify tumor dataset to obtain more reliable experiment results.Another method is the novel heuristic breadth-first search algorithm (HBSA) based on SVM classifier which can simultaneously find many informative gene subsets in which the number of informative genes is almost least but its classification accuracy is almost highest in spite of its characteristic of time-consuming. Experiments on three tumor sample sets show that the proposed approach is feasible and effective. The 4-fold cross-validation accuracy of 100% has been achieved by only two genes for leukemia dataset (14 2-gene subsets are found totally like this) and 100% by only four genes for colon dataset (7 4-gene subsets are found totally like this), which are superior to the results of other classification methods, and 100% by only four genes for SRBCT (Small Round Blue Cells Tumor) dataset (504 4-gene subsets are found totally like this). Experiment results are consistent with our prediction assumption. Compared with other tumor classification methods, our experiment results are obviously superior. To reflect the true classification accuracy rate of classifier, we proposed a full-fold cross-validated method which can eliminate the affect of the different partition for tumor sample set and can more objectively evaluate classification model.

Keywords/Search Tags:

DNA sequence, frequency distribution, tumor subtype classification, gene expression profiles, rough set model

PDF Full Text Request

Related items

1	Selected Based On The Gene Expression Profiles Of Tumor Characteristic Gene Studies
2	The Research On Gene Expression Profiles Data For Tumor Classification
3	Non-negative Matrix Factorization Methods Of Gene Expression Profile Analysis For Tumor Subtype Clustering
4	The Application Research Of Support Vector Machine In Non-spherical Distribution Data Set And Tumor Gene
5	Tumor Gene Identification Study On Support Vector Machine Classification Model
6	A Study Of Tumor Classification Algorithms Using Gene Expression Data
7	Tumor Classification Based On Gene Expression Studies
8	The Research On Gene Classification Algorithms Based On MicroArray
9	Identifying Protein Complexes By Integrating Gene Expression Profiles And PPI Networks
10	Algorithm Of Gene Expression Data Classification And Its Application