Research On Algorithms For The Cancer Differential Gene Expression In Gene Microarray

Posted on: 2010-08-09
Degree: Master
Type: Thesis
Country: China
Candidate: H W Tu
Full Text: PDF
GTID: 2178360272495846
Subject: Computer application technology
Abstract/Summary:
At the end of the 1980s, as data from the Human Genome Project grew rapidly, bioinformatics emerged as a new interdisciplinary research field. It draws on computer science, mathematics, information technology and biomedicine. It aims to reveal biological significance through the acquisition, processing, storage, distribution, analysis and interpretation of biological information, using mathematics, computer science and biology together. Bioinformatics plays a key role in finding human disease genes, in research on the expression and function of genes and proteins, in rational drug design, and so on.

Differential gene expression means that different cells express their genes in a temporal and spatial order at different stages during the ontogeny of higher organisms. Abnormal gene expression may be related, directly or indirectly, to the occurrence of cancer. Filtering and identifying differentially expressed genes is therefore of great significance for revealing the mechanisms of cancer at the gene level and for exploring effective methods of cancer treatment. With the development of genome sequencing, DNA microarray technology has gradually been developed to explore genome function. Outlier genes that may cause specific types of cancer can be detected with DNA microarray technology.

Classical methods for differential expression analysis include non-statistical fold-change analysis, the t-test assuming homoscedasticity, the t-test without the homoscedasticity assumption, SAM, which is not affected by small variances, and Bayesian models. Several new methods have recently been proposed, and they continue to improve the mining of gene chip data. Note that these methods are designed for the case in which all cancer tissue samples are over-expressed relative to the normal tissue samples.
However, when only a few of the cancer samples are over-expressed, these methods have a higher FDR.

In 2005, Tomlins et al. published a paper in Science on a new type of differential gene expression. It was the first time that a non-random, recurrent gene fusion had been found in a common epithelial solid tumor (such rearrangements were previously thought to occur primarily in sarcomas, leukemias and lymphomas). Not only might this rearrangement be responsible for the majority of prostate cancers but, because of the high incidence of this common tumor, it may actually represent the most common rearrangement in human cancers overall. This discovery will greatly encourage scientists to take another look at other cancers, which may exhibit similar rearrangements and may lead to new diagnostic tests or therapeutic targets. The discovery also raises a question: for a particular gene, not all cancer samples are over-expressed relative to the normal samples. Our work is mainly based on this significant discovery.

After the discovery, two kinds of analysis algorithms were proposed: 1. statistical methods based on COPA, including COPA, OS, ORT and MOST; 2. statistical methods based on change-point analysis, including LRS. The first family of algorithms builds on the t-statistic. The t-statistic is suited only to the case in which the cancer samples as a whole come from a distribution with a higher mean, whereas many cancer studies show that many genes are over-expressed in only a small fraction of the disease samples. Tomlins et al. (2005) therefore proposed COPA to address this problem. COPA replaces the mean and standard deviation in the t-statistic with the median and the median absolute deviation, and with this change it gives better results than the t-statistic.
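The contrast between the t-statistic and COPA can be illustrated with a minimal sketch. It assumes NumPy, an expression matrix with genes as rows and samples as columns, and the commonly used formulation of COPA (median/MAD standardization over all samples, followed by a high percentile of the standardized cancer values). The function names, the 90th-percentile default and the toy data are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

MAD_SCALE = 1.4826  # makes the MAD comparable to the standard deviation under normality


def t_statistic(normal, cancer):
    """Unequal-variance t-statistic per gene (rows are genes, columns are samples)."""
    mean_diff = cancer.mean(axis=1) - normal.mean(axis=1)
    std_err = np.sqrt(cancer.var(axis=1, ddof=1) / cancer.shape[1]
                      + normal.var(axis=1, ddof=1) / normal.shape[1])
    return mean_diff / std_err


def copa(normal, cancer, percentile=90):
    """COPA-style score: standardize each gene by the median and MAD of all
    samples, then take a high percentile of the standardized cancer values."""
    pooled = np.hstack([normal, cancer])
    med = np.median(pooled, axis=1, keepdims=True)
    mad = MAD_SCALE * np.median(np.abs(pooled - med), axis=1, keepdims=True)
    standardized_cancer = (cancer - med) / mad
    return np.percentile(standardized_cancer, percentile, axis=1)


# Toy data (assumption): 2 genes, 20 normal and 20 cancer samples.
rng = np.random.default_rng(0)
normal = rng.normal(size=(2, 20))
cancer = rng.normal(size=(2, 20))
cancer[0, :3] += 8.0  # gene 0: only 3 of the 20 cancer samples are over-expressed

t_scores = t_statistic(normal, cancer)
copa_scores = copa(normal, cancer)
```

Because the median and MAD are barely moved by a handful of extreme values, the few over-expressed samples of gene 0 stand out in `copa_scores`, while in `t_scores` the same samples inflate both the mean difference and the variance estimate.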
To bring additional expression values into the estimation of cancer sample outliers, Tibshirani and Hastie (2006) proposed a new statistic, OS. Since outliers are defined relative to the normal samples rather than the pooled samples, Wu (2007) improved OS and proposed a new method, ORT. Because the OS and ORT statistics define outliers with a simple rule from traditional mathematical statistics, Heng Lian (2008) proposed a statistic named MOST that considers all possible outlier threshold values. We analyze the above methods, especially MOST, and propose a new method, TMOST, which uses the tri-mean in place of the median in the previous formula and the tri-mean absolute deviation in place of the median absolute deviation. In tests on simulated data and real data, TMOST gives better results than the above methods.

Jianhua Hu (2008) offered a new perspective: existing methods try to identify outliers based on the quantiles of the gene expression profile across all samples, but the problem can instead be approached by detecting a change point in the distribution of gene expression intensities in the cancer group. On this basis, Jianhua Hu (2008) used the algorithm proposed by James et al. (1987) for boundary-crossing problems in sequences and proposed LRS, which gave better results. This paper uses the nonparametric statistical method for distribution change-point problems proposed by Zhiping Tan et al. (2000), treating each gene as an object in which to detect the existence of outliers. It then uses the resulting split into non-differentially expressed and differentially expressed samples to calculate a differential expression value for each gene with the t-statistic. This algorithm is named CPT.

In this paper, we test the two kinds of algorithms on simulated data and real data.
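The idea of swapping the median for the tri-mean can be sketched on the simpler outlier-sum (OS) statistic. This is not the thesis's TMOST formula, which builds on MOST and is not given in the abstract. The sketch assumes NumPy, Tukey's tri-mean (Q1 + 2 x median + Q3) / 4, and, as a labeled guess, a "tri-mean absolute deviation" defined as the tri-mean of the absolute deviations from the tri-mean. All function names and the toy data are hypothetical.

```python
import numpy as np


def median(x):
    """Row-wise median, kept as a column so it broadcasts over samples."""
    return np.median(x, axis=1, keepdims=True)


def trimean(x):
    """Row-wise Tukey tri-mean: (Q1 + 2 * median + Q3) / 4."""
    q1, q2, q3 = np.percentile(x, [25, 50, 75], axis=1, keepdims=True)
    return (q1 + 2.0 * q2 + q3) / 4.0


def outlier_sum(normal, cancer, location=median):
    """OS-style statistic per gene (rows are genes, columns are samples).

    With location=median this follows the usual outlier-sum recipe; with
    location=trimean the center and scale are replaced by tri-mean analogues
    (an assumption made for illustration only).
    """
    pooled = np.hstack([normal, cancer])
    center = location(pooled)
    # No normal-consistency constant is applied; it would not change rankings.
    scale = location(np.abs(pooled - center))
    z_pooled = (pooled - center) / scale
    z_cancer = (cancer - center) / scale
    q25, q75 = np.percentile(z_pooled, [25, 75], axis=1, keepdims=True)
    threshold = q75 + (q75 - q25)  # upper quartile plus one interquartile range
    # Sum the standardized cancer values that exceed the outlier threshold.
    return np.where(z_cancer > threshold, z_cancer, 0.0).sum(axis=1)


# Toy data (assumption): gene 0 has 3 over-expressed cancer samples out of 20.
rng = np.random.default_rng(0)
normal = rng.normal(size=(2, 20))
cancer = rng.normal(size=(2, 20))
cancer[0, :3] += 8.0

os_median = outlier_sum(normal, cancer, location=median)
os_trimean = outlier_sum(normal, cancer, location=trimean)
```

Passing the location estimator as a parameter keeps the two variants identical except for the center and scale, which is the only change the text describes.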
On the simulated data, ROC curve analysis shows that TMOST has higher sensitivity and specificity than COPA, OS, ORT and MOST, and FDR curve analysis shows that it has a lower false detection rate at the same positive rate. On real data, among the top 25 genes of each method, TMOST identifies 7 genes that have been shown to be associated with the development of breast cancer, while MOST identifies 4, ORT identifies 6, OS identifies 6 and COPA identifies 6.

The paper then compares the detection performance of CPT and LRS. On the simulated data, ROC curve analysis shows that CPT has higher sensitivity and specificity than LRS, and FDR curve analysis shows that it has a lower false detection rate at the same positive rate, once the number of differentially expressed samples reaches a certain level. On real data, among the top 25 genes of each method, CPT identifies 9 genes that have been shown to be associated with the development of breast cancer, while LRS identifies 9. It can therefore be concluded that this work has some theoretical and practical significance.

Future work is as follows: 1. establish a standard real data set against which the detection performance of the algorithms can be evaluated; 2. model this new form of differential expression theoretically and improve detection accuracy and detection capability; 3. propose new algorithms based on classical methods such as principal component analysis and Bayesian methods.
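The ROC and FDR comparisons on simulated data described above can be sketched as follows, assuming NumPy, one score per gene from any of the methods, and a known truth label per gene (available only for simulated data). The function names and the five-gene example are illustrative assumptions. Tied scores are not treated specially in this sketch.

```python
import numpy as np


def roc_fdr_curves(scores, is_true):
    """Sensitivity, specificity and FDR for every top-k cutoff of the ranking."""
    order = np.argsort(-scores)            # highest score first
    hits = is_true[order].astype(float)    # 1.0 where a ranked gene is truly differential
    true_pos = np.cumsum(hits)             # true positives among the top k
    false_pos = np.cumsum(1.0 - hits)      # false positives among the top k
    k = np.arange(1, len(scores) + 1)
    sensitivity = true_pos / hits.sum()
    specificity = 1.0 - false_pos / (len(scores) - hits.sum())
    fdr = false_pos / k
    return sensitivity, specificity, fdr


def auc(sensitivity, specificity):
    """Area under the ROC curve by the trapezoidal rule."""
    fpr = np.concatenate([[0.0], 1.0 - specificity])
    tpr = np.concatenate([[0.0], sensitivity])
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Toy example (assumption): five genes, the first two are truly differential.
scores = np.array([0.9, 0.8, 0.3, 0.7, 0.1])
truth = np.array([True, True, False, False, False])

sens, spec, fdr = roc_fdr_curves(scores, truth)
area = auc(sens, spec)
```

Comparing two methods then amounts to computing these curves from each method's scores on the same simulated data set and comparing the area under the ROC curve and the FDR at matching cutoffs.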
Keywords/Search Tags: Bioinformatics, Cancer, Gene Microarray Technology, Change Point, COPA, Differential Expression Gene