
Application Of Improved Biclustering Method To Cancer Gene Expression Data

Posted on: 2010-11-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z B Cao
GTID: 2178360272497574
Subject: Computer application technology
Abstract/Summary:
With the development of biotechnology and bioinformatics, DNA microarray experiments have become one of the most important tools for analyzing gene expression, and gene chip technology is approaching maturity. Gene chip expression data are mRNA abundances measured for thousands of genes across different samples, conditions and tissues. Through hybridization, chip scanning and data analysis, chip experiments measure gene expression, uncover hidden biological information by mathematical and statistical methods, and ultimately allow gene function and biological characteristics to be inferred.

As high-throughput molecular biotechnology has developed, more and more researchers have devoted themselves to cancer gene chip studies. In recent years the incidence of cancer has risen continuously, and cancer has become a serious threat to health and daily life. Scientists worldwide have been fighting it for a long time. In 2008, scientists found a great number of new genes related to the occurrence and development of cancer through whole-genome DNA sequencing analysis. This area has recently become a focus of bioinformatics.

Clustering analysis is a common method for processing DNA microarray expression data. With the rapid development of biology and biotechnology, the accumulation of biological data has accelerated exponentially, and the shortcomings of relying only on traditional clustering methods have gradually been exposed:

1) Traditional clustering methods cannot efficiently discover patterns that associate a subset of the genes with a subset of the conditions. They either group the genes according to their expression under multiple conditions, or group the conditions according to the expression of a number of genes, so they can only find global information.
2) Traditional clustering algorithms partition genes or samples into separate clusters that do not overlap. When clustering genes, for example, one gene cannot appear in two or more clusters at the same time, which greatly restricts the ability to identify multi-function genes.

Biclustering offers a good solution to these problems. In 2000, Cheng & Church introduced biclustering to the analysis of gene expression data. Unlike traditional clustering methods, it clusters the genes and the conditions of the expression data matrix simultaneously rather than separately. The result is a sub-matrix made up of a subset of the genes and a subset of the conditions, from which the local information concealed in the data can be found. It gave good results when applied to gene expression data, and biclustering algorithms now play a very important role in gene expression analysis.

In this paper, after investigating traditional clustering algorithms and biclustering algorithms, we use an improved biclustering algorithm to analyze cancer gene expression data.

1. The Improved Cheng-Church Algorithm

The Cheng-Church algorithm is one of the earliest biclustering methods for finding coherence among the genes and conditions of gene expression data. It is a node deletion and addition algorithm that uses a mean squared residue score to measure the similarity of the genes and conditions in a data matrix. It finds one bicluster at a time. To find a new bicluster, it must replace the bicluster already obtained with random values, which makes the subsequent biclustering results inaccurate.
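The mean squared residue score mentioned above can be illustrated with a minimal sketch. It follows the standard Cheng & Church definition, in which the residue of an entry is its value minus its row mean, minus its column mean, plus the overall mean of the sub-matrix. The function name, the list-of-lists representation and the toy matrix are assumptions made for illustration only; they are not taken from the thesis.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue H(I, J) of the sub-matrix selected by rows and cols.

    matrix is a list of lists (genes x conditions); rows and cols are index lists.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Residue: how far the entry departs from an additive row + column pattern.
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical toy data: the first three genes follow the same additive pattern,
# the fourth does not.
toy = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [4.0, 5.0, 6.0],
    [9.0, 0.0, 5.0],
]
coherent_score = mean_squared_residue(toy, [0, 1, 2], [0, 1, 2])   # close to 0
mixed_score = mean_squared_residue(toy, [0, 1, 2, 3], [0, 1, 2])   # clearly larger
```

A low score means the selected genes rise and fall together across the selected conditions. Node deletion removes the rows or columns that contribute most to the score until it falls below a chosen threshold.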
In this paper, we improve the classical Cheng-Church algorithm by adding a combination and extension step for the obtained biclusters, which reduces the influence of the randomization and gives much better biclustering results.

2. Cancer Gene Expression Data Preprocessing

1) Gene expression data collection

First, we download gene expression data for cancers and other diseases from public, professional web sites such as GEO, SMD and Oncomine. We sort the data by disease or tissue, obtaining hundreds of gene expression datasets, and divide the samples into two classes, tumor and normal, for further analysis. The data used in this paper is GDS2545 from GEO, a prostate cancer gene expression dataset covering metastatic and primary prostate cancer.

2) Data preprocessing

Missing value estimation. First, we decide whether to retain a gene according to its rate of missing data. Second, we estimate the missing values using the local least squares imputation method. This yields complete data without any missing values.

Gene expression data normalization. The purpose of normalization is to make data from different gene chips comparable. We use a method based on the distribution curve of the normal samples. First, we assume that the normal samples follow a normal distribution. Second, we calculate the normal distribution curve of the normal samples and obtain a standard value for every sample from its area under that curve.

3) Gene selection

The number of genes is far larger than the number of samples, and many genes are non-specific or irrelevant for distinguishing the samples. To address this, we use the t-test and SVM-RFE methods to select feature genes, which then form the data for further use.

3. Application of Biclustering

1) Evaluating the algorithm on simulated data

In our experiments, we test the improved algorithm and compare its biclustering results with those of the Cheng-Church algorithm as implemented in the publicly available software BicAT. The experimental results show that the proposed method improves on the biclustering results of the existing method and finds better patterns in the data.

2) Biclustering analysis of cancer gene expression data

After data preprocessing and gene selection, we process the cancer gene expression data as follows. First, we run biclustering on the datasets with different parameters and collect the results. Next, we find the genes common to the dataset and the prostate cancer pathway, and sort the biclustering results by the number of common genes. We then vote on the genes in the higher-ranked biclustering results and select genes according to the votes. Finally, we analyze the functions of the selected genes.

We use the improved Cheng-Church algorithm to analyze the GDS2545 prostate cancer gene expression data from the GEO database, and examine the genes related to prostate cancer by comparing them with the genes in the prostate cancer pathway of the KEGG database. The results show that the improved algorithm is effective on gene expression data.
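The ranking and voting steps described above might be sketched as follows. This is one possible reading rather than the procedure actually used in the thesis: it assumes each bicluster is reduced to its set of gene names, that "common genes" means the overlap between a bicluster and the pathway gene set, and that a gene is selected once it appears in enough of the top-ranked biclusters. The function name, the parameters `top_k` and `min_votes`, and the gene labels are all hypothetical.

```python
from collections import Counter


def rank_and_vote(biclusters, pathway_genes, top_k=2, min_votes=2):
    """Rank biclusters by pathway overlap, then vote on genes in the top ones.

    biclusters: list of sets of gene names (one set per bicluster).
    pathway_genes: set of gene names in the reference pathway.
    top_k, min_votes: illustrative thresholds, not values from the thesis.
    """
    # Biclusters sharing more genes with the pathway come first.
    ranked = sorted(biclusters,
                    key=lambda genes: len(genes & pathway_genes),
                    reverse=True)
    # Each top-ranked bicluster casts one vote for every gene it contains.
    votes = Counter(gene for genes in ranked[:top_k] for gene in genes)
    selected = sorted(gene for gene, count in votes.items() if count >= min_votes)
    return ranked, selected


# Placeholder gene labels, not real measurements.
example_biclusters = [
    {"GENE_A", "GENE_B", "GENE_C"},
    {"GENE_B", "GENE_C", "GENE_D"},
    {"GENE_E", "GENE_F"},
]
example_pathway = {"GENE_B", "GENE_C", "GENE_D"}
ranked, selected = rank_and_vote(example_biclusters, example_pathway)
```

In this toy case the second bicluster ranks first because it shares three genes with the pathway, and the genes appearing in both of the top two biclusters are selected. In practice the thresholds would have to be tuned to the number of biclusters produced under the different parameter settings.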
Keywords/Search Tags: Cheng-Church Algorithm, Biclustering, Random Process, Feature Gene Selection, Cancer Gene Expression Data