Research On Analysis Of Gene Expression Profile Data In Bioinformatics

Posted on: 2009-08-30
Degree: Master
Type: Thesis
Country: China
Candidate: C Zhang
Full Text: PDF
GTID: 2178360242480636
Subject: Computer application technology
Abstract/Summary:
A gene expression profile is the result of measuring gene transcription, and it reflects the activity of a living cell. Analysis of gene expression profile data is one of the key subjects in bioinformatics. With the development of microarray technology, one of the most recent and important experimental breakthroughs in molecular biology, genetics research now concerns itself more with the relationships among different genes than with the function of a single gene. A microarray lets us look at many genes at once and determine which are expressed in a particular cell type. As a major method of detecting gene expression, the microarray can be used in cancer research, so it is attracting more and more attention.

Because the microarray is a high-throughput device, the gene expression profile data obtained from it are characterized by high dimensionality and strong noise. The purpose of this thesis is to design effective methods of dimensionality reduction and denoising so that the processed gene expression profile data reflect the information needed for cancer classification simply and accurately. First, considering the high dimensionality of microarray data, an improved genetic algorithm for feature gene subset selection is proposed to eliminate redundant genes that have nothing to do with classification. Second, a perturbed regression error sensitivity algorithm with a progressive strategy is proposed for detecting samples with potential labeling errors in gene expression profile data.

Extracting feature genes from microarray expression profiles is an important subject in computational biology. Researchers expect to find genes that relate directly to a particular disease, so effective feature selection methods are needed to eliminate as many redundant genes as possible from a large number of genes.
Feature selection has two key problems to solve: the criterion that defines the best features, and the search method used to find them. In machine learning there are two approaches to feature selection, filter and wrapper. The filter approach evaluates a feature subset with quantitative criteria, whereas the wrapper approach evaluates it by classification accuracy. Based on an improved genetic algorithm (IGA), this thesis proposes a wrapper-model feature selection method that finds a feature gene subset in which the disease-related genes are kept and as many redundant genes as possible are eliminated. In the method, information entropy is used as the separability criterion, and the crossover and mutation operators of the genetic algorithm are improved so that the number of feature genes in the subset can be controlled during the genetic operations. After the microarray expression data are analyzed, an artificial neural network is used to evaluate the feature gene subsets obtained under different conditions. The trend of the fitness index from the improved genetic algorithm is compared with the trend of the index from the neural network. The simulation results show that the classification accuracy of the feature gene subset with the best size and the best number of generations exceeds 80%, so the proposed method can find a relatively optimal feature gene subset that carries more useful and less redundant information. An advantage of this method is that the IGA controls the size of the feature gene subset well without the help of other techniques.

Samples with erroneous labels, that is, samples assigned to the wrong class by mistake, are a kind of noise that often appears in gene expression profiles.
Labeling errors are caused mostly by subjective factors in disease experiments, such as mistakes made by researchers or doctors. Because classification is widely used in cancer diagnosis, mislabeled samples, which badly degrade classification, can be disastrous in medical applications, so an effective and precise method of detecting labeling errors is needed to avoid or minimize the resulting loss. Based on an in-depth study of two existing methods, the classification-stability algorithm (CL-Stability) and the leave-one-out error sensitivity algorithm (LOOE-Sensitivity), this thesis proposes a perturbed regression error sensitivity algorithm based on a progressive strategy (PPRE-Sensitivity). Analysis of the basic idea of LOOE-Sensitivity identifies the cause of its poor performance. To remedy this defect, the perturbed regression error sensitivity algorithm (PRE-Sensitivity) is designed, based on a perturbed regression matrix constructed with a support vector machine. Furthermore, building on a useful characteristic of PRE-Sensitivity, a progressive strategy is added to it: the CL-Stability results and the average perturbed influence quantity serve as the evaluation criteria, and the labels of suspicious samples are corrected step by step to reach the best solution. With the colon and breast gene expression profiles as experimental data sets, the original methods and the methods of this thesis are applied both to the original data sets and to artificial data sets constructed by filtering out the suspicious samples and randomly flipping correct labels. In addition, several kinds of synthetic data sets are designed as supplementary experimental data.
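The label-flipping experiment and the leave-one-out error sensitivity idea can be illustrated with a small sketch. This is not the thesis's PRE-Sensitivity or PPRE-Sensitivity algorithm. It assumes that sensitivity means "flipping one label lowers the leave-one-out error count", and it swaps in a nearest-centroid classifier and well-separated synthetic Gaussian data for the support vector machine and the microarray data sets. All function names are hypothetical.

```python
import random

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def predict(train_X, train_y, x):
    """Nearest-centroid prediction for one sample (binary labels 0/1)."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, t in zip(train_X, train_y) if t == label]
        if not rows:
            continue
        c = centroid(rows)
        dist = sum((a - b) ** 2 for a, b in zip(x, c))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_error(X, y):
    """Number of leave-one-out misclassifications under the given labels."""
    errors = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        errors += predict(rest_X, rest_y, X[i]) != y[i]
    return errors

def suspicious_samples(X, y):
    """Flag samples whose label flip lowers the leave-one-out error."""
    base = loo_error(X, y)
    flagged = []
    for i in range(len(y)):
        flipped = y[:i] + [1 - y[i]] + y[i + 1:]
        if loo_error(X, flipped) < base:
            flagged.append(i)
    return flagged

# Assumed synthetic data: two classes, 5 features, means 0 and 10.
rng = random.Random(0)
true_y = [0] * 20 + [1] * 20
X = [[rng.gauss(10.0 * label, 1.0) for _ in range(5)] for label in true_y]

# Artificial labeling errors: flip three correct labels at random.
flipped_idx = sorted(rng.sample(range(len(true_y)), 3))
noisy_y = [1 - t if i in flipped_idx else t for i, t in enumerate(true_y)]

flagged = suspicious_samples(X, noisy_y)
```

On data this cleanly separated, the flagged indices coincide with the flipped ones. Real expression profiles are far noisier, which is the situation the perturbation-based methods in the abstract are meant to handle.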
Compared with the other methods, the PPRE-Sensitivity algorithm markedly increases the precision of detection while maintaining the highest recall. Its precision is about 10% above the highest precision among the other methods, which indicates that PPRE-Sensitivity is effective and superior.

In summary, based on machine learning methods, this thesis proposes an improved genetic algorithm for selecting feature gene subsets in microarray data and a perturbed regression error sensitivity algorithm based on a progressive strategy for detecting samples with labeling errors in gene expression profile data. Experiments demonstrate the effectiveness of both methods. This research contributes to the analysis of gene expression profile data.
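The size-controlled genetic search in the abstract can be sketched as follows. The abstract does not give the operators, so this is only one possible reading: individuals are gene sets of a fixed size k, crossover draws k genes from the union of the two parents, and mutation swaps one selected gene for an unselected one. The fitness is a stand-in (the sum of per-gene class-mean gaps), not the information entropy criterion of the thesis. All names and parameter values are hypothetical.

```python
import random

def gene_score(X, y, g):
    """Stand-in relevance score: gap between the two class means of gene g."""
    a = [row[g] for row, label in zip(X, y) if label == 0]
    b = [row[g] for row, label in zip(X, y) if label == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))

def crossover(p1, p2, k, rng):
    """Draw k genes from the parents' union, so the child keeps size k."""
    pool = sorted(p1 | p2)
    return frozenset(rng.sample(pool, k))

def mutate(ind, n_genes, rng, rate=0.2):
    """Swap one selected gene for an unselected one; the size is unchanged."""
    if rng.random() >= rate or len(ind) == n_genes:
        return ind
    out_gene = rng.choice(sorted(ind))
    in_gene = rng.choice([g for g in range(n_genes) if g not in ind])
    return (ind - {out_gene}) | {in_gene}

def select_genes(X, y, k, pop_size=20, generations=30, seed=0):
    """Return the best size-k gene subset and the best fitness per generation."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    scores = [gene_score(X, y, g) for g in range(n_genes)]

    def fitness(ind):
        return sum(scores[g] for g in ind)

    pop = [frozenset(rng.sample(range(n_genes), k)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]
        children = [pop[0]]  # elitism: the current best survives unchanged
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            children.append(mutate(crossover(p1, p2, k, rng), n_genes, rng))
        pop = children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

# Assumed synthetic data: 12 samples, 30 genes, genes 0-4 differ by class.
rng = random.Random(1)
y = [0] * 6 + [1] * 6
X = [[rng.gauss(3.0 * label if g < 5 else 0.0, 1.0) for g in range(30)]
     for label in y]

best, history = select_genes(X, y, k=5)
```

Because both operators preserve the subset size, no repair step or size penalty is needed. That is the property the abstract highlights, although the thesis may achieve it differently.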
Keywords/Search Tags: Bioinformatics