Similarity Measure And Feature Extraction On Gene Expression Data

Posted on: 2012-03-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W J Wang
GTID: 1228330395457216
Subject: Computer application technology

Abstract/Summary:
DNA microarray technology can simultaneously measure the expression levels of thousands of genes, or even of the whole genome, which provides a new approach to disease diagnosis and treatment at the molecular level. Gene functional classification and disease diagnosis using vast amounts of gene expression data have become a hot topic in bioinformatics.

Clustering is an important means of gene functional classification, and the choice of similarity measure is critical. Classification is an important means of disease diagnosis, but because gene expression data are high-dimensional, effective dimension reduction and feature extraction become a key step before classification. For the tasks of gene clustering and sample classification, this thesis addresses two aspects: gene similarity measures and feature extraction.

The first aspect is the study of similarity measures for gene clustering. Deep relationships between genes are extracted from gene expression levels. The similarity between genes is measured from two different perspectives, individual characteristics and relationship characteristics, and a shortest path-based similarity measure and a network topology-based similarity measure are proposed respectively. Clustering is performed on gene expression data, and the experimental results verify the effectiveness of the proposed methods.

(1) The shortest path-based similarity measure is proposed from the viewpoint of individual characteristics. A network of co-expressed genes is built from gene expression correlation, and the similarity of individual characteristics is obtained by finding the shortest path between genes, with the shortest path length serving as the gene similarity measure. Clustering is performed on yeast data using traditional clustering methods based on this similarity measure, and the results are compared with those of clustering based on Euclidean distance or Pearson's correlation. The results show that the shortest path-based similarity measure achieves better clustering performance.

(2) The network topology-based similarity measure is proposed from the viewpoint of relationship characteristics. A gene relation network is constructed by setting a threshold on gene expression correlation, and the relationship characteristics are then represented by the local topology of the network, with a similarity of these relationship characteristics serving as the gene similarity measure. Clustering is performed on yeast data using traditional clustering methods based on this similarity measure, and the results verify the feasibility of the network topology-based similarity measure.

The second aspect is the study of feature extraction for sample classification. To address the limitations of traditional feature extraction methods on high-dimensional gene expression data, two methods are proposed: a sample space-based feature extraction method and a new discriminant feature extraction method.

(1) To solve the problems of high computational complexity and serious singularity that arise when traditional methods are used for feature extraction on gene expression data, a sample space-based method is proposed. The feature extraction space is converted from the high-dimensional gene space to the low-dimensional sample space through an algebraic transformation, with the optimal transformation vector represented as a linear weighted sum of the samples. This method effectively reduces both the computational complexity of feature extraction and the extent of matrix singularity. The experimental results on gene expression data show the effectiveness of the method.

(2) Fisher Linear Discriminant Analysis (LDA) has two drawbacks: the dimension of the optimal subspace it obtains is restricted by the number of classes, and the computational cost of its covariance matrices is high. To cope with these problems, a new discriminant feature extraction method called Class Preserving Projection (CPP) is presented. The objective function is designed to minimize the average distance between within-class samples while maximizing that between between-class samples, with the class relation between any two samples used to define the weight matrices. The optimal discriminant features are obtained by a linear transformation. Kernel CPP (KCPP) is presented by generalizing CPP to nonlinear space in order to solve the problem of nonlinear feature extraction. Compared with LDA, CPP can obtain a higher-dimensional optimal subspace, and there is no need to calculate covariance matrices. The experimental results on gene expression data verify the feasibility and effectiveness of CPP and KCPP.
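
As an illustration of the shortest path-based similarity measure described in item (1) of the first aspect, the following is a minimal sketch and not the thesis's implementation. The abstract states only that a co-expression network is built from expression correlation and that shortest path length is used as the measure. The threshold rule (link two genes when |r| >= threshold), the edge cost (1 - |r|), the use of the Floyd-Warshall algorithm, and all function and variable names are assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant profile: treat as uncorrelated
    return cov / (sx * sy)


def shortest_path_dissimilarity(profiles, threshold=0.8):
    """All-pairs shortest path lengths in a co-expression network.

    Assumptions (not taken from the thesis): two genes are linked when
    |r| >= threshold, and an edge costs 1 - |r|, so strongly correlated
    genes are close. Unreachable pairs keep the value math.inf.
    """
    n = len(profiles)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        for j in range(i + 1, n):
            # Clamp to 1.0 so rounding can never produce a negative edge cost.
            r = min(1.0, abs(pearson(profiles[i], profiles[j])))
            if r >= threshold:
                dist[i][j] = dist[j][i] = 1.0 - r
    # Floyd-Warshall: relax every pair through every intermediate gene k.
    for k in range(n):
        for i in range(n):
            if dist[i][k] == math.inf:
                continue
            for j in range(n):
                candidate = dist[i][k] + dist[k][j]
                if candidate < dist[i][j]:
                    dist[i][j] = candidate
    return dist


# Toy data (made up): genes 0 and 2 are uncorrelated with each other but
# both correlate with gene 1; gene 3 is uncorrelated with all the others.
profiles = [
    [1.0, 0.0, -1.0, 0.0],
    [1.0, 1.0, -1.0, -1.0],
    [0.0, 1.0, 0.0, -1.0],
    [1.0, -1.0, 1.0, -1.0],
]
dist = shortest_path_dissimilarity(profiles, threshold=0.7)
```

In this toy network genes 0 and 2 have no direct edge, yet they receive a finite distance through gene 1, while gene 3 stays unreachable. The resulting matrix could then be passed to any clustering method that accepts precomputed distances.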
Keywords/Search Tags: gene expression data, similarity measure, feature extraction, CPP