
A Cluster Analysis Method For Gene Expression Data Based On Stable And Sparse Principal Components

Posted on: 2019-04-21   Degree: Master   Type: Thesis
Country: China   Candidate: H Y Qu   Full Text: PDF
GTID: 2430330542484335   Subject: Statistics
Abstract/Summary:
Cancer remains one of the most difficult diseases to cure, and gene expression data from cancer studies are drawing increasing attention from researchers. More and more researchers recognize that different pathological cell types can be identified quickly by classifying gene expression data, so statistical cluster analysis can support disease diagnosis. At present, however, experiments yield only a limited number of samples, while each sample contains measurements for a very large number of genes. Gene expression data are therefore high-dimensional with small sample sizes, and they contain a large amount of noise and irrelevant measurements. Applying existing clustering methods directly to such data often gives poor accuracy.

This thesis proposes a cluster analysis method for gene expression data based on stable sparse principal components, in which the sparse factors are found by a stable selection procedure. Besides capturing maximum variance, the resulting components also have strong explanatory power. The thesis first introduces principal component analysis as a basic tool for visualization and dimension reduction in bioinformatics. However, in the high-dimension, low-sample-size setting typical of molecular data, the principal components may fail to estimate the true direction of maximum variability consistently. Moreover, the loadings are always non-zero, which makes the principal components hard to interpret. Most sparse principal component methods are built on the Lasso theory of variable selection in regression analysis, but it is well known that the Lasso lacks variable selection consistency in high dimensions. The selected genes can therefore be misleading, and the method is not stable. For this reason, the thesis proposes applying stable selection, which combines resampling with forward selection, to sparse principal components.

The three methods discussed above (principal components, sparse principal components, and stable sparse principal components) can each be combined with K-means and hierarchical clustering to cluster GEO data. Finally, the thesis analyzes two gene expression data sets from GEO. The experimental results indicate that the stable sparse principal components give more accurate clustering of gene expression data.
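The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the data are synthetic, the function names and parameters (`k`, `n_resamples`, `threshold`) are hypothetical, and keeping the `k` largest loadings on each half-sample is a simple stand-in for the Lasso-based sparse principal components and forward selection mentioned in the abstract. Genes that are selected in most resamples form the stable support, and the samples are then clustered on the principal component restricted to that support.

```python
import numpy as np

def leading_loading(X):
    """Loading vector of the first principal component (SVD of centered data)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

def stable_support(X, k=10, n_resamples=50, threshold=0.8, rng=None):
    """Genes whose |loading| is among the k largest in at least
    `threshold` of the random half-samples (assumed stand-in for the
    thesis's stable selection step)."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_resamples):
        idx = rng.choice(n, size=n // 2, replace=False)  # half-sample without replacement
        v = leading_loading(X[idx])
        counts[np.argsort(np.abs(v))[-k:]] += 1          # top-k genes in this resample
    return np.flatnonzero(counts / n_resamples >= threshold)

def sparse_pc_scores(X, support):
    """Scores on the first principal component computed from the selected genes only."""
    if support.size == 0:
        raise ValueError("no gene passed the stability threshold")
    Xs = X[:, support]
    return (Xs - Xs.mean(axis=0)) @ leading_loading(Xs)

def two_means_1d(scores, n_iter=20):
    """Plain K-means with K = 2 on one-dimensional scores."""
    centers = np.array([scores.min(), scores.max()], dtype=float)
    for _ in range(n_iter):
        labels = (np.abs(scores - centers[0]) > np.abs(scores - centers[1])).astype(int)
        for c in (0, 1):
            if np.any(labels == c):
                centers[c] = scores[labels == c].mean()
    return labels

# Synthetic high-dimension, low-sample-size data: 40 samples, 200 "genes",
# of which only the first 10 differ between the two groups.
rng = np.random.default_rng(0)
n, p, n_informative = 40, 200, 10
truth = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[truth == 1, :n_informative] += 4.0

support = stable_support(X, k=10, rng=rng)
labels = two_means_1d(sparse_pc_scores(X, support))

# Cluster labels are arbitrary, so compare up to a label swap.
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
print("selected genes:", support)
print("agreement with true groups:", agreement)
```

The selection threshold plays the role that the stability requirement plays in the abstract: a gene enters the sparse component only if it is chosen repeatedly across resamples, which is meant to reduce the misleading selections attributed to a single Lasso fit.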
Keywords/Search Tags: principal components, singular value decomposition, sparse principal components, stable selection, cluster analysis