As an advanced knowledge discovering method,clustering has been widely used in many fields.In the area of bioinformatics,clustering has been demonstrated to be especially useful in dissecting gene expression data.A major reason for using clustering as the main method for the analysis of gene expression dat a is that the number of genes is huge,but the number of known genes is still small.It 's impractical to infer unknown gene functions with little known knowledge using supervised methods.Moreover,models trained by small data cannot represent the complete data distribution.Clustering methods,on the other end,do not need prior knowledge,thus is more suitable for gene expression data analysis.As the volume of gene expression data expands,this process becomes inefficient.Therefore,a fast clustering algorithm that can save a lot of experiment time is required.As a popular method for data reduction,random sampling is often used when data is distributed evenly.However,in most real datasets,data obeys power law distribution,so in most cases,random sampling tends to lose small clusters.Even with density biased sampling,the sampled genes' cluster result can't be mapped to the original genes accurately,which will damage the cluster performance badly.So existing efficient cluster methods are not suitable for ever-increasing gene expression data.In this paper,a data compression method for gene expression data clustering inspired by cluster boundary detection is proposed.Since the exist cluster boundary detection algorithms are not efficient,the proposed method first use PCA to change data to low-dimension space,and splits grids in low-dimension data.The cluster boundary detection question is converted to boundary grid detection question.In order to speed up this process,a quadtree index method i s introduced.After forming different sizes of grids,density biased sampling is introduced to determine the compression ratio.By adding up weighted genes expression data in the same grid,compressed representation for each grid is formed.The reason why compression method exceeds sample method is that compression method can restore the original genes clustering result more accurately.According to several experiment results,the pure clustering process takes up about 2% time of the whole process,when clustering with a large k value,the proposed method could be around 2 to 4 times faster than original K-Means.After running more than 50 times of clustering process,the proposed method could be around 12 times faster.When dealing with gene expression data in which situation cluster needs processing several times,compared with Minibatch-Kmeans method,a state of the art efficient cluster method,the proposed grid compression clustering method has better quality and speeds up cluster more prominently,which provides good starting point for the interaction between clustering and gene enrichment analysis. |