Research On Vectorized Representation Of Discriminative Capability Of Gene And Gene-based Clustering

Posted on: 2015-02-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Yu
GTID: 1228330467983176
Subject: Computer application technology
Abstract/Summary:
Applying data mining techniques to microarray data is important for finding functional genes, diagnosing tumor types, identifying drug targets, and so on. From a data mining perspective, microarray gene expression data is "ill-conditioned" data: it contains a large number of genes (several thousand at least) but a relatively small number of instances (tens or even fewer). How to find functional genes and disease-causing genes in such data attracts much attention in the field of data mining. Feature selection aims to select the features that are helpful for classification, so common feature selection methods are widely applied to microarray data sets.
However, given the characteristics of microarray data, several problems remain unhandled or ignored when feature selection methods are used to select genes and classifiers are built on the selected genes. First, there is very little work on the learnability analysis of the classifiers used on microarray data, because many classifiers cannot be easily analyzed. Second, the way genes are represented may lead to different selection results. Most traditional selection algorithms represent a gene by a single quantitative value. This value is obtained by taking a maximum or by accumulation, which hides the detailed information relating genes to sample categories, even though this information is important for representing the discriminative capability of a gene. Moreover, traditional selection algorithms have fixed frameworks, so comprehensive representations that are not a single value cannot be used directly in feature selection. Third, fuzzy clustering and similarity measurement are used separately in feature selection. The similarity measurement cannot be applied directly in fuzzy clustering, so the prior knowledge carried by the similarity measurement is unavailable to fuzzy clustering.
Therefore, the research proceeds as follows:
First, from the viewpoint of learning theory, this dissertation establishes the necessity of gene selection and analyzes the learnability of the classifiers used on microarray data. The VC-dimension of a classification model and the upper bound on the number of instances required for probably approximately correct (PAC) learnability are two criteria for evaluating a classifier. Because of differences in theoretical origin and spatial structure, the PAC learnability of not every classifier can be analyzed. This dissertation studies the VC-dimension of the rough hypercuboid classifier (RHC) on gene microarray data; the VC-dimension is derived from the spatial structure of RHC. It follows that RHC can learn the target concept from a polynomial number of instances, with the processing time per instance bounded polynomially, so RHC is PAC-learnable. The VC-dimension of RHC also compares favorably with those of many other classifiers. However, PAC-learning with RHC still requires a large number of instances. We conclude that dimension reduction is an effective way to decrease the number of instances needed and thus to improve the PAC learnability of RHC.
Second, a vectorized representation of the discriminative capability of a gene is proposed to provide comprehensive information about the gene. By recording discriminative information for each category with statistical methods, the gene is represented by a vector. Compared with a single quantitative value, the vectorized representation reflects a gene's discriminative capability on each category separately. The proposed criterion avoids the possible problems of "bias" and "cumulative error". Furthermore, a feature selection method built on the characteristics of this representation is proposed.
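To make the idea of a per-category vector concrete, here is a minimal sketch in Python. The abstract does not say which statistic the dissertation records, so the one-vs-rest standardized mean difference used below is only an illustrative stand-in, and the function name `discriminative_vector` and the toy data are hypothetical.

```python
from statistics import mean, pstdev

def discriminative_vector(values, labels):
    """Return {category: score} for one gene, one one-vs-rest score per category.

    Assumed statistic (not taken from the dissertation):
    |mean(in category) - mean(rest)| / (std(in category) + std(rest)).
    """
    vector = {}
    for cls in sorted(set(labels)):
        inside = [v for v, y in zip(values, labels) if y == cls]
        outside = [v for v, y in zip(values, labels) if y != cls]
        gap = abs(mean(inside) - mean(outside))
        spread = pstdev(inside) + pstdev(outside)
        # A small constant avoids division by zero when both groups are constant.
        vector[cls] = gap / (spread + 1e-12)
    return vector

# Toy expression levels of one gene across six samples from three tumor types.
expression = [5.1, 4.9, 1.0, 1.2, 1.1, 0.9]
tumor_type = ["A", "A", "B", "B", "C", "C"]

vec = discriminative_vector(expression, tumor_type)
# A single-value method would keep only this maximum and lose the other entries.
single_value = max(vec.values())
```

In this toy case the gene separates type A from the rest far better than it separates B or C. The vector keeps that distinction, while the single maximum reports only that the gene is good at something.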
A candidate gene subset is obtained by introducing the quantitative vector of a gene's discriminative capability; this candidate subset retains as much discriminative information as possible. Starting from the candidate subset, a random search strategy guided by the qualitative vector of a gene's discriminative capability finds the final gene subset. Using both the quantitative and the qualitative vector, the method selects genes according to their discriminative capability and ties feature selection to the application of diagnosing tumor types.
Third, an improved feature vector with supervised information is proposed, obtained by enhancing the original feature vector with category information. With this vector, distance-based measures remain dissimilarity measures that capture the relationship between different genes, but prior knowledge is now built into them. Using this new feature vector, fuzzy clustering can be performed on microarray data without changing the framework of the fuzzy algorithm. This eases the limitation of using dissimilarity measures in fuzzy clustering: the clustering can proceed under supervision, and the quality of the description of the spatial distribution improves.
Finally, a partition coefficient and boundary density based validity function (PCBD) is proposed. Unlike most validity functions, which evaluate separation using the density of instances in the boundary and the distance between the boundaries of different clusters, PCBD uses the boundary information between clusters to measure separation. It first finds the nearest cluster of each cluster. It then takes the density at the midpoint between the two clusters as that cluster's separation value. The separation value of the current partition is obtained by accumulating the separation values of all clusters.
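The separation steps just described (nearest cluster, midpoint density, accumulation) can be sketched as follows. This is a sketch under stated assumptions, not the dissertation's definition: the abstract does not define "density", so it is taken here as the fraction of instances within a hypothetical `radius` of a point, and clusters are represented by their centers. The partition coefficient is the standard Bezdek form. How PCBD combines the two terms is not given in the abstract, so they are left as separate functions.

```python
from math import dist

def midpoint(a, b):
    """Midpoint of two points given as coordinate tuples."""
    return tuple((x + y) / 2 for x, y in zip(a, b))

def density(point, data, radius):
    """Assumed density estimate: fraction of instances within `radius` of `point`."""
    return sum(dist(point, x) <= radius for x in data) / len(data)

def boundary_density(centers, data, radius):
    """Accumulate, over all clusters, the density at the midpoint to the nearest cluster."""
    total = 0.0
    for i, center in enumerate(centers):
        # Step 1: find the nearest other cluster center.
        nearest = min(
            (other for j, other in enumerate(centers) if j != i),
            key=lambda other: dist(center, other),
        )
        # Step 2: the density at the midpoint is this cluster's separation value.
        total += density(midpoint(center, nearest), data, radius)
    return total

def partition_coefficient(memberships):
    """Bezdek's partition coefficient: mean over instances of the sum of squared memberships."""
    return sum(u * u for row in memberships for u in row) / len(memberships)

# Toy data: two well-separated groups of four points each.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9), (9, 10), (10, 9), (10, 10)]
good = boundary_density([(0.5, 0.5), (9.5, 9.5)], data, radius=2.0)
bad = boundary_density([(0.0, 0.5), (1.0, 0.5)], data, radius=2.0)
```

Under this density estimate, centers placed in the two real groups have an empty midpoint region, while two centers that split a single group have a crowded one. A lower accumulated value therefore suggests better-separated clusters in this sketch.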
By combining the compactness measurement and the separation measurement, PCBD can efficiently evaluate fuzzy clustering results. With PCBD, the optimal number of clusters is found by searching for the maximum PCBD value over different cluster numbers.
Extensive experiments and analysis were conducted on microarray data. The experimental results show that the vectorized representation of discriminative capability represents genes comprehensively. Compared with most gene selection methods, the proposed method obtains gene subsets with higher prediction accuracy. The quality of the fuzzy clustering and of PCBD is also confirmed by the experiments.
Keywords/Search Tags: microarray data, feature selection, vectorized representation of discriminative capability, fuzzy clustering, clustering validity function