
Research On Gene Selection Algorithm And DPC Clustering Algorithm Based On Clustering

Posted on: 2016-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: H C Gao
GTID: 2208330473461435
Subject: Computer system architecture
Abstract/Summary:
Feature selection is one of the most important problems in data mining and pattern recognition. Its main goals are to improve the prediction performance of classifiers, to provide faster and more cost-effective predictors, and to give a better understanding of the underlying process that generated the data. Many current datasets are high-dimensional, and most of their features are redundant or irrelevant to the classification target. Such features make the data more costly to store and make it hard to measure the similarity between samples: common similarity measures often break down, which reduces the credibility of classification and clustering results. Fortunately, effective feature (gene) selection algorithms can produce a high-quality feature subset that retains most of the information in the original dataset. Feature selection therefore not only reduces the cost of data storage but also improves the accuracy of classification and clustering results.

Clustering is one of the most important tools for discovering and understanding the world: clustering methods can uncover meaningful and valuable information and reveal hidden patterns and rules. It has been widely used in many fields of science and engineering. Recently, many researchers have shown that merging a clustering strategy into the feature selection procedure can yield high-quality feature subsets. This thesis therefore combines clustering with feature selection to select feature subsets, especially for gene expression datasets with high dimensionality and small sample size. The main innovations of this thesis are as follows:

(1) A novel hybrid feature selection method is proposed. It combines the fast and efficient K-means clustering algorithm with statistical correlation. First, statistical correlation is used to calculate the importance of each gene and to filter out the less important genes. The dataset is then divided into a training set and a testing set by the bootstrap method, and the pre-selected genes of the training set are grouped into clusters. Next, the most representative gene of each cluster, the one with the highest weight or the most votes, is selected, and these genes make up the selected gene subset. Finally, an SVM classifier is built from the training samples restricted to the selected gene subset, and the quality of the subset is evaluated by the classifier's performance on the testing set. Compared with classical algorithms such as SVM-RFE, the proposed method needs only 4% of the time to select the same number of high-quality subsets, which suggests that it can select high-quality subsets in a short time.

(2) To obtain stable gene subsets, an ensemble method is proposed that selects genes with high discriminative power for cancers regardless of differences among the training subsets. The gene subsets obtained by the aforementioned feature selection algorithm on different training subsets are merged into their union, and the top-k most frequently selected genes form the final gene subset. This ensemble method improves both the stability and the quality of the selected gene subsets. Experimental results on 3 popular gene expression datasets illustrate the effectiveness of the algorithm.

(3) To overcome the two defects of the DPC algorithm, a novel clustering algorithm based on K nearest neighbors is proposed. The algorithm redefines the local density, excludes outliers, and develops two new assignment strategies based on the K nearest neighbors of a sample. It then adopts the method used in the DPC algorithm to find the initial cluster centers (peaks) by means of the decision graph. The two new assignment strategies are then applied in turn to assign the remaining points. Theoretical analysis and thorough experiments on several synthetic and real-world test datasets demonstrate that the proposed method finds the correct cluster centers, recognizes clusters regardless of their shape and of the dimensionality of the space in which they are embedded, and is robust to outliers. Its values on three clustering evaluation criteria are higher than those of the original DPC algorithm published in the journal Science.
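As a rough illustration of the decision-graph idea behind DPC, the sketch below computes a local density and a "distance to the nearest denser point" for each sample, picks the points with the largest product of the two as centers, and then assigns the remaining points. It is not the thesis's algorithm: the kNN-based density formula, the function name `knn_density_peaks`, and the parameters `k` and `n_centers` are assumptions made for this example, and the assignment step uses the original DPC rule (follow the nearest denser point) rather than the two new strategies, which the abstract does not spell out.

```python
import math


def knn_density_peaks(points, k, n_centers):
    """Toy density-peaks clustering with an assumed kNN-based local density."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density (assumed form): high when the k nearest neighbours are close.
    rho = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        rho.append(math.exp(-sum(d * d for d in nearest) / k))

    # Visit points from densest to sparsest; ties are broken by index order.
    order = sorted(range(n), key=lambda i: -rho[i])
    rank = {i: r for r, i in enumerate(order)}

    delta = [0.0] * n   # distance to the nearest point with higher density
    parent = [-1] * n   # index of that nearest denser point
    for r, i in enumerate(order):
        if r == 0:
            # The densest point has no denser neighbour: use its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:r], key=lambda j: dist[i][j])
        delta[i] = dist[i][j]
        parent[i] = j

    # Decision graph: centers combine high density with high delta.
    gamma = [rho[i] * delta[i] for i in range(n)]
    centers = sorted(range(n), key=lambda i: (-gamma[i], rank[i]))[:n_centers]

    # Original DPC assignment: inherit the label of the nearest denser point.
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return centers, labels
```

Calling `knn_density_peaks(points, k=2, n_centers=2)` on a list of coordinate tuples returns the indices of the chosen centers and one cluster label per point. In practice the centers are usually read off the decision graph by eye; ranking by the product of density and distance is a common shortcut used here to keep the example short.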
Keywords/Search Tags:feature selection, gene selection, stable gene selection, clustering, DPC clustering
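The frequency-based ensemble step of contribution (2) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function name `top_k_frequent_genes`, the gene identifiers, and the alphabetical tie-breaking rule are all hypothetical, since the abstract only says that the top-k most frequent genes across the per-run subsets are kept.

```python
from collections import Counter


def top_k_frequent_genes(subsets, k):
    """Return the k genes that appear in the most per-run selected subsets."""
    # Count each gene at most once per subset, so a count is a number of runs.
    counts = Counter(gene for subset in subsets for gene in set(subset))
    # Most frequent first; ties broken alphabetically (an assumption).
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:k]


# Hypothetical subsets selected on three different bootstrap training sets.
runs = [["g1", "g2", "g3"], ["g2", "g3", "g4"], ["g3", "g5", "g2"]]
stable_subset = top_k_frequent_genes(runs, k=2)
```

Genes that are selected on many different training subsets are less sensitive to the particular sample split, which is the sense in which the abstract calls the resulting subset stable.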