
Support Vector Clustering Method And Its Applications To Biomedical Datasets

Posted on: 2010-12-23    Degree: Master    Type: Thesis
Country: China    Candidate: Y Sun    Full Text: PDF
GTID: 2178360272495948    Subject: Computer application technology
Abstract/Summary:
Clustering analysis is an unsupervised machine learning method. It can discover the latent structure of a dataset and classify its samples without any prior knowledge, and can therefore help people gain new knowledge. As the algorithms improve, the range of applications keeps expanding, so clustering analysis remains one of the hot spots in data mining research. It has been widely used in areas such as pattern recognition, image processing, data compression, machine learning, statistics, and bioinformatics.

There is no single best clustering method. First, methods built on different similarity definitions often give different results, and most methods depend on the distribution of the dataset. Second, the existing methods are not perfect, and their time and space complexity need to be improved. In practical applications, a good clustering method should have low time complexity and be applicable to datasets with as many distribution patterns as possible.

Support Vector Clustering (SVC) is a kernel-based unsupervised clustering method proposed by Ben-Hur in 2001. It has many advantages over traditional clustering methods; however, the computational complexity of its cluster-assignment procedure is considerable, which limits the application of SVC.

After studying many clustering algorithms, we propose a novel support vector clustering method based on k-means (SVC-KM). Its main process is: first, map the sample points into a Hilbert space with a nonlinear kernel function and find the smallest hypersphere that encloses all the samples; then apply a minimum spanning tree pruning (MSTP) strategy to the intra-cluster (non-outlier) points to automatically initialize the number of clusters and the cluster centroids; finally, run the k-means algorithm on the outlier-free subset to obtain the clustering result and assign each outlier to its nearest cluster. This improvement differs substantially from earlier work. The new algorithm not only inherits the characteristics of the original SVC algorithm, such as few parameters and automatic determination of the number of clusters, but also runs noticeably faster, and is especially suitable for high-dimensional, small-sample-size data. Since each of the three component steps can complete a clustering analysis on its own, combining them well broadens the algorithm's applicability and improves its robustness. Theoretical analysis shows that SVC-KM avoids the adjacency-matrix computation by using the MSTP strategy and k-means for cluster assignment, so its time complexity is lower than that of the original SVC algorithm. In addition, because the result is not directly tied to changes in the parameters, the algorithm's dependence on the parameters is weakened.
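The following is a minimal sketch of the SVC-KM procedure just described, written in Python with scikit-learn and SciPy as stand-ins; the function name svc_km, the kernel width, the nu parameter, and the pruning quantile are illustrative assumptions rather than the thesis implementation (for an RBF kernel, the one-class SVM used here is equivalent to the minimum enclosing hypersphere of SVDD).

# Sketch of the SVC-KM pipeline; parameter values are illustrative, not the thesis settings.
import numpy as np
from sklearn.svm import OneClassSVM          # equivalent to SVDD for an RBF kernel
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def svc_km(X, gamma=1.0, nu=0.1, prune_quantile=0.9):
    # 1) Kernel step: fit the minimum enclosing hypersphere in feature space;
    #    points with a negative decision value are treated as outliers.
    sphere = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(X)
    inlier_mask = sphere.decision_function(X) >= 0
    X_in, X_out = X[inlier_mask], X[~inlier_mask]

    # 2) MST pruning on the inliers: cut edges longer than a quantile threshold;
    #    the surviving connected components give k and the initial centroids.
    dist = squareform(pdist(X_in))
    mst = minimum_spanning_tree(dist).toarray()
    edges = mst[mst > 0]
    mst[mst > np.quantile(edges, prune_quantile)] = 0
    k, comp = connected_components(mst, directed=False)
    init_centers = np.vstack([X_in[comp == c].mean(axis=0) for c in range(k)])

    # 3) k-means on the outlier-free subset, seeded with the MST-derived centroids.
    km = KMeans(n_clusters=k, init=init_centers, n_init=1).fit(X_in)

    # 4) Assign each outlier to the nearest cluster centre.
    labels = np.empty(len(X), dtype=int)
    labels[inlier_mask] = km.labels_
    if len(X_out):
        d = np.linalg.norm(X_out[:, None] - km.cluster_centers_[None], axis=2)
        labels[~inlier_mask] = d.argmin(axis=1)
    return labels, k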
To confirm the feasibility and validity of SVC-KM, we applied SVC, k-means, and SVC-KM to five Gaussian-mixture datasets with the same distribution but different sample sizes or dimensions. The results show that, when the same number of clusters is obtained, SVC is the slowest on all five datasets. Comparing the three datasets with the same sample size but different dimensions, the running time of k-means increases markedly as the dimension grows, whereas the original SVC algorithm shows no obvious change. SVC-KM requires the shortest running time of the three algorithms and obtains very good results. The experiments also clarify the applicable domain of each method: the original SVC algorithm is well suited to high-dimensional, small-sample-size datasets, but its running time is excessively long; the k-means algorithm is well suited to low-dimensional, large-sample datasets, but slows down as the dimension increases; by reasonably combining these two classical clustering algorithms, SVC-KM is the fastest of the three.

SVC-KM has no particular bias toward the data distribution: it still obtains very good results on non-spherical data, whereas k-means works well on spherically distributed data but is sensitive to the choice of initial cluster centers. SVC-KM can therefore be applied to more datasets than k-means. When clustering the Iris dataset, SVC-KM treats some samples as outliers and removes their influence on the final cluster centers, so its assignment contains 11 misclassifications compared with 16 for k-means. This shows that SVC-KM handles datasets with noise or overlapping clusters well.

In recent years, high-throughput technologies represented by the gene chip have developed rapidly, and the volume of biomedical data has grown enormously. Mining these data effectively can not only raise the efficiency of information management but also help people discover the latent biological knowledge in the data and deepen their understanding of biomedical phenomena. Even for the same disease, the treatment differs according to the site of onset, the pathological type and subtype, the duration of illness, and the development trend. One important application of biomedical data mining is cancer subtype recognition, which can help to elucidate cancer pathogenesis, seek the cause of the disease, and provide a basis for diagnosis and prognosis.

We tested SVC-KM on three biomedical datasets (MTTD, DLBCL, and Leukemia). The algorithm partitions MTTD into 6 clusters, accurately recognizing the prostate and colon tissues as two distinct classes and further splitting the breast and lung tissues into two subclasses each. Only 2 breast samples are mistaken for the lung class, and the other 101 samples are grouped correctly, giving a Rand index of 0.95 on MTTD; the Rand indices of the DLBCL and Leukemia clustering results reach 0.92 and 0.89, respectively, with SVC-KM accurately finding 6 subclasses in DLBCL and 3 in Leukemia. In further experiments, assuming the actual classification to be unknown and using only the Silhouette index, the best results we obtained still have biomedical clinical significance. This shows that SVC-KM can automatically find clusterings that are highly interpretable and conform to the intrinsic structure of the data; in essence, the clustering produced by SVC-KM matches the actual classification.
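The two evaluation measures used above can be computed as in the short sketch below; the rand_index helper and the commented call pattern are illustrative assumptions rather than the thesis code, and true class labels are only required for the Rand index.

from itertools import combinations
from sklearn.metrics import silhouette_score   # internal measure, no true labels needed

def rand_index(labels_true, labels_pred):
    # Fraction of sample pairs on which the two labelings agree:
    # the pair is either together in both labelings or separated in both.
    n = len(labels_true)
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / (n * (n - 1) / 2)

# Example use (X, y_true, and the svc_km sketch above are assumed):
# labels, k = svc_km(X, gamma=0.5)
# print(rand_index(y_true, labels))        # external validation against known classes
# print(silhouette_score(X, labels))       # internal validation when classes are unknown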
Keywords/Search Tags: Support Vector Clustering, Minimal Spanning Tree Pruning, k-means algorithm, SVC-KM, Tumor Classification