Font Size: a A A

Study On Clustering Algorithm And Its Applications

Posted on:2011-11-08Degree:DoctorType:Dissertation
Country:ChinaCandidate:D M TangFull Text:PDF
GTID:1118360308467203Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Pattern recognition, also known as classification or statistical classification, the aim of pattern recognition is to build a classifier that can determine the class of data input. Clustering, also known as unsupervised classification, is an important unsupervised pattern recognition method. Clustering is one of the most widely used unsupervised data analysis method for exploratory data analysis, and it has been extensively applied in variety fields, such as computer vision, statistics, image processing, medical, biology, social sciences and psychology. The basic concept of clustering is to divide patterns into different groups (clusters). The patterns in the same share more similarity comparing with the patterns in other clusters.In this thesis, we focus on some issues of clustering algorithm and applications for bioinformatics, large-scale location problem, irregularly shaped data distribution. The main ideas and contributions of this thesis are summarized as follows.(1) Serial analysis of gene expression (SAGE) is a powerful tool to obtain gene expression profiles. Dr. Victor Velculescu provided the original technique in 1995. Clustering algorithms provide a useful tool to explore the potentially novel and significant transcript or gene groups in SAGE data. Although a number of clustering methods for SAGE data have been proposed, most of these methods employ some user-defined parameters; therefore, the results may highly depend on these parameters. In this thesis, we propose an adaptive clustering method for SAGE data analysis, namely, PoissonAPS. The method incorporates a novel clustering algorithm, Affinity Propagation (AP). AP has many strong points, but it is hard to find suitable parameters in practice. By using the clustering validation measure as a cost function of merging and splitting, the PoissonAPS overcomes the limitations of AP. The PoissonAPS can automatically cluster SAGE data without user-specified parameters. We evaluated our method and compared its performance with other methods on several real life SAGE datasets. The experiments show that our method can produce meaningful and interpretable clusters of SAGE data.(2) With the advances of high-throughput sequencing, the amount of protein sequences in databases has been increasing explosively. However, the speed of the experimental characterization of a protein is less than that of the generation of new-sequenced protein; the majority of new-sequenced protein sequences have no experimental evidence for their functions. Because experimental determination is time consuming and expensive, we would like to annotate new-sequenced protein sequences by homology to sequences of known function based on sequence similarity, without relying on experimental evidence. Clustering method provides valuable insight into the sequences by dividing the sequences into groups of protein families. In recent years, many clustering algorithms have been proposed to identify and annotate protein sequences. It may be necessary to perform a comparative study of these algorithms. The results of the evaluation may provide valuable insight for bioinformaticians and biologists of the investigated clustering algorithms. In this thesis, we present a comparative experiment of four typically algorithms for protein sequences. We conducted four types of experiment for each algorithm: Default parameters experiment; Order sensitivity experiment; Effect of the"True"cluster sizes on algorithms; Parameters tuning. The experimental results are useful for choosing the most suitable clustering algorithm for a specific application. Finally, the results show that there is still room for much improvement.(3) As modern high-throughput sequencing technologies continue to improve, there is an overwhelming amount of protein sequences un-annotated in the biomedical databases. Clustering protein sequences into homologous groups can help to annotate uncharacterized protein sequences. In this thesis, we introduce an online cluster analysis method for large-scale protein sequences based on online clustering algorithms and alignment-free similarity measure, namely, OnlineCAPS. The OnlineCAPS has many advantages, such as the memory requirements and computation cost are very low. The method is fast and enables us to extract clusters from a large-scale set of protein sequences, and can be deployed on the web server, and can perform clustering progress when uploading sequences dataset. The experimental results illustrate the efficiency of the proposed method.(4) Location problem is a well-studied problem in operations research. By treating location problem as a clustering problem, combining Affinity Propagation clustering algorithm and mapping information of candidates into feature vector, we present two methods for selection suitable situations from candidates: Location method based on region division; Location method based on road network. We evaluated two methods using synthetic data sets as well as real-world data sets. The experimental results show that two methods can solve location problem with both fixed and unfixed number facilities. In addition, the proposed methods can solve large location problems, and provide good solutions.(5) It is a difficult task in pattern recognition to detect the natural clusters for irregularly shaped data distribution. We propose an efficient clustering algorithm for irregularly shaped clusters based on the advantages of spectral clustering and Affinity Propagation (AP) algorithm. We give a new similarity measure based on neighborhood dispersion analysis. The proposed algorithm is a simple but effective method. The experimental results on several data sets show that the algorithm can detect the natural clusters of data input sets, and the results of the clustering analysis agree well with that of human judgment.
Keywords/Search Tags:pattern recognition, clustering, protein sequences, serial analysis of gene expression, location problem
PDF Full Text Request
Related items