
Protein Subcellular Interval Prediction Using Improved K-means Algorithm

Posted on: 2018-07-18 | Degree: Master | Type: Thesis
Country: China | Candidate: R L Yang | Full Text: PDF
GTID: 2370330575467109 | Subject: Agriculture
Abstract/Summary:
In predicting protein subcellular intervals, extracting protein sequence features with a bag-of-words model can effectively improve recognition accuracy compared with traditional feature extraction algorithms. K-means is widely used to build the bag-of-words model because it is easy to understand and implement and adapts well to large-scale data sets. However, its clustering results are unstable because the initial cluster centers are selected randomly, and it has limitations on data sets that are not cluster-shaped or are multi-dimensional. As a result, a bag-of-words model built with K-means cannot fully reflect the characteristics of protein sequences, which limits the success rate of protein subcellular interval prediction. Many improved algorithms based on K-means have been proposed in recent years, but most of them neglect the similarity measure that K-means relies on, so they still perform poorly on distributions of arbitrary shape. Improvements to K-means for multi-dimensional data sets have also been limited. To address these problems, this thesis studies improvements to the K-means algorithm. Its main contributions are as follows.

(1) A spatial-density similarity measure K-means algorithm (SMK-means) is proposed to address the inability of K-means to cluster data sets with complex structure effectively. Based on the new similarity measure and a cluster-center iterative model, experiments compare traditional and improved K-means algorithms on non-convex artificial data sets and UCI standard data sets. The results show that SMK-means is more stable and more accurate.

(2) On multi-dimensional data sets, K-means treats every variable equally, which leads to unreasonable clustering results. A weighted spatial-density similarity measure K-means algorithm (W-SMK-means) is therefore developed by incorporating variable weights. Experiments on low-dimensional and multi-dimensional UCI data sets show that W-SMK-means greatly reduces the limitations of K-means on multi-dimensional data. The proposed algorithm is also used to construct the bag-of-words model, and experiments are carried out on the protein sequences of the ZD98 data set. Compared with SMK-means, traditional K-means, and other improved K-means algorithms, W-SMK-means in the bag-of-words model expresses protein sequence features more accurately and improves the accuracy of protein subcellular interval prediction.

(3) To meet the exponential growth of biological data and the need to handle big data in protein subcellular prediction, the weighted spatial-density similarity measure K-means algorithm is parallelized using MPI. Comparing running times on large data sets with different numbers of processes demonstrates the advantage of the parallel algorithm, which improves the efficiency of building bag-of-words models for protein subcellular interval prediction and makes the algorithm more useful in practice.
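
To make the bag-of-words pipeline in the abstract concrete, the following is a minimal sketch under stated assumptions. It uses plain Euclidean K-means, not the spatial-density similarity measure or the weighting scheme proposed in the thesis, whose definitions are not given here. The fragment encoding (amino-acid counts in a sliding window), the window width, the function names, and the two example sequences are all hypothetical choices made for illustration.

```python
import random
from collections import Counter

# Assumption: fragments are encoded over the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fragment_vectors(sequence, width=3):
    """Slide a window over the sequence; encode each fragment as amino-acid counts."""
    vectors = []
    for start in range(len(sequence) - width + 1):
        counts = Counter(sequence[start:start + width])
        vectors.append([float(counts[a]) for a in AMINO_ACIDS])
    return vectors

def squared_distance(p, q):
    """Squared Euclidean distance (the standard K-means similarity measure)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def kmeans(points, k, iterations=20, seed=0):
    """Plain K-means with randomly selected initial centers (seeded for repeatability)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers

def bag_of_words(sequence, centers, width=3):
    """Normalized histogram of the nearest 'visual word' for each fragment."""
    histogram = [0] * len(centers)
    for v in fragment_vectors(sequence, width):
        histogram[nearest(v, centers)] += 1
    total = sum(histogram)
    return [h / total for h in histogram] if total else [0.0] * len(centers)

# Hypothetical toy sequences, used only to exercise the functions.
sequences = ["MKTAYIAKQRQISFVKSHFSRQ", "MGSSHHHHHHSSGLVPRGSH"]
all_fragments = [v for s in sequences for v in fragment_vectors(s)]
codebook = kmeans(all_fragments, k=4)
features = [bag_of_words(s, codebook) for s in sequences]
```

Each sequence ends up as a fixed-length vector regardless of its length, which is what allows a downstream classifier to be trained on it. Changing the seed in `kmeans` changes the initial centers and can change the codebook, which illustrates the instability that the abstract attributes to random initialization.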
Keywords/Search Tags:Protein sequence feature extraction, K-means, Spatial-density similarity measure, Weighting, Parallelization