Protein Subcellular Interval Prediction Using Improved K-means Algorithm

Posted on:2018-07-18

Degree:Master

Type:Thesis

Country:China

Candidate:R L Yang

GTID:2370330575467109

Subject:Agriculture

Abstract/Summary:

In protein subcellular localization ("subcellular interval") prediction, the bag-of-words model is used to extract features from protein sequences; compared with traditional feature extraction algorithms, it can effectively improve recognition accuracy. The K-means algorithm is widely used to build bag-of-words models because it is easy to understand and implement and scales to large data sets. However, its clustering results are unstable because the initial cluster centers are selected randomly, and it performs poorly on non-convex clusters and on multi-dimensional data sets. As a result, a bag-of-words model built with K-means may not fully reflect the characteristics of protein sequences, which limits the success rate of protein subcellular localization prediction. Many improved K-means algorithms have been proposed in recent years, but most of them neglect the similarity measure that K-means relies on, so they still perform poorly on clusters of arbitrary shape. Improvements targeting multi-dimensional data sets have also been limited. To address these problems, this thesis studies improvements to the K-means algorithm. The main contributions are:

(1) A spatial-density similarity measure K-means algorithm (SMK-means) is proposed to address the inability of K-means to cluster data sets with complex structure effectively. It is built on a new similarity measure and a cluster-center iteration model. Traditional and improved K-means algorithms were compared on non-convex artificial data sets and UCI standard data sets; the results show that SMK-means is more stable and more accurate.

(2) On multi-dimensional data sets, K-means weights every variable equally, which leads to unreasonable clustering results. A weighted spatial-density similarity measure K-means algorithm (W-SMK-means) is therefore proposed by incorporating variable weights. Experiments on low-dimensional and multi-dimensional UCI data sets show that W-SMK-means greatly reduces the limitations of K-means on multi-dimensional data. The proposed algorithm was also used to construct the bag-of-words model, and experiments were carried out on the protein sequences of the ZD98 data set. Compared with SMK-means and with the traditional and improved K-means algorithms, W-SMK-means in the bag-of-words model expresses protein sequence features more accurately and improves the accuracy of protein subcellular localization prediction.

(3) To cope with the exponential growth of biological data and the need to process big data in protein subcellular localization prediction, W-SMK-means was parallelized using MPI. Comparing running times on large data sets with different numbers of processes demonstrates the advantage of the parallel algorithm, which makes building bag-of-words models for protein subcellular localization prediction more efficient and makes the algorithm more useful in practice.
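As a rough illustration of the bag-of-words pipeline the abstract describes, the Python sketch below clusters sequence fragments into a codebook and then represents each protein as a histogram over the codebook entries. It uses plain K-means with randomly chosen initial centers, not the thesis's SMK-means or W-SMK-means, whose details the abstract does not give. The fragment encoding (amino-acid composition of length-3 windows), the function names, the parameter values, and the toy sequences are all assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fragment_vectors(seq, width=3):
    """Encode each overlapping window as a 20-dim amino-acid composition vector.

    This encoding is an assumption; the abstract does not specify one.
    """
    vecs = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        vecs.append([window.count(a) / width for a in AMINO_ACIDS])
    return vecs

def sq_dist(a, b):
    """Squared Euclidean distance, the similarity measure of standard K-means."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means with random initial centers (the baseline, not SMK-means)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for j, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers

def bow_histogram(seq, centers, width=3):
    """Normalized count of fragments assigned to each codebook entry ("word")."""
    vecs = fragment_vectors(seq, width)
    hist = [0.0] * len(centers)
    for v in vecs:
        hist[nearest(v, centers)] += 1
    return [h / len(vecs) for h in hist] if vecs else hist

# Toy sequences, invented for this example (not taken from ZD98).
sequences = ["MKTAYIAKQRQISFVKSHFSRQ", "MLLAVLYCLAVFALSSRAG", "MGSSHHHHHHSSGLVPRGSH"]
all_fragments = [v for s in sequences for v in fragment_vectors(s)]
codebook = kmeans(all_fragments, k=4)
features = [bow_histogram(s, codebook) for s in sequences]
```

Each entry of `features` is a fixed-length vector that a downstream classifier could take as input. Replacing `sq_dist` and the center update is where an alternative similarity measure or per-variable weights would plug in.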

Keywords/Search Tags:

Protein sequence feature extraction, K-means, Spatial-density similarity measure, Weighting, Parallelization

Related items

1	Approaches To Feature Information Extraction For Biological Sequences And Their Applications
2	Research On DNA, RNA And Protein Sequence Feature Extraction Method And Its Application
3	The Research On Protein Sequence Feature Extraction And Its Application On Protein Function Prediction
4	Numerical Feature Extraction Of Protein Sequences And Its Applications
5	Feature Extraction And Similarity Measure On Tobacco Near Infrared Spectra
6	Predicting Protein-protein Interactions From Protein Sequence Based On Multiple Feature Extractions
7	Research On Parallelization Of Improved AP Algorithm Based On Spark And Its Application In Protein Complexes Identification
8	Low-similarity Protein Structural Class Prediction Based On Multiple Features
9	Research On Method Of Feature Extraction For Sequence Based On New Expression Pattern And Its Application
10	Study On Feature Extractions And Similarity Of Protein Sequences