Sequence Similarity Based On Co-occurrence Word Frequency

Posted on: 2020-08-18
Degree: Master
Type: Thesis
Country: China
Candidate: T T Yu
Full Text: PDF
GTID: 2428330620956746
Subject: Computer application technology
Abstract/Summary:
Next-generation sequencing technology generates a large amount of biological sequence data, which provides great resources for research and application. At the same time, challenges arise. One of the major problems is how to process this huge volume of data quickly and effectively: a great deal of time and effort is needed to annotate these sequences and extract useful information from them. In this work, co-occurrence word frequency is used as the main index for studying sequence similarity, and the research is carried out in three steps. First, a normalized co-occurrence word frequency measure, the Jaccard coefficient, is used to calculate sequence similarity. Then, the co-occurrence word frequency technique is combined with a graph model to calculate sequence weights. Finally, the sequence weights are applied to sequence clustering. The main results are as follows.

(1) A novel text similarity method based on an improved Jaccard coefficient is proposed. Compared with state-of-the-art methods, it increases the accuracy of sequence similarity. During preprocessing, each text is segmented into k-grams (k-mers) by a sliding window, and the frequency of each k-mer in each document is counted. Two texts are then compared: by normalizing the frequencies of the k-mers that co-occur in both texts and analyzing the frequency of each k-mer in the two texts, their similarity is obtained from the improved Jaccard coefficient. The data used in this experiment is the corpus provided by Sogou Laboratory. The results verify the validity of the proposed Jaccard coefficient method. The relation between the length of the k-mers and the similarity is also explored: there is a clear linear relation between the ratio of character repetition and the similarity of two texts, and the accuracy of the similarity can be increased by increasing the length of the k-mers.

(2) SeqRank, a novel method for calculating the weight of sequences based on a graph model, is proposed. Building on the idea of a bipartite graph, the co-occurrence word frequency technique is combined with the graph model to give a sequence weight calculation algorithm, SeqRank, which calculates the importance of each sequence. The similarity of sequences under a one-dimensional projection is then examined to verify the properties of SeqRank. The results show that when sequences are similar, their weights are similar. Experiments on the MLST-8 data set support the same conclusion: similar sequences usually fall in the same cluster. This result verifies the properties of the SeqRank algorithm proposed in this work.

(3) A novel clustering algorithm based on SeqRank is proposed. It is a sequence clustering algorithm that relies on sequence weights. The algorithm builds a sequence-k-mer bipartite graph on the premise that the k-mers have been grouped. First, the importance of each sequence is calculated from its k-mers, and the sequences are sorted in reverse order. Then k sequences (k being the number of centers) are selected evenly from each group and, after de-duplication, these candidate sequences serve as centers. K-means clustering is then applied, and for each cluster the point closest to the current centroid is selected as the sequence center. Finally, the frequency of k-mers in all sequences is taken as the feature, and K-means clustering is performed again to obtain the final clustering result. The effectiveness of the SeqRank clustering algorithm proposed in this thesis is demonstrated by comparing its F1 value and running time with those of state-of-the-art sequence comparison methods such as Afcluster, QCluster, USEARCH, and SSAW.
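The sliding-window k-mer counting and frequency-based comparison described in result (1) can be illustrated with a minimal Python sketch. The abstract does not give the formula of the improved Jaccard coefficient, so the sketch substitutes a common frequency-weighted generalization of Jaccard (sum of per-k-mer minimum counts divided by sum of per-k-mer maximum counts). The function names `kmer_counts` and `weighted_jaccard` and the default `k = 3` are assumptions made for illustration, not the thesis's own definitions.

```python
from collections import Counter


def kmer_counts(text: str, k: int) -> Counter:
    """Count every k-mer produced by sliding a window of width k over text."""
    # A text shorter than k yields an empty range and therefore an empty Counter.
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def weighted_jaccard(a: str, b: str, k: int = 3) -> float:
    """Frequency-weighted Jaccard similarity of two texts over their k-mers.

    Assumption: this is a stand-in for the thesis's improved coefficient,
    whose exact normalization is not stated in the abstract.
    """
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    all_kmers = counts_a.keys() | counts_b.keys()
    if not all_kmers:
        return 0.0  # neither text is long enough to contain a k-mer
    # Counter returns 0 for a k-mer that is absent from one of the texts.
    shared = sum(min(counts_a[m], counts_b[m]) for m in all_kmers)
    total = sum(max(counts_a[m], counts_b[m]) for m in all_kmers)
    return shared / total
```

For example, `weighted_jaccard("ACGT", "ACGA", k=2)` compares the 2-mers {AC, CG, GT} with {AC, CG, GA}: two of the four distinct 2-mers are shared, so the score is 0.5. Identical texts score 1.0 and texts with no common k-mer score 0.0.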
Keywords/Search Tags:Graph Model, Sequence Similarity, Weight Calculation, Clustering, Bipartite Graph