
Research on Sequence Clustering Algorithms Based on Weighted Similarity

Posted on: 2015-11-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D Wu
Full Text: PDF
GTID: 1228330422470561
Subject: Computer application technology
Abstract/Summary:
With the rapid development of fields such as software safety analysis and DNA sequence analysis, sequence clustering analysis has become one of the most promising research directions in data mining. This dissertation analyzes the shortcomings of existing approaches to similarity measurement, initial center selection, and center sequence updating. Research on these problems is important for improving the performance of existing sequence clustering algorithms.

Firstly, because sequences need to be preprocessed into equal-length vectors before clustering, a sequential pattern mining algorithm based on prefix analysis is proposed. If the count of a sequential pattern is smaller than the minimum support count, the pattern is abandoned directly. The number of repeated scans of the projected database is reduced. A prefix-based incremental PrefixSpan algorithm is presented for dealing with dynamic sequence databases. The frequent pattern mining results of the original sequence database are fully reused to improve mining efficiency on the updated sequence database.

Secondly, to address the high time cost of existing sequence similarity measurements, a K-means sequence clustering algorithm based on sequence element similarity and Top-K maximal frequent sequences is introduced. A sequence element similarity measurement based on bitwise operators is designed. Using the KMaxminer algorithm, the Top-K maximal sequence patterns are taken as the K initial clustering centers. Moreover, all sequences are converted into equal-length vectors filled with '0' and '1'. The clustering results are then determined.

Thirdly, to remove the need for the user to input the cluster number K, a K-means sequence clustering algorithm based on weighted sequence element similarity and K-judgment is discussed. A weighted sequence element similarity measurement is defined, and each sequence is assigned to the most similar cluster.
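The abstract does not give the exact form of the bitwise similarity or of the 0/1 encoding, so the following is only a minimal sketch under stated assumptions: bit i of a sequence's vector is assumed to be 1 when the i-th frequent pattern occurs in the sequence as a subsequence, and similarity is assumed to be the Jaccard ratio computed with bitwise AND and OR. The function names and the toy patterns are hypothetical and are not taken from the dissertation.

```python
def is_subsequence(pattern, sequence):
    """Return True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so the order of items is respected.
    return all(item in remaining for item in pattern)


def to_bitmask(sequence, patterns):
    """Encode a sequence as an int: bit i is 1 if patterns[i] occurs in it (assumed encoding)."""
    mask = 0
    for i, pattern in enumerate(patterns):
        if is_subsequence(pattern, sequence):
            mask |= 1 << i
    return mask


def bit_similarity(a, b):
    """Jaccard similarity of two bitmasks via bitwise AND / OR (an assumed measure)."""
    union = a | b
    if union == 0:
        return 0.0  # neither sequence contains any pattern
    return bin(a & b).count("1") / bin(union).count("1")


def assign_to_centers(masks, centers):
    """Assign each encoded sequence to the index of its most similar center."""
    return [
        max(range(len(centers)), key=lambda k: bit_similarity(m, centers[k]))
        for m in masks
    ]


# Toy data (hypothetical): four frequent patterns and four sequences.
patterns = [("a", "b"), ("b", "c"), ("a", "c"), ("d",)]
sequences = ["abc", "abd", "dd", "acb"]
masks = [to_bitmask(s, patterns) for s in sequences]
centers = [masks[0], masks[2]]  # two initial centers, chosen by hand here
labels = assign_to_centers(masks, centers)
```

Storing each vector as a single integer keeps the comparison to one AND, one OR, and two bit counts, which is the kind of saving a bitwise measure is meant to give over element-by-element sequence alignment.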
Afterwards, in each cluster, the clustering center is updated according to the weights of the sequences in that cluster. By comparing the results for all valid values of K, the final valid clustering results are generated.

Fourthly, to reduce the impact of initial center selection on the clustering result, a K-means sequence clustering algorithm based on weighted sequential pattern similarity and two-stage center selection is designed. A new sequence similarity measure function based on weighted sequential patterns is presented: the more important sequential patterns two sequences share, the higher their sequential pattern similarity. The K clustering centers in round T are updated by selecting the mean sequence of the set of higher-similarity sequences that meets certain conditions with respect to the clustering centers of round T-1.

Finally, the frequent sequence clustering algorithm is applied to the analysis of software security vulnerabilities, and a novel software security vulnerability analysis method is proposed. Software security vulnerability sequences are preprocessed by the prefix-analysis-based sequential pattern mining algorithm. Then, by applying the K-means clustering algorithm based on weighted sequential pattern similarity and two-stage center selection to these sequences, K software security vulnerability centers are obtained. Through sequence clustering and similarity matching, it is determined whether a suspected software security vulnerability sequence under analysis is a real vulnerability.

The experimental results show that the mining efficiency of the improved frequent sequential pattern mining algorithm is enhanced. Furthermore, the accuracy of similarity measurement in the sequence clustering method is improved.
The clustering quality, scalability, and efficiency of the optimized sequence clustering algorithms are also improved. In addition, weighted-similarity-based sequence clustering is introduced to software security vulnerability analysis, which not only improves analysis efficiency but also reduces the false positive rate.
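To illustrate the weighted sequential pattern similarity and the similarity-matching step described in the abstract, here is a minimal sketch. The abstract does not state the actual formula, so this assumes that similarity is the total weight of the patterns shared by two sequences divided by the total weight of the patterns found in either one, and that a suspected sequence is flagged when its best similarity to any cluster center reaches a threshold. The pattern weights, the call names used as sequence items, the threshold value, and all function names are hypothetical.

```python
def contains(pattern, sequence):
    """Return True if `pattern` occurs in `sequence` as an ordered subsequence."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def weighted_pattern_similarity(s, t, weights):
    """Weight of patterns shared by s and t over weight of patterns in either (assumed form)."""
    shared = 0.0
    total = 0.0
    for pattern, weight in weights.items():
        in_s = contains(pattern, s)
        in_t = contains(pattern, t)
        if in_s and in_t:
            shared += weight
        if in_s or in_t:
            total += weight
    return shared / total if total else 0.0


def matches_vulnerability_center(suspect, centers, weights, threshold=0.5):
    """Flag a suspected sequence if it is similar enough to any center (assumed rule)."""
    best = max(weighted_pattern_similarity(suspect, c, weights) for c in centers)
    return best >= threshold


# Hypothetical pattern weights; a larger weight marks a more important pattern.
weights = {
    ("open", "write"): 3.0,
    ("read",): 1.0,
    ("free", "free"): 4.0,
}
center = ["open", "write", "free", "free"]
benign = ["open", "read", "write"]
suspect = ["free", "x", "free"]

sim_benign = weighted_pattern_similarity(benign, center, weights)
flag_benign = matches_vulnerability_center(benign, [center], weights)
flag_suspect = matches_vulnerability_center(suspect, [center], weights)
```

With these weights, sharing the single heavy pattern `("free", "free")` counts for more than sharing the lighter `("open", "write")` pattern, which mirrors the statement that sequences sharing more of the important patterns are more similar.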
Keywords/Search Tags: Sequential pattern, Sequence clustering, Software vulnerability, K-means, Similarity, Weight