Text similarity calculation compares two or more entities (words, short texts, or documents) under a given strategy to obtain a quantitative similarity value. With the rapid development of information technology, mining the massive amount of text on the Internet can deliver valuable content to users through tasks such as text classification and clustering, information extraction, personalized recommendation, and search. Text similarity measures the differences and commonalities between texts and is a key step in all of these tasks. Sequence alignment algorithms originate in bioinformatics. Scholars in China and abroad have applied them to text similarity calculation, and the approach works well on time-series and streaming data. However, the volume of text on the Internet is growing exponentially, and applying sequence alignment to mine meaningful content from these resources still requires a more in-depth analysis of the algorithms themselves.

Applied to text similarity calculation, sequence alignment first segments each of the two Chinese texts into words and represents each text as an ordered word sequence. The two sequences are then arranged one above the other and compared; gap characters may be inserted into either sequence so that as many identical or similar words as possible fall in the same column. Finally, the optimal alignment is obtained recursively and used to calculate the similarity of the two sequences. In view of the problems and shortcomings of existing research, this paper studies how to apply sequence alignment algorithms to Chinese text mining more reasonably and effectively. First, using modeling tools from natural language processing, it constructs a standardized set of Chinese sequences and a scoring matrix for word pairs, which makes the alignment scores more effective and better justified. Second, building on research into global alignment, it applies local alignment and multiple local alignment to the Chinese sequences in order to improve the applicability and accuracy of sequence alignment for Chinese text similarity calculation. Finally, it obtains the optimal solution, traces back the alignment path of that solution, and calculates the similarity of the two Chinese sequences.

To verify the effectiveness of the method, this paper first collects data from online academic resources and online health websites to construct standardized Chinese sequences. It then trains Word2vec on the latest Chinese Wikipedia corpus to construct the scoring matrix of word pairs. Finally, based on the scoring matrix and scoring rules, it uses the proposed method to calculate the similarity between Chinese sequences. The empirical results show that, compared with the traditional method, the proposed method improves the accuracy and effectiveness of sequence alignment in calculating the similarity of Chinese texts.
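The alignment procedure described above (gap insertion, recursive computation of the optimal score, and traceback of the alignment path) can be illustrated with a minimal Smith-Waterman-style local alignment over segmented word sequences. This is only a sketch under stated assumptions, not the implementation used in this paper: the function names, the gap penalty of -1, the exact-match scoring function (a stand-in for a Word2vec-derived word-pair scoring matrix), and the normalization by the shorter sequence length are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

GAP = "-"  # placeholder symbol for an inserted gap (assumption)


def match_score(a: str, b: str) -> float:
    """Stand-in scoring rule: +1 for identical words, -1 otherwise.

    A word-pair scoring matrix (e.g. derived from word embeddings)
    could be substituted here; this simple rule is an assumption.
    """
    return 1.0 if a == b else -1.0


def local_align(
    x: Sequence[str],
    y: Sequence[str],
    score: Callable[[str, str], float] = match_score,
    gap: float = -1.0,
) -> Tuple[float, List[str], List[str]]:
    """Return the best local alignment score and the two aligned segments."""
    n, m = len(x), len(y)
    # H[i][j] = best score of a local alignment ending at x[i-1], y[j-1]
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,                                        # start a new alignment
                H[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align two words
                H[i - 1][j] + gap,                          # gap in y
                H[i][j - 1] + gap,                          # gap in x
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    ax: List[str] = []
    ay: List[str] = []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append(GAP); i -= 1
        else:
            ax.append(GAP); ay.append(y[j - 1]); j -= 1
    return best, ax[::-1], ay[::-1]


def similarity(x: Sequence[str], y: Sequence[str]) -> float:
    """Normalize the local score by the shorter length (assumes max pair score 1)."""
    if not x or not y:
        return 0.0
    best, _, _ = local_align(x, y)
    return best / min(len(x), len(y))


# Two already-segmented Chinese sentences (segmentation itself is out of scope here).
s1 = ["我", "喜欢", "自然", "语言", "处理"]
s2 = ["他", "研究", "自然", "语言", "处理", "技术"]
best_score, seg1, seg2 = local_align(s1, s2)
print(best_score, seg1, seg2, similarity(s1, s2))
```

With a graded scoring function in place of `match_score`, similar but non-identical words would also contribute positively to the alignment, which is the role the word-pair scoring matrix plays in the method described above.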