| In recent years, as more Chinese-language information is communicated, acquired, and stored on Internet platforms, a large number of non-standard short texts have emerged. Most of these short texts come from Pinyin typing and voice input; they have incomplete grammatical structure and many homophone typos, which complicates data analysis and processing. Processing these texts promptly and effectively and extracting accurate information from them has therefore long been an important task in natural language processing, and short text similarity calculation is one of its key topics. Because manually labeling short text similarity data sets is time-consuming and labor-intensive, unsupervised short text similarity algorithms deserve particular attention. However, their accuracy is not yet high enough, so this paper studies unsupervised Chinese short text similarity algorithms and proposes two improved algorithms to raise their accuracy. The paper analyzes in detail several classical algorithms from two families of similarity algorithms, semantics-based and vector-space-based, compares them with the two proposed algorithms in extensive experiments, and evaluates them on LCQMC, a large-scale Chinese question matching corpus. The first proposed algorithm is a feature expansion algorithm based on the Chinese Synonym Dictionary, which addresses the feature sparsity of short text vectors. It uses the Chinese Synonym Dictionary as an external semantic knowledge base to expand the features of short texts, and it is evaluated in comparative experiments with five common classical vector space models. The experimental results show that feature expansion effectively improves the accuracy of short text similarity: the accuracy and F1-score of the five classical vector space algorithms improve by about 3%. The second proposed algorithm is a fusion algorithm based on word pronunciation and meaning, aimed mainly at colloquial short texts with incomplete structure and homophone typos. It takes the pronunciation, characters, and meaning (including word order, part of speech, etc.) of the words in a short text as text features, constructs feature vectors from them, calculates the similarity for each feature vector, and then fuses these values into a comprehensive semantic similarity score for the text. The result is a fused similarity calculation method that integrates word pronunciation, characters, and meaning. On the LCQMC data set, this method achieves an accuracy of 83.6% and an F1-score of 85.8%, which are 6.5% and 9.8% higher than those of the TF-IDF-weighted sentence vector algorithm and 5.3% and 8.3% higher than those of the SIF-based sentence vector algorithm. The comparative experiments show that this algorithm is a reliable unsupervised similarity algorithm for short Chinese texts. |
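
The two ideas summarized above, synonym-based feature expansion and fusion of pronunciation, character, and meaning similarities, can be illustrated with a minimal sketch. It is not the paper's implementation. The tiny `SYNONYM_GROUP` and `PINYIN` tables are hypothetical stand-ins for the Chinese Synonym Dictionary and a real Pinyin converter, the inputs are assumed to be already word-segmented, the meaning channel is reduced to an expanded bag of words (no word order or part of speech), and the equal-weight average in `fused_similarity` is an assumption, since the abstract does not state the fusion formula.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy resources; a real system would load a synonym dictionary
# and use a proper character-to-Pinyin converter instead.
SYNONYM_GROUP = {"手机": "G1", "电话": "G1", "买": "G2", "购买": "G2"}
PINYIN = {"现": "xian", "在": "zai", "再": "zai", "几": "ji", "点": "dian"}


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (Counter objects)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expand(tokens):
    """Bag of words plus one extra feature per synonym-group hit."""
    feats = Counter(tokens)
    feats.update(SYNONYM_GROUP[t] for t in tokens if t in SYNONYM_GROUP)
    return feats


def fused_similarity(words_a, words_b, weights=(1.0, 1.0, 1.0)):
    """Weighted average of pronunciation, character, and word-level scores.

    The linear weighting is an assumption made for illustration only.
    """
    chars_a, chars_b = "".join(words_a), "".join(words_b)
    scores = (
        # Pronunciation channel: characters mapped to toy Pinyin syllables.
        cosine(Counter(PINYIN.get(c, c) for c in chars_a),
               Counter(PINYIN.get(c, c) for c in chars_b)),
        # Character channel: raw character counts.
        cosine(Counter(chars_a), Counter(chars_b)),
        # Meaning channel: synonym-expanded bag of words.
        cosine(expand(words_a), expand(words_b)),
    )
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Feature expansion: two questions that share meaning but few surface words.
q1, q2 = ["怎么", "买", "手机"], ["怎么", "购买", "电话"]
print(cosine(Counter(q1), Counter(q2)), cosine(expand(q1), expand(q2)))

# Fusion: a homophone typo (在 vs 再) leaves the pronunciation channel intact.
print(fused_similarity(["现在", "几点"], ["现再", "几点"]))
```

In this toy setting, expansion raises the cosine score of the first pair from about 0.33 to 0.60 because the synonym-group features overlap even though the surface words differ. For the homophone pair, the pronunciation channel scores 1.0 while the character and word channels score lower, which is the effect the fused method is meant to exploit on typo-laden input.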