| With the continuous development of information technology, more and more information is now presented as electronic text data. Facing the explosive growth in data volume, extracting the required information automatically, efficiently, and quickly has become a hot topic in natural language processing research. The popularity of the mobile Internet has made short texts the main form of electronic text. Research on semantic similarity using short-text corpora therefore has abundant research material and broad application value. The task of judging short-text semantic similarity is to decide, for a given set of sentence pairs, whether the two sentences in each pair express similar meanings at the semantic level; it can be regarded as a binary classification problem (similar or dissimilar). At present, algorithmic research on short-text semantic similarity mainly focuses on text representation methods. Among them, the deep-learning-based BERT pre-trained model has been widely studied and applied in many tasks thanks to its flexible training methods and powerful representation capabilities. To improve the short-text representation ability of the BERT pre-trained model, this paper proposes two models, BERT_RF_S and Topic_BERT_S. The main improvements are as follows: 1. When the BERT pre-trained model is applied to short-text similarity judgment, its ability to represent text is limited by the insufficient number of samples. In response, this paper proposes the BERT_RF_S model, which uses the fast gradient method to generate noise samples as additional training input, achieving representation enhancement and improving the model’s representation ability. 2. The BERT pre-trained model encodes only the contextual semantic information of the text and lacks topic information that summarizes the text as a whole. To form a more comprehensive text
representation, this paper proposes a topic model based on a variational autoencoder. The model can be trained without supervision, and the topic representation it generates can be fused with the semantic representation to make up for the lack of topic information at the word level. 3. When the semantic representation is fused with the topic representation generated by the topic model, the result is not ideal because the different models converge at different speeds. To solve this problem, this paper proposes the Topic_BERT_S model based on multi-task learning, which combines the supervised model and the unsupervised model and learns both the semantic information and the topic information of the text. As a result, the BERT_RF_S model and the topic model can converge to their optima at the same time, generating a text feature representation with more comprehensive information for short-text semantic similarity judgment. Finally, the Topic_BERT_S model is applied to standard Chinese and English data sets in the news, finance, and medical domains. The results show that, owing to the improvements of Topic_BERT_S in representation enhancement, the introduction of topic information, and multi-task learning, its similarity judgment accuracy is clearly better than that of the model before the improvements, and that, compared with current advanced algorithms, Topic_BERT_S also performs at the mainstream level. |
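
The representation-enhancement step in improvement 1 can be illustrated with a minimal sketch of a fast-gradient perturbation. This is not the paper's implementation: it assumes a toy logistic classifier over a fixed vector standing in for a BERT sentence-pair embedding, and the names `loss_and_grad`, `fgm_perturb`, and `epsilon` are hypothetical. The perturbation used here is the common L2-normalized form, r = epsilon * g / ||g||, where g is the gradient of the loss with respect to the input representation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_and_grad(x, w, y):
    """Binary cross-entropy of a logistic classifier and its gradient w.r.t. x."""
    p = sigmoid(w @ x)
    loss = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    grad_x = (p - y) * w  # d(loss)/dx for z = w . x
    return loss, grad_x


def fgm_perturb(x, grad_x, epsilon=0.5):
    """Return x shifted by an L2 step of length epsilon along the loss gradient."""
    norm = np.linalg.norm(grad_x)
    if norm == 0.0:
        return x.copy()  # no gradient signal, leave the input unchanged
    return x + epsilon * grad_x / norm


rng = np.random.default_rng(0)
x = rng.normal(size=8)  # assumed stand-in for a pooled sentence-pair embedding
w = rng.normal(size=8)  # assumed stand-in for classifier weights
y = 1.0                 # label: the pair is "similar"

clean_loss, g = loss_and_grad(x, w, y)
x_adv = fgm_perturb(x, g)                 # the "noise sample"
adv_loss, _ = loss_and_grad(x_adv, w, y)  # loss on the perturbed input
```

In a training loop of this kind, the loss on the perturbed input would typically be added to the loss on the clean input before the parameter update, so that the model sees both the original sample and its noisy counterpart.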