In today's society, the large-scale interactive display of network information has led to an exponential increase in data volume. Text is the form of data we encounter most often in daily life, and news is its most common type. Text mining over large document collections can quickly surface the main themes of a corpus, and mining the topics contained in massive text collections is of great practical significance for improving the efficiency of information consumption, enabling precise search, and supporting text classification.

The LDA topic model is a classic, unsupervised method for text topic mining. Through training, it yields the topic probability distribution of every document in the collection and the word probability distribution of every topic. However, the traditional topic model has several defects, which fall into the following aspects: (1) useless words in the original corpus that are highly similar to the stop words interfere with the model; (2) the traditional LDA model extracts topics from word-frequency information such as word co-occurrence, which produces sparse word-vector representations and wastes memory; (3) the traditional LDA model pays insufficient attention to the semantic information of the text: it considers only global semantics and ignores the contextual semantics of the text.

To address these problems, this paper proposes WS-LDA, a topic model based on word embeddings and semantic similarity, which introduces three optimization strategies. (1) A stop-word filtering algorithm based on word-embedding similarity: the algorithm finds words that are highly similar to the original stop words and uses them to expand the stop-word set; removing the expanded stop words from the original corpus effectively improves the quality of the text's semantic representation (a sketch of this expansion step is given below). (2) The word2vec model is introduced to represent word vectors. It maps words with similar contexts to semantically similar vectors and reduces the dimensionality of the word-vector representation, thereby remedying both the lack of contextual semantic information and the sparsity of traditional word-vector representations. (3) The word2vec model and the LDA model are combined to train on the text, uniting the local semantics of the text (word embeddings) with its global semantics (the topic model); this effectively improves the quality of the text's semantic representation and yields topic extraction results with higher semantic accuracy (see the second sketch below).

Extensive experiments on news data sets show that the WS-LDA model obtains more accurate topic extraction results.
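As an illustration of the first strategy, the following is a minimal sketch of embedding-based stop-word expansion. It assumes a corpus of tokenized documents and an initial stop-word list; the function name, the hyperparameters, and the `sim_threshold` value are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of stop-word expansion via embedding similarity.
# All names and hyperparameters here are illustrative assumptions.
from gensim.models import Word2Vec

def expand_stopwords(corpus, base_stopwords, sim_threshold=0.7):
    """Return base_stopwords plus every vocabulary word whose embedding
    similarity to some base stop word reaches sim_threshold."""
    # Train word2vec on the tokenized corpus (placeholder hyperparameters).
    model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=2)
    expanded = set(base_stopwords)
    for word in model.wv.index_to_key:
        if word in expanded:
            continue
        if any(sw in model.wv and model.wv.similarity(word, sw) >= sim_threshold
               for sw in base_stopwords):
            expanded.add(word)
    return expanded

# Removing the expanded list from the corpus before topic modeling:
# corpus = [["breaking", "news", "today"], ...]   # tokenized documents
# stopwords = expand_stopwords(corpus, {"the", "a", "of"})
# filtered = [[w for w in doc if w not in stopwords] for doc in corpus]
```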
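The section does not spell out exactly how word2vec and LDA are coupled, so the sketch below shows just one plausible combination: concatenating each document's LDA topic distribution (global semantics) with the mean of its word embeddings (local semantics) into a single document vector. Treat it as an assumption-laden illustration, not the WS-LDA formulation itself.

```python
# Illustrative (assumed) combination of global and local semantics:
# LDA topic distribution concatenated with the mean word embedding.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

def document_vectors(corpus, num_topics=10, vector_size=100):
    dictionary = Dictionary(corpus)
    bows = [dictionary.doc2bow(doc) for doc in corpus]
    lda = LdaModel(bows, num_topics=num_topics, id2word=dictionary)
    w2v = Word2Vec(sentences=corpus, vector_size=vector_size, min_count=1)

    vectors = []
    for doc, bow in zip(corpus, bows):
        # Global semantics: the document's dense topic distribution.
        topics = np.zeros(num_topics)
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            topics[topic_id] = prob
        # Local semantics: the average of the document's word embeddings.
        embeds = [w2v.wv[w] for w in doc if w in w2v.wv]
        local = np.mean(embeds, axis=0) if embeds else np.zeros(vector_size)
        vectors.append(np.concatenate([topics, local]))
    return np.vstack(vectors)
```

A joint representation of this kind lets downstream tasks (clustering, classification, retrieval) draw on both the corpus-level topic structure and the context-level word semantics at once.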