In today's society, the large-scale interactive display of network information has led to an exponential increase in data volume. Text is the form of data we encounter most often in daily life, and news is its most common type. Text mining over large document collections can quickly surface the main themes of a corpus, and mining the topics contained in massive text collections is of great practical significance for improving the efficiency of information consumption, enabling precise search, and supporting text classification.

The LDA topic model is a classic, unsupervised method for text topic mining. Through training, it yields the topic probability distribution of every document in the collection and the word probability distribution of every topic. However, the traditional topic model has several defects, which fall into the following aspects: (1) useless words in the original corpus that are highly similar to the stop words interfere with the model; (2) the traditional LDA model extracts topics from word-frequency information such as word co-occurrence, which produces sparse word-vector representations and wastes memory; (3) the traditional LDA model pays insufficient attention to the semantic information of the text: it considers only global semantics and ignores the contextual semantics of the text.

To address these problems, this paper proposes WS-LDA, a topic model based on word embeddings and semantic similarity, which introduces three optimization strategies. (1) A stop-word filtering algorithm based on word-embedding similarity: the algorithm finds words that are highly similar to the original stop words and uses them to expand the stop-word set; removing the expanded stop words from the original corpus effectively improves the quality of the text's semantic representation (a sketch of this expansion step is given below). (2) The word2vec model is introduced to represent word vectors. It maps words with similar contexts to semantically similar vectors and reduces the dimensionality of the word-vector representation, thereby remedying both the lack of contextual semantic information and the sparsity of traditional word-vector representations. (3) The word2vec model and the LDA model are combined to train on the text, uniting the local semantics of the text (word embeddings) with its global semantics (the topic model); this effectively improves the quality of the text's semantic representation and yields topic extraction results with higher semantic accuracy (see the second sketch below).

Extensive experiments on news data sets show that the WS-LDA model obtains more accurate topic extraction results.
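As an illustration of the first strategy, the following is a minimal sketch of embedding-based stop-word expansion. It assumes a corpus of tokenized documents and an initial stop-word list; the function name, the hyperparameters, and the `sim_threshold` value are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of stop-word expansion via embedding similarity.
# All names and hyperparameters here are illustrative assumptions.
from gensim.models import Word2Vec

def expand_stopwords(corpus, base_stopwords, sim_threshold=0.7):
    """Return base_stopwords plus every vocabulary word whose embedding
    similarity to some base stop word reaches sim_threshold."""
    # Train word2vec on the tokenized corpus (placeholder hyperparameters).
    model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=2)
    expanded = set(base_stopwords)
    for word in model.wv.index_to_key:
        if word in expanded:
            continue
        if any(sw in model.wv and model.wv.similarity(word, sw) >= sim_threshold
               for sw in base_stopwords):
            expanded.add(word)
    return expanded

# Removing the expanded list from the corpus before topic modeling:
# corpus = [["breaking", "news", "today"], ...]   # tokenized documents
# stopwords = expand_stopwords(corpus, {"the", "a", "of"})
# filtered = [[w for w in doc if w not in stopwords] for doc in corpus]
```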
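The section does not spell out exactly how word2vec and LDA are coupled, so the sketch below shows just one plausible combination: concatenating each document's LDA topic distribution (global semantics) with the mean of its word embeddings (local semantics) into a single document vector. Treat it as an assumption-laden illustration, not the WS-LDA formulation itself.

```python
# Illustrative (assumed) combination of global and local semantics:
# LDA topic distribution concatenated with the mean word embedding.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

def document_vectors(corpus, num_topics=10, vector_size=100):
    dictionary = Dictionary(corpus)
    bows = [dictionary.doc2bow(doc) for doc in corpus]
    lda = LdaModel(bows, num_topics=num_topics, id2word=dictionary)
    w2v = Word2Vec(sentences=corpus, vector_size=vector_size, min_count=1)

    vectors = []
    for doc, bow in zip(corpus, bows):
        # Global semantics: the document's dense topic distribution.
        topics = np.zeros(num_topics)
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            topics[topic_id] = prob
        # Local semantics: the average of the document's word embeddings.
        embeds = [w2v.wv[w] for w in doc if w in w2v.wv]
        local = np.mean(embeds, axis=0) if embeds else np.zeros(vector_size)
        vectors.append(np.concatenate([topics, local]))
    return np.vstack(vectors)
```

A joint representation of this kind lets downstream tasks (clustering, classification, retrieval) draw on both the corpus-level topic structure and the context-level word semantics at once.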