
Research On Text Topic Modeling Based On Word Embedding

Posted on: 2021-02-05    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Z H Cao    Full Text: PDF
GTID: 1488306119452814    Subject: Management Science and Engineering
Abstract/Summary:
There is a massive and rapidly growing amount of electronic documents on the Internet, and these documents contain a great deal of valuable information. Topic models can discover topic semantics from large document collections, so that people can better manage and analyze collections of documents and mine interesting or meaningful information. This has important theoretical and practical value and is a research hotspot in the field of text data management. Topic models can be applied to mine topical words from a collection of documents; through these topic words, the main semantics of the document content can be easily understood. Meanwhile, after projecting documents onto the topic-word distributions, the semantics of each individual document can be recovered. Because it is an unsupervised learning method, topic modeling has been widely used in many areas of text management and analysis; for example, results obtained by topic models have been used in information retrieval, document classification, sentiment analysis, automatic text summarization, opinion analysis, and hot-topic discovery and tracking.

Most probabilistic topic models assume that documents are generated according to the Bag-of-Words (BOW) model. BOW only considers word frequencies within documents and ignores word order. This assumption reduces model complexity, but it also means the model only captures word co-occurrence in documents and lacks semantic information such as word similarity and word order, so its results are often not ideal. Moreover, early research on topic models mostly focused on reducing perplexity, the standard evaluation metric, which is often inconsistent with human judgments of topic quality. Researchers therefore proposed topic coherence as an evaluation measure that better reflects the semantics of a sequence of topic words. Word embeddings carry rich semantic information, and applying them to topic mining has become a key research direction in recent years. This thesis first studies the characteristics of word embeddings and improves the negative sampling method used in the original word embedding training models to raise their semantic quality. It then proposes topic models based on the similarity and relatedness features of word embeddings to mine better topics. The major contributions and contents of this thesis are as follows:

(1) The similarity between word embeddings is analyzed and the negative sampling method is improved based on word Pointwise Mutual Information (PMI). Most word embedding models generate two embedding vectors for each word, used as the input and output representations of the word during model learning, and these two embeddings show strong similarity. PMI captures the relationship between a word's input and output embeddings and thus helps explain the similarity and relatedness of word embeddings. To speed up training, Skip-Gram and CBOW use hierarchical Softmax or negative sampling to obtain an approximate solution. Negative sampling is the more efficient training method, but the original scheme has some problems: all words share a single sampling table, and negative samples are concentrated on high-frequency words. In this thesis, positive and negative PMI are used to construct a word-specific negative-sampling vocabulary for each word, and a pre-sampling method is applied to reduce memory usage (a minimal sketch of this idea follows). The experimental results demonstrate that the input and output embeddings of each word are highly similar, and that the semantic quality of word embeddings is improved by the PMI-based sampling.
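The following is a minimal sketch, in Python with NumPy, of the general idea of PMI-guided, per-word negative sampling described above. The toy corpus, the sentence-level co-occurrence counts, the PMI threshold, and the table size are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np
from collections import Counter
from itertools import combinations

# Toy corpus; in practice the corpus and context window are far larger.
corpus = [["topic", "model", "word", "embedding"],
          ["word", "embedding", "negative", "sampling"],
          ["topic", "coherence", "word", "embedding"]]

vocab = sorted({w for doc in corpus for w in doc})
word_cnt = Counter(w for doc in corpus for w in doc)

# Co-occurrence counts within a sentence (a stand-in for a sliding window).
pair_cnt = Counter()
for doc in corpus:
    for a, b in combinations(doc, 2):
        pair_cnt[(a, b)] += 1
        pair_cnt[(b, a)] += 1

total = sum(word_cnt.values())

def pmi(a, b):
    """Pointwise mutual information of a word pair (-inf if never co-occurring)."""
    if pair_cnt[(a, b)] == 0:
        return float("-inf")
    p_ab = pair_cnt[(a, b)] / total
    return np.log(p_ab / ((word_cnt[a] / total) * (word_cnt[b] / total)))

# Word-specific negative-sampling candidates: words whose PMI with the target
# word is non-positive, i.e. words unlikely to be true contexts of it.
neg_candidates = {
    w: ([u for u in vocab if u != w and pmi(w, u) <= 0.0]
        or [u for u in vocab if u != w])        # fallback if every PMI is positive
    for w in vocab
}

# Pre-sampling: draw a small fixed-size table per word in advance, so the full
# candidate lists need not be kept in memory during training.
rng = np.random.default_rng(0)
TABLE_SIZE = 8                                   # illustrative; real tables are larger
neg_table = {w: rng.choice(neg_candidates[w], size=TABLE_SIZE) for w in vocab}

def draw_negatives(word, k=3):
    """Draw k negative samples specific to `word` from its pre-sampled table."""
    return rng.choice(neg_table[word], size=k, replace=False).tolist()

print(draw_negatives("topic"))
```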
(2) Word embedding similarity is used to obtain sets of related words, which are applied as constraints on sparse topic coding to realize General Hierarchical Sparse Topic Coding (GHSTC) and Sparse Hierarchical Sparse Topic Coding (SHSTC). A topic model has many closely related parameters, which makes it difficult to solve, and it also lacks inter-word association information. Removing the word-independence assumption while adding links between words is a common way to improve topic quality. Although sparse topic coding simplifies the representation of model parameters, it also ignores links between words. GHSTC and SHSTC obtain semantically related word sets through word embeddings; the word codes within a related word set are expressed as hierarchical sparsity constraints that act on the topic coding model. In this way, through the sparsity and relatedness of the word codes, the topic-word distributions become sparser and more coherent, the document semantics become more accurate, and the effect of topic coding improves. The experimental results demonstrate that GHSTC and SHSTC improve the quality of topic coding.

(3) The Skip-Gram and Word Embedding Topic Model (SGWE-TM) is constructed using a neural network structure and pre-trained word embeddings. Word embeddings carry rich word semantics, and neural network structures are well suited to using them; however, most neural methods do not make good use of the topical features of word embeddings. Most word embedding models implicitly factorize a word-context matrix whose cells are the PMI values of word-context pairs, while topic model evaluation also relies on the PMI of topic words, so the two are closely related. In the SGWE-TM model, a Softmax function and a Skip-Gram structure are introduced: the former links topic embeddings and word embeddings, and the latter describes the generation of a center word's adjacent words from the center word's topic. In this way, the similarity and relatedness features of word embeddings are implicitly incorporated into topic modeling (a minimal sketch of this structure follows). The experimental results demonstrate that the topic coherence of SGWE-TM is significantly improved and that the relation between a topic and its topic words can be obtained.
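Below is a minimal sketch, in Python with NumPy, of the kind of structure described above: topic embeddings are linked to word embeddings through a Softmax, and a center word's topic generates its adjacent words in Skip-Gram fashion. The dimensions, the randomly initialized "pre-trained" embeddings, and the single gradient step are illustrative assumptions rather than the thesis's actual SGWE-TM model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, D = 50, 5, 16                       # vocabulary size, topics, embedding dim
word_emb = rng.normal(0, 0.1, (V, D))     # stands in for pre-trained word embeddings
topic_emb = rng.normal(0, 0.1, (K, D))    # topic embeddings to be learned

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topic_word_dist():
    """Softmax linking topic and word embeddings: p(w | k) ∝ exp(t_k · v_w)."""
    return softmax(topic_emb @ word_emb.T, axis=1)      # shape (K, V)

def skip_gram_log_likelihood(center_topic, context_words):
    """Skip-Gram-style generation: the center word's topic generates its neighbors."""
    p = topic_word_dist()[center_topic]
    return float(np.sum(np.log(p[context_words] + 1e-12)))

def ascent_step(center_topic, context_words, lr=0.1):
    """One gradient-ascent step on the topic embedding for the observed contexts."""
    p = topic_word_dist()[center_topic]                 # (V,)
    grad = np.zeros(D)
    for c in context_words:
        grad += word_emb[c] - p @ word_emb              # observed minus expected vector
    topic_emb[center_topic] += lr * grad

context = [3, 7, 11]                                    # adjacent-word indices
before = skip_gram_log_likelihood(2, context)
ascent_step(2, context)
after = skip_gram_log_likelihood(2, context)
print(f"log-likelihood: {before:.3f} -> {after:.3f}")   # should increase
```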
(4) T-Skip-Gram and T-CBOW, which combine a topic model with Skip-Gram (CBOW) and learn word embeddings and latent topics in a unified manner, are proposed. Topic models can discover words with multiple senses, and some studies use a word's topic to obtain multi-prototype word embeddings, while some topic models use the similarity and relatedness of word embeddings to mine text topics. All of these approaches work in a pipeline fashion: the later step uses the results of the earlier one and cannot feed back to adjust its parameters. Jointly learning a corpus's topic information and its word embeddings is therefore of important research significance, as such a model can combine the advantages of both sides, obtaining multi-prototype word embeddings from the topic assignments of words and learning document topics from the word and topic embeddings. After topics are embedded, T-CBOW and T-Skip-Gram naturally obtain the variational distribution over a word's topics from the topic and word embeddings; at the same time, the topic embedding and the word embedding are used jointly to predict the generation of adjacent words in the document, which addresses polysemy (a minimal sketch of this prediction step follows). The model can thus be trained jointly, simultaneously obtaining document topics, word embeddings, and topic embeddings. The experimental results demonstrate that T-Skip-Gram and T-CBOW can generate context-aware word embeddings and coherent latent topics in an effective and efficient way.
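A minimal sketch, in Python with NumPy, of the kind of joint update described above: the center word's embedding and its topic embedding together score a context word, and a negative-sampling objective updates the word, topic, and output embeddings in the same step. The additive combination, the logistic objective, and all dimensions are illustrative assumptions, not the thesis's exact T-Skip-Gram formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
V, K, D = 100, 8, 16                      # vocabulary, topics, embedding dim
word_in = rng.normal(0, 0.1, (V, D))      # input word embeddings
word_out = rng.normal(0, 0.1, (V, D))     # output (context) word embeddings
topic_emb = rng.normal(0, 0.1, (K, D))    # topic embeddings, learned jointly

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_step(center, topic, context, negatives, lr=0.05):
    """One negative-sampling update in which the center word embedding and its
    topic embedding are combined (here by addition) to predict a context word."""
    h = word_in[center] + topic_emb[topic]            # topic-aware center representation
    grad_h = np.zeros(D)
    for w, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        err = sigmoid(h @ word_out[w]) - label        # gradient of the logistic loss
        grad_h += err * word_out[w]
        word_out[w] -= lr * err * h                   # update output embedding
    word_in[center] -= lr * grad_h                    # update word embedding ...
    topic_emb[topic] -= lr * grad_h                   # ... and topic embedding jointly

# Illustrative usage: word 5 under topic 3 predicts adjacent word 9,
# against a few randomly drawn negative samples.
negs = rng.integers(0, V, size=4).tolist()
joint_step(center=5, topic=3, context=9, negatives=negs)
```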
Keywords/Search Tags:Topic Model, Word Embedding, Negative Sampling, Point-wise Mutual Information, Sparse, Topic Coding, Topic Coherence, Joint Learning