
Research On Word Embedding Algorithm Using Count-based Models

Posted on: 2018-09-14 | Degree: Master | Type: Thesis
Country: China | Candidate: N Pei | Full Text: PDF
GTID: 2348330512973281 | Subject: Computer technology
Abstract/Summary:
Word Embedding is a key technique in natural language processing that maps the words in a text to low-dimensional numerical vectors. Trained word vectors can serve as complete word features fed into a supervised learning algorithm for a given task, or as a useful supplement to task-specific features. Word Embedding models fall into two main families: predictive models and count-based models. Predictive models construct word vectors with neural networks, but they are relatively slow to train and make poor use of corpus statistics. Count-based models embed words into a low-dimensional vector space using the word co-occurrence matrix, so they have clearer statistical grounding and faster training, while still capturing correlations between words.

This thesis compares Word Embedding algorithms based on the count-based model. By combining two kinds of context, two weighting schemes, and two similarity measures, five count-based Word Embedding algorithms are studied and compared on the word similarity task, and then compared against the predictive Skip-gram model. Experimental results show that most count-based models match or even exceed Skip-gram's performance on the word similarity task.

The thesis then designs and implements a Word Embedding toolkit based on the count-based model. Given a corpus, the user selects the context type, the weighting scheme, the similarity measure, and the vector dimension, and the tool trains the corresponding word vectors, which the user can then apply to different natural language processing tasks.

Finally, to address the subjectivity and labor-intensiveness of manual topic expansion in the information service field, the thesis constructs a topic expansion method using the Word Embedding algorithm and, through case analysis, summarizes the application areas of Word-Embedding-based topic extension. The method improves the accuracy and comprehensiveness of topic extension and better assists users in obtaining information.
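The count-based pipeline described above (a co-occurrence context, a weighting scheme, dimensionality reduction, and a similarity measure) can be sketched roughly as follows. This is a minimal illustration, not the thesis's actual system: the toy corpus, the window size of 2, PPMI weighting, truncated SVD, and cosine similarity are all assumed example choices standing in for the thesis's two contexts, two weightings, and two similarity measures.

```python
# Sketch of a count-based word-embedding pipeline:
# co-occurrence counts -> PPMI weighting -> truncated SVD -> cosine similarity.
# Corpus and all parameter values below are illustrative only.
import numpy as np

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat and a dog played".split(),
]

# 1. Vocabulary and word-word co-occurrence counts within a +/-2 window.
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[idx[w], idx[sent[j]]] += 1

# 2. PPMI weighting: max(0, log P(w,c) / (P(w) P(c))).
total = counts.sum()
pw = counts.sum(axis=1, keepdims=True) / total
pc = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts / total) / (pw * pc))
ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)

# 3. Truncated SVD embeds each word into a low-dimensional space.
U, S, _ = np.linalg.svd(ppmi)
dim = 4
vectors = U[:, :dim] * S[:dim]

# 4. Cosine similarity between two word vectors.
def cosine(a, b):
    v, u = vectors[idx[a]], vectors[idx[b]]
    return float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u) + 1e-12))

print(cosine("cat", "dog"))
```

Evaluating such a model on the word similarity task then amounts to computing `cosine` scores for human-rated word pairs and correlating them with the human judgments.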
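The topic expansion method builds on trained word vectors by treating a topic term as a query and retrieving its nearest neighbors. A hedged sketch of that idea, assuming cosine similarity over word vectors; the `toy_vectors` below are hand-made illustrative values, not trained embeddings or the thesis's actual method:

```python
# Topic expansion sketch: given a seed term, rank all other vocabulary
# words by cosine similarity and keep the top-k as expansion candidates.
import numpy as np

toy_vectors = {
    "computer": np.array([0.90, 0.10, 0.20]),
    "software": np.array([0.80, 0.20, 0.10]),
    "hardware": np.array([0.85, 0.15, 0.25]),
    "banana":   np.array([0.10, 0.90, 0.30]),
    "apple":    np.array([0.30, 0.80, 0.40]),
}

def expand_topic(seed, vectors, k=2):
    """Return the k words most similar to the seed by cosine similarity."""
    s = vectors[seed]
    def cos(v):
        return float(v @ s / (np.linalg.norm(v) * np.linalg.norm(s)))
    ranked = sorted(
        ((w, cos(v)) for w, v in vectors.items() if w != seed),
        key=lambda p: p[1],
        reverse=True,
    )
    return [w for w, _ in ranked[:k]]

print(expand_topic("computer", toy_vectors))
```

Because the neighbors are derived from corpus-wide co-occurrence statistics rather than manual curation, such expansion is less subjective and less labor-intensive, which matches the motivation stated in the abstract.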
Keywords/Search Tags:Word Embedding, Count-based model, Similarity calculation, Word similarity, Topic extension