
Research On Word Embedding Algorithm Using Count-based Models

Posted on: 2018-09-14 | Degree: Master | Type: Thesis
Country: China | Candidate: N Pei | Full Text: PDF
GTID: 2348330512973281 | Subject: Computer technology
Abstract/Summary:
Word Embedding is a key technique in natural language processing that maps the words in a text to low-dimensional numerical vectors. Trained word vectors can serve as complete word features fed into a supervised learning algorithm for a given task, or as a useful supplement to task-specific features. Word Embedding models fall into two main families: predictive models and count-based models. Predictive models construct word vectors with neural networks, but they are relatively slow to train and make poor use of corpus statistics. Count-based models embed words into a low-dimensional vector space using the word co-occurrence matrix, so they have clearer statistical grounding and faster training, while still capturing correlations between words.

This thesis compares Word Embedding algorithms based on the count-based model. By combining two kinds of context, two weighting schemes, and two similarity measures, five count-based Word Embedding algorithms are studied and compared on the word similarity task, and then compared against the predictive Skip-gram model. Experimental results show that most count-based models match or even exceed Skip-gram's performance on the word similarity task.

The thesis then designs and implements a Word Embedding toolkit based on the count-based model. Given a corpus, the user selects the context type, the weighting scheme, the similarity measure, and the vector dimension, and the tool trains the corresponding word vectors, which the user can then apply to different natural language processing tasks.

Finally, to address the subjectivity and labor-intensiveness of manual topic expansion in the information service field, the thesis constructs a topic expansion method using the Word Embedding algorithm and, through case analysis, summarizes the application areas of Word-Embedding-based topic extension. The method improves the accuracy and comprehensiveness of topic extension and better assists users in obtaining information.
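The count-based pipeline described above (a co-occurrence context, a weighting scheme, dimensionality reduction, and a similarity measure) can be sketched roughly as follows. This is a minimal illustration, not the thesis's actual system: the toy corpus, the window size of 2, PPMI weighting, truncated SVD, and cosine similarity are all assumed example choices standing in for the thesis's two contexts, two weightings, and two similarity measures.

```python
# Sketch of a count-based word-embedding pipeline:
# co-occurrence counts -> PPMI weighting -> truncated SVD -> cosine similarity.
# Corpus and all parameter values below are illustrative only.
import numpy as np

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat and a dog played".split(),
]

# 1. Vocabulary and word-word co-occurrence counts within a +/-2 window.
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[idx[w], idx[sent[j]]] += 1

# 2. PPMI weighting: max(0, log P(w,c) / (P(w) P(c))).
total = counts.sum()
pw = counts.sum(axis=1, keepdims=True) / total
pc = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts / total) / (pw * pc))
ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)

# 3. Truncated SVD embeds each word into a low-dimensional space.
U, S, _ = np.linalg.svd(ppmi)
dim = 4
vectors = U[:, :dim] * S[:dim]

# 4. Cosine similarity between two word vectors.
def cosine(a, b):
    v, u = vectors[idx[a]], vectors[idx[b]]
    return float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u) + 1e-12))

print(cosine("cat", "dog"))
```

Evaluating such a model on the word similarity task then amounts to computing `cosine` scores for human-rated word pairs and correlating them with the human judgments.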
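The topic expansion method builds on trained word vectors by treating a topic term as a query and retrieving its nearest neighbors. A hedged sketch of that idea, assuming cosine similarity over word vectors; the `toy_vectors` below are hand-made illustrative values, not trained embeddings or the thesis's actual method:

```python
# Topic expansion sketch: given a seed term, rank all other vocabulary
# words by cosine similarity and keep the top-k as expansion candidates.
import numpy as np

toy_vectors = {
    "computer": np.array([0.90, 0.10, 0.20]),
    "software": np.array([0.80, 0.20, 0.10]),
    "hardware": np.array([0.85, 0.15, 0.25]),
    "banana":   np.array([0.10, 0.90, 0.30]),
    "apple":    np.array([0.30, 0.80, 0.40]),
}

def expand_topic(seed, vectors, k=2):
    """Return the k words most similar to the seed by cosine similarity."""
    s = vectors[seed]
    def cos(v):
        return float(v @ s / (np.linalg.norm(v) * np.linalg.norm(s)))
    ranked = sorted(
        ((w, cos(v)) for w, v in vectors.items() if w != seed),
        key=lambda p: p[1],
        reverse=True,
    )
    return [w for w, _ in ranked[:k]]

print(expand_topic("computer", toy_vectors))
```

Because the neighbors are derived from corpus-wide co-occurrence statistics rather than manual curation, such expansion is less subjective and less labor-intensive, which matches the motivation stated in the abstract.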
Keywords/Search Tags:Word Embedding, Count-based model, Similarity calculation, Word similarity, Topic extension