
Research And Application Of The Word Embedding Method

Posted on: 2019-01-19    Degree: Master    Type: Thesis
Country: China    Candidate: F Xu    Full Text: PDF
GTID: 2348330542958069    Subject: Software engineering
Abstract/Summary:
Word embeddings represent words as low-dimensional dense vectors and capture the relationships between words through vector operations, so they are widely used in natural language processing tasks. As a research hotspot in this field, word embedding methods have been studied extensively. However, two problems remain: (1) how to choose an appropriate algorithm for constructing word embeddings; and (2) which factors determine the quality of word embeddings and how that quality can be improved.

For the first problem, this thesis studies and builds a word embedding method based on matrix factorization. The constructed model is compared with the Skip-gram and GloVe models on the word similarity task under different window sizes. Experimental results show that, when constructing the matrix factorization model, cosine similarity outperforms the Hellinger distance as the similarity measure, conditional-probability weighting outperforms word-frequency weighting, and the quality of the similarity matrix before dimensionality reduction is linearly correlated with the quality of the resulting word embeddings.

To identify the factors that determine embedding quality and to improve it, this thesis proposes a word embedding method based on similarity matrix centralization. With this method, the similarities between similar words are relatively strengthened, and the similarities between dissimilar words are relatively weakened. Its effectiveness is verified on the word similarity task: centralization improves the quality of the similarity matrix before dimensionality reduction and thereby the quality of the word embeddings, which reach or exceed those of the Skip-gram model.

Finally, this thesis implements a word embedding system based on the centralization method, trains it on a corpus under different parameter settings, and applies the resulting embeddings to the Chinese Named Entity Recognition task. Experimental results show that the centralization method makes better use of context and improves recognition performance.
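The abstract does not give the construction procedure in detail, so the following is only a minimal sketch of a matrix-factorization pipeline of the kind described: co-occurrence counts within a window are weighted by conditional probability, a cosine similarity matrix is built over the weighted rows, and truncated SVD produces the low-dimensional vectors. The function name, window size, and dimension are illustrative placeholders, not the thesis's actual settings.

```python
import numpy as np

def build_embeddings(sentences, window=5, dim=100):
    """Sketch of a matrix-factorization word embedding pipeline (illustrative)."""
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    # Count co-occurrences within a symmetric context window.
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[idx[w], idx[sent[j]]] += 1.0
    # Conditional-probability weighting P(context | word): normalize each row.
    row_sums = counts.sum(axis=1, keepdims=True)
    cond = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    # Cosine similarity between the weighted context distributions of words.
    norms = np.linalg.norm(cond, axis=1, keepdims=True)
    unit = np.divide(cond, norms, out=np.zeros_like(cond), where=norms > 0)
    sim = unit @ unit.T
    # Dimensionality reduction: truncated SVD of the similarity matrix.
    u, s_vals, _ = np.linalg.svd(sim)
    k = min(dim, len(vocab))
    vectors = u[:, :k] * np.sqrt(s_vals[:k])
    return vocab, sim, vectors
```

For example, `vocab, sim, vecs = build_embeddings([["the", "cat", "sat"], ["the", "dog", "ran"]], window=2, dim=2)` returns one row of `vecs` per vocabulary word.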
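The abstract also does not state the centralization formula. One plausible reading is double centering of the similarity matrix (subtracting row and column means and adding back the grand mean, as in classical MDS), which relatively strengthens above-average similarities and weakens below-average ones; the sketch below uses that interpretation and is illustrative only, not the thesis's exact method.

```python
import numpy as np

def centralize(sim):
    # Double centering (one possible form of similarity matrix centralization):
    # subtract row and column means and add back the grand mean, so that
    # above-average similarities are boosted and below-average ones are suppressed.
    row_mean = sim.mean(axis=1, keepdims=True)
    col_mean = sim.mean(axis=0, keepdims=True)
    return sim - row_mean - col_mean + sim.mean()

def embed_from_centralized(sim, dim=100):
    # Factorize the centralized similarity matrix with an eigendecomposition
    # and keep the top components as low-dimensional word vectors.
    centered = centralize(sim)
    vals, vecs = np.linalg.eigh(centered)
    order = np.argsort(vals)[::-1][:dim]
    top = np.clip(vals[order], 0.0, None)  # drop any negative eigenvalues
    return vecs[:, order] * np.sqrt(top)
```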
Keywords/Search Tags: Word Embedding, Matrix Factorization, Similarity Matrix Centralization, Chinese Named Entity Recognition