Research On Chinese Word Vector Based On Internal Information Of Words

Posted on: 2021-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: W C Tan
Full Text: PDF
GTID: 2415330623467964
Subject: Statistics
Abstract/Summary:
Word vectors play an important role in natural language processing (NLP) tasks. In essence, they are vectors obtained by mapping the words or phrases of a corpus into a real-valued space. They are a foundation of NLP, and their quality directly affects the performance of downstream tasks, so word vector research has long been an active topic. Research on English word vectors began relatively early and produced many important models, such as Bengio's model, word2vec, and fastText. Research on Chinese word vectors started later, and most of it builds on existing English word vector models. The key to studying Chinese word vectors is understanding how Chinese differs from English. English words are spelled from an alphabet of 26 letters, and the letters themselves carry no semantic information, whereas Chinese is written in a hieroglyphic script whose characters carry a great deal of semantic information. Research on Chinese word vectors therefore focuses on how to exploit the semantic information inside Chinese words.

This paper first uses the hieroglyphic nature of Chinese characters to improve Chinese word vectors from two aspects: character shape (glyph) and pronunciation. For the glyph, it considers each Chinese character that makes up a word; for pronunciation, it considers the toneless pinyin of each character. The CBOW model is used to obtain character vectors and pinyin vectors, which are then added directly to the word vectors obtained by the CBOW model, yielding three sets of word vectors. Word vector quality is evaluated mainly by two methods: the word similarity task and the analogy inference task. Among the three sets, the joint character-pinyin-word vectors perform best on the word similarity task: compared with the word vectors obtained by the CBOW model alone, their word similarity scores on the two evaluation files increased by 9.76% and 3.14%, respectively. In the analogy inference task, the scores of the joint character-pinyin-word vectors on the three types of relations increased by more than 20% over the word vectors obtained by the CBOW model.

Building on this step, the paper considers that glyph and pronunciation influence word meaning to different degrees. It therefore scores the initial character vectors, the initial pinyin vectors, and the initial word vectors on the word similarity task, assigns weights according to these scores, applies the weights to the three groups of vectors, and sums them to obtain new word vectors. The two new sets of word vectors perform better than the unweighted word vectors on both evaluation tasks: their word similarity scores on the two datasets increased by up to 1.56% and 1.97%, and the largest improvement in the analogy inference task reached 4.71%.
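The combination step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not say how the vectors of a multi-character word are aggregated or how task scores become weights, so averaging over characters and normalizing the scores to sum to 1 are assumptions made here. The function names, toy vectors, and scores are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Average equal-length vectors component-wise (assumed aggregation)."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def weights_from_scores(scores: Sequence[float]) -> Tuple[float, ...]:
    """Turn task scores into weights by normalizing them to sum to 1.

    This normalization is an assumption; the abstract only says that the
    weights are derived from word similarity scores.
    """
    total = sum(scores)
    return tuple(score / total for score in scores)


def combine(
    word_vec: Vector,
    char_vecs: Sequence[Vector],
    pinyin_vecs: Sequence[Vector],
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> Vector:
    """Add word, character, and pinyin vectors, optionally weighted.

    The default weights (1, 1, 1) give the plain sum; other weights give
    the weighted variant.
    """
    w_word, w_char, w_pinyin = weights
    char_part = mean_vector(char_vecs)
    pinyin_part = mean_vector(pinyin_vecs)
    return [
        w_word * a + w_char * b + w_pinyin * c
        for a, b, c in zip(word_vec, char_part, pinyin_part)
    ]


# Toy 2-dimensional vectors for a two-character word (values are made up).
word_vec = [1.0, 2.0]
char_vecs = [[0.0, 2.0], [2.0, 0.0]]    # one vector per character
pinyin_vecs = [[1.0, 0.0], [3.0, 2.0]]  # one vector per toneless pinyin

unweighted = combine(word_vec, char_vecs, pinyin_vecs)

# Hypothetical similarity scores for (word, character, pinyin) vectors.
weights = weights_from_scores((0.5, 0.25, 0.25))
weighted = combine(word_vec, char_vecs, pinyin_vecs, weights)

print(unweighted)  # [4.0, 4.0]
print(weighted)    # [1.25, 1.5]
```

In practice the three input vectors would come from trained CBOW models with the same dimensionality, since element-wise addition requires equal lengths.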
Keywords/Search Tags: Word vector, hieroglyph, CBOW, weight algorithm