The Improvement Of Test Set And The Linguistic Evaluation Of Chinese Word Embedding

Posted on: 2020-12-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Wang
GTID: 2405330575465434
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
As a core problem of natural language processing, language representation, and especially the representation of words, has produced many results, the most notable of which is word embedding. Word embedding maps the words in a training corpus to low-dimensional dense vectors that carry some semantic information, and it therefore has broad application prospects. For any model, effective evaluation methods are crucial. However, the evaluation methods for word embeddings, especially Chinese word embeddings, are still imperfect. First, there are few evaluation test sets for Chinese word embeddings, and the existing ones leave room for improvement. Second, the evaluation of Chinese word embeddings is mostly task-oriented and lacks a more intuitive and systematic presentation and analysis. To address these problems, this thesis analyzes and integrates the existing Chinese similarity test set resources and resolves problems in both the selection of test words and the scores in the test set. In addition, by analyzing the characteristics of word embeddings, it adopts the semantic relations of linguistics as a perspective and designs an evaluation method for word embedding models from that perspective. The specific work is as follows.

The first chapter is the introduction. It first presents the research background and identifies the main subject of the research. It then surveys and summarizes existing research on word embedding models and their evaluation methods, which leads to the ideas, significance, and methods of this thesis. On this basis, it outlines the work content and the organizational structure of the thesis.

The second chapter introduces the background knowledge of word embedding models and puts it into practice by training embeddings. Its purpose is to provide theoretical support and evaluation samples for the third and fourth chapters. In the theoretical part, the different representations of words are first briefly introduced. Second, language models and the basic principles of neural networks are reviewed. Then the implementation forms of neural network language models are explained. Finally, Word2Vec, the word embedding training tool used in this experiment, is introduced. In the practical part, the source and characteristics of the corpus and the reasons for selecting it are first briefly introduced. Second, the preprocessing of the experimental corpus is demonstrated in detail, including conversion to simplified Chinese, word segmentation, and removal of stop words. Then the experimental parameters, the experimental environment, the main training program, and the training results are presented. Finally, the several word embedding models trained here are introduced and compared.

The third and fourth chapters are the main part of this thesis. The third chapter improves the manually scored Chinese similarity test set. Theoretical analysis and a questionnaire survey show that existing Chinese test sets have two main deficiencies. First, people's similarity scores are affected by the meanings of the words, so that word pairs that are related but not similar receive inflated similarity scores. Second, the selection of test words is unreasonable in places: there are few test words, and some words occur repeatedly. To solve the first problem, this thesis proposes to use "HowNet" and "Synonym Word Forest" to correct the scores. The semantic similarity algorithm based on "HowNet" and the one based on "Synonym Word Forest" are first introduced, and the effectiveness of both is verified theoretically and practically. Then a scheme for improving the manual test set that combines the HowNet algorithm and the word forest algorithm is proposed. For the second problem, inappropriate word pairs in the original test set are deleted and some new word pairs are added. Through this work, a new Chinese similarity test set, Wordsim306, is formed. Finally, the new test set is put to use, and the similarity test shows the differences in quality among the word embeddings trained here.

The fourth chapter proposes a new evaluation method for word embeddings. First, the characteristics of word embeddings are analyzed according to the distributional hypothesis. Then, based on these characteristics, a new evaluation perspective is proposed: from the perspective of linguistic semantic relations, the evaluation is divided into monosemy, polysemy, equivalence, synonymy, and hypernymy and hyponymy. In each aspect, the quality of the word embeddings is evaluated by nearest neighbor analysis. Finally, this evaluation method is applied to the word embedding models trained here, and the influence of corpus size and training method on the word embeddings is analyzed.

The fifth part is the conclusion, which reviews and summarizes the work done in this thesis and sets out directions for improvement and subsequent research. This thesis improves the evaluation of word embeddings in several ways and applies the improvements, which facilitates the quality evaluation of word embeddings and enriches research on word embeddings themselves. It is also a useful attempt to apply linguistic theories to practical problems.
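
The two evaluation ideas named in the abstract, scoring embeddings against a manually rated similarity test set and inspecting nearest neighbors, might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which similarity measure or correlation statistic the thesis uses, so cosine similarity and Spearman rank correlation are assumed here, and the toy vectors, word pairs, scores, and function names are all hypothetical (none of them come from Wordsim306 or the trained models).

```python
import math

# Toy 3-dimensional vectors standing in for trained embeddings (hypothetical values).
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
    "bus": [0.0, 0.8, 0.3],
}

# Hypothetical human similarity ratings for word pairs (not taken from Wordsim306).
TEST_SET = [
    ("cat", "dog", 8.5),
    ("car", "bus", 8.0),
    ("cat", "car", 1.5),
    ("dog", "bus", 1.0),
]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(embeddings, test_set):
    """Correlate model similarities with the human ratings in the test set."""
    model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in test_set]
    human_scores = [score for _, _, score in test_set]
    return spearman(model_scores, human_scores)


def nearest_neighbors(embeddings, word, k=2):
    """Return the k words most similar to `word`, most similar first."""
    others = [
        (w, cosine(embeddings[word], v))
        for w, v in embeddings.items()
        if w != word
    ]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    print("Spearman correlation:", evaluate(EMBEDDINGS, TEST_SET))
    print("Neighbors of 'cat':", nearest_neighbors(EMBEDDINGS, "cat"))
```

A higher correlation would suggest that the embeddings order word pairs more like the human raters do, while the neighbor list gives the kind of direct, inspectable view that a task-oriented score does not.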
Keywords/Search Tags:word embedding, manual test set, sense relations, nearest neighbor analysis