
Research On The Semantic Representation And Learning Of Chinese Words

Posted on: 2022-05-22
Degree: Master
Type: Thesis
Country: China
Candidate: C Xu
Full Text: PDF
GTID: 2518306524489644
Subject: Master of Engineering
Abstract/Summary:
As the foundation of many natural language processing tasks, the semantic representation of words has become a research hotspot in recent years. Most early research targeted alphabetic languages such as English and German. Chinese, however, is an ideographic language with its own unique characteristics. Some Chinese researchers have therefore used fine-grained features such as Chinese characters, radicals, and components to optimize Chinese semantic representation, making word representations more effective in Chinese natural language processing tasks. However, these existing Chinese semantic representation algorithms focus only on the surface features of words; they do not deeply explore the semantic connections between words or the finer-grained features within them. Furthermore, Chinese corpora contain many incorrect characters, a situation the existing methods do not consider. This noise greatly limits the quality of word embeddings and ultimately degrades downstream tasks. The main content and innovations of this thesis fall into three aspects.

Firstly, existing methods attend only to the local context within a fixed window around a word when learning word representations, ignoring global context outside the window. This thesis proposes, for the first time, the concepts of semantic neighbors and soft sampling, and on that basis a method named Global Semantic Neighbor (GSN). GSN uses global co-occurrence information and character binding to construct a global semantic neighbor graph, and then learns the semantic relationships between word pairs in the graph through soft sampling, improving the quality of the resulting representations.

Secondly, the original fine-grained features cannot effectively capture word semantics. This thesis proposes a method that uses the n-gram information of word components to better represent the character information within words. This
method extracts component n-grams as new features; learning the word's original features jointly with the new feature sequence represents the semantic information of words more effectively.

Thirdly, to address incorrect characters in the corpus, which degrade the performance of word embeddings, this thesis proposes a method that constructs n-gram information from components and pinyin as additional features. Using these additional features alongside the original features during embedding learning reduces the impact of incorrect characters, making the word embeddings more efficient and robust.
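To make the first contribution concrete, the following is a minimal, illustrative sketch of building a global co-occurrence graph and soft-sampling a semantic neighbor from it. It is not the thesis's actual GSN implementation: the function names, the sentence-level co-occurrence counting, and the weighted sampling rule are all assumptions chosen for simplicity.

```python
import random
from collections import defaultdict
from itertools import combinations

def build_cooccurrence_graph(sentences):
    """Count sentence-level co-occurrences for every word pair,
    so words outside any fixed local window are still linked."""
    graph = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for w1, w2 in combinations(set(sent), 2):
            graph[w1][w2] += 1
            graph[w2][w1] += 1
    return graph

def soft_sample_neighbor(graph, word, rng=random):
    """Sample one semantic neighbor with probability proportional
    to its global co-occurrence weight (a soft, not top-1, choice)."""
    neighbors = graph.get(word)
    if not neighbors:
        return None
    words = list(neighbors)
    weights = [neighbors[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]
```

In this sketch, weighted sampling plays the role of "soft sampling": frequent global co-occurrences are favored without hard-thresholding the neighbor set.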
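The component and pinyin n-gram features of the second and third contributions can be sketched roughly as follows. The component decomposition table and pinyin lookup are hypothetical toy stand-ins for real resources, and the n-gram ranges are illustrative, not the thesis's actual settings.

```python
def char_ngrams(seq, n_min=1, n_max=3):
    """All n-grams of a feature sequence, for n in n_min..n_max."""
    out = []
    for n in range(n_min, min(n_max, len(seq)) + 1):
        for i in range(len(seq) - n + 1):
            out.append(tuple(seq[i:i + n]))
    return out

def word_features(word, components, pinyin):
    """Component n-grams plus pinyin n-grams as additional features.
    Pinyin stays stable when a character is mistyped as a homophone,
    which is what makes these features robust to incorrect characters."""
    comp_seq = [c for ch in word for c in components.get(ch, [ch])]
    pin_seq = [pinyin.get(ch, ch) for ch in word]
    return char_ngrams(comp_seq) + char_ngrams(pin_seq)
```

For example, with a toy table mapping 好 to the components 女 and 子 and the pinyin "hao", the word 好 yields the component n-grams (女), (子), (女, 子) plus the pinyin unigram (hao).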
Keywords/Search Tags: Chinese words, semantic representation, contextual information, components