A Representation Method Of Chinese Characters And Words Based On Word-Character Alignment

Posted on: 2018-04-24
Degree: Master
Type: Thesis
Country: China
Candidate: J Xu
Full Text: PDF
GTID: 2348330512982614
Subject: Computer application technology
Abstract/Summary:
Words are the smallest semantic unit used in text, so many natural language processing tasks have to deal with the problem of word representation. The most common way to represent a word is the one-hot representation. However, this method suffers from data sparsity and cannot capture semantic relationships among words. Word representation methods based on representation learning aim to encode the information of a word in a low-dimensional dense vector. Distributional word representation methods are the most common among them, and the word vectors they produce have achieved great success in many natural language processing problems. Motivated by word representation methods for English, some researchers have studied the representation of Chinese words. Recently, some researchers have demonstrated that the component characters of a Chinese word also provide rich semantic information. Joint learning models of characters and words have achieved a certain degree of success in some Chinese natural language processing tasks. However, they ignore how semantic similarity to the word varies across its component characters, which leads to undesirable performance in some Chinese natural language processing tasks. In this thesis, we learn the semantic contribution of characters to a word by exploiting the similarity between a word and its component characters, using semantic knowledge obtained from other languages. We propose a similarity-based method to learn Chinese word and character embeddings jointly. This method is also capable of distinguishing non-compositional Chinese words and disambiguating Chinese characters. The main work of this thesis can be summarized as follows:
(1) We propose a method to learn word and character vectors jointly, based on word-character similarity and according to the features of Chinese words. During training, we calculate the semantic contribution of each character to the word. The method models Chinese words better through the smoothing effect of Chinese characters and enriches the context information of a word.
(2) Unlike the traditional Chinese word disambiguation process based on context clustering, we propose a new method to disambiguate Chinese characters that utilizes translation resources. This method draws on these external resources and disambiguates Chinese characters in a way similar to K-means.
(3) Not all Chinese words are semantically compositional; examples include entity names and transliterated words. Based on the model, we propose a way to identify words that are not semantically compositional.
(4) Experimental results on different datasets and from different evaluation perspectives demonstrate the effectiveness of our methods.
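The idea in contributions (1) and (3) can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract says similarity is obtained from translation resources, whereas this sketch substitutes cosine similarity between existing vectors as a stand-in. The function names `cosine` and `compose`, the 0.2 threshold, and the equal mixing of the word vector with the character average are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compose(word_vec, char_vecs, threshold=0.2):
    """Return (composed vector, per-character weights, is_compositional).

    Assumption: a word counts as non-compositional when none of its
    characters reaches `threshold` similarity to the word; in that case
    the word vector is returned unchanged.
    """
    # Negative similarities are clipped to zero so they contribute nothing.
    sims = [max(0.0, cosine(word_vec, c)) for c in char_vecs]
    if not sims or max(sims) < threshold:
        return list(word_vec), [0.0] * len(sims), False

    # Normalise similarities into contribution weights that sum to 1.
    total = sum(sims)
    weights = [s / total for s in sims]

    # Similarity-weighted average of the character vectors.
    char_part = [
        sum(w * c[i] for w, c in zip(weights, char_vecs))
        for i in range(len(word_vec))
    ]

    # Assumption: mix the word vector and the character part equally.
    composed = [0.5 * (a + b) for a, b in zip(word_vec, char_part)]
    return composed, weights, True
```

For example, `compose([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])` gives all of the weight to the first character, while a word whose characters are all orthogonal to it is flagged as non-compositional and keeps its own vector.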
Keywords/Search Tags: natural language processing, representation learning, semantics, Chinese word embedding, Chinese character disambiguation