
Research On The Semantic Representation And Learning Of Chinese Words

Posted on: 2022-05-22
Degree: Master
Type: Thesis
Country: China
Candidate: C Xu
Full Text: PDF
GTID: 2518306524489644
Subject: Master of Engineering
Abstract/Summary:
As the foundation of many natural language processing tasks, the semantic representation of words has become a research hotspot in recent years. Most early research targeted alphabetic languages such as English and German. Chinese, however, is an ideographic language with its own unique characteristics. Some Chinese researchers have therefore used fine-grained features such as Chinese characters, radicals, and components to optimize Chinese semantic representation, making word representations more effective in Chinese natural language processing tasks. However, these existing Chinese semantic representation algorithms focus only on the surface features of words; they do not deeply explore the semantic connections between words or the finer-grained features within them. Furthermore, Chinese corpora contain many incorrect characters, a situation the existing methods do not consider. This noise greatly limits the quality of word embeddings and ultimately degrades downstream tasks. The main content and innovations of this thesis fall into three aspects.

Firstly, existing methods attend only to the local context within a fixed window around a word when learning word representations, ignoring global context outside the window. This thesis proposes, for the first time, the concepts of semantic neighbors and soft sampling, and on that basis a method named Global Semantic Neighbor (GSN). GSN uses global co-occurrence information and character binding to construct a global semantic neighbor graph, and then learns the semantic relationships between word pairs in the graph through soft sampling, improving the quality of the resulting representations.

Secondly, the original fine-grained features cannot effectively capture word semantics. This thesis proposes a method that uses the n-gram information of word components to better represent the character information within words. This
method extracts component n-grams as new features; learning the word's original features jointly with the new feature sequence represents the semantic information of words more effectively.

Thirdly, to address incorrect characters in the corpus, which degrade the performance of word embeddings, this thesis proposes a method that constructs n-gram information from components and pinyin as additional features. Using these additional features alongside the original features during embedding learning reduces the impact of incorrect characters, making the word embeddings more efficient and robust.
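To make the first contribution concrete, the following is a minimal, illustrative sketch of building a global co-occurrence graph and soft-sampling a semantic neighbor from it. It is not the thesis's actual GSN implementation: the function names, the sentence-level co-occurrence counting, and the weighted sampling rule are all assumptions chosen for simplicity.

```python
import random
from collections import defaultdict
from itertools import combinations

def build_cooccurrence_graph(sentences):
    """Count sentence-level co-occurrences for every word pair,
    so words outside any fixed local window are still linked."""
    graph = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for w1, w2 in combinations(set(sent), 2):
            graph[w1][w2] += 1
            graph[w2][w1] += 1
    return graph

def soft_sample_neighbor(graph, word, rng=random):
    """Sample one semantic neighbor with probability proportional
    to its global co-occurrence weight (a soft, not top-1, choice)."""
    neighbors = graph.get(word)
    if not neighbors:
        return None
    words = list(neighbors)
    weights = [neighbors[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]
```

In this sketch, weighted sampling plays the role of "soft sampling": frequent global co-occurrences are favored without hard-thresholding the neighbor set.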
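The component and pinyin n-gram features of the second and third contributions can be sketched roughly as follows. The component decomposition table and pinyin lookup are hypothetical toy stand-ins for real resources, and the n-gram ranges are illustrative, not the thesis's actual settings.

```python
def char_ngrams(seq, n_min=1, n_max=3):
    """All n-grams of a feature sequence, for n in n_min..n_max."""
    out = []
    for n in range(n_min, min(n_max, len(seq)) + 1):
        for i in range(len(seq) - n + 1):
            out.append(tuple(seq[i:i + n]))
    return out

def word_features(word, components, pinyin):
    """Component n-grams plus pinyin n-grams as additional features.
    Pinyin stays stable when a character is mistyped as a homophone,
    which is what makes these features robust to incorrect characters."""
    comp_seq = [c for ch in word for c in components.get(ch, [ch])]
    pin_seq = [pinyin.get(ch, ch) for ch in word]
    return char_ngrams(comp_seq) + char_ngrams(pin_seq)
```

For example, with a toy table mapping 好 to the components 女 and 子 and the pinyin "hao", the word 好 yields the component n-grams (女), (子), (女, 子) plus the pinyin unigram (hao).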
Keywords/Search Tags: Chinese words, semantic representation, contextual information, components