
Semantic Enhancement Algorithm Based On Chinese Character Structure

Posted on: 2024-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Huang
Full Text: PDF
GTID: 2568307181453974
Subject: Computer application technology
Abstract/Summary:
Natural language processing enables e-commerce companies to guide shopping more effectively, social platforms to recommend news, government departments to monitor public opinion, and so on. Compared with languages such as English, Chinese text is semantically richer but also more obscure: both the number of characters and the size of the vocabulary exceed those of English, and the pictographic structure and pronunciation of Chinese characters reveal the semantics of the language from additional perspectives. Traditional deep learning models designed for English therefore do not fully capture the content of Chinese text, and their generalisation to Chinese needs improvement. Text classification is a fundamental problem in natural language processing, and studying the Chinese text classification task makes it possible to extract semantic information from Chinese text more effectively and to improve model performance. The main work of this thesis is as follows.

(1) Traditional Chinese text classification techniques pay insufficient attention to the structural and pronunciation characteristics of Chinese characters, and do not make good use of the semantic information carried by the characters themselves. To address this problem, this thesis proposes a multi-granularity text representation method that represents a sentence with three input vectors of different granularities. A Chinese word is mapped into stroke n-grams, and the word vectors are trained with a Skip-gram model to obtain the corresponding stroke vectors; a convolutional embedding of the pinyin is obtained by encoding the pinyin of the Chinese word.

(2) General neural networks offer no uniform, efficient treatment of high-frequency, low-information words such as "you" and "I". For these high-frequency, low-distinction words, this thesis designs a method to compute each word's influence, termed its distinction. Based on the distinction degree, the words with greater influence are identified and used in the computation of the attention mechanism. The feasibility of the proposed distinction-based attention mechanism is verified through comparison experiments against methods such as dot-product attention and self-attention.

(3) Traditional text classification models with pooling layers suffer information loss because of max-pooling and similar operations. This thesis proposes a capsule network model that incorporates the structure of Chinese characters, in which a capsule layer is used to associate text features, and the text vectors of different granularities are classified by the capsule layer.

To address these three problems, namely insufficient extraction of deep semantic information such as character structure and polyphonic characters, the influence of low-distinction words, and data loss in the pooling layer of the neural network, this thesis proposes a semantic enhancement algorithm that fuses the structural information of Chinese characters and validates it on a Chinese text classification task. In the experiments, the model outperforms traditional models such as TextCNN and RNN, and also outperforms models that incorporate Chinese character features, such as RAM and RAFG. The experimental results demonstrate that the proposed semantic enhancement algorithm understands Chinese text more effectively.
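The stroke-level granularity in (1) can be sketched as follows. This is a minimal illustration of mapping a word to stroke n-grams before Skip-gram training, in the style of stroke-based embeddings; the `STROKES` table and the digit encoding (1 horizontal, 2 vertical, 3 left-falling, 4 right-falling, 5 other) are illustrative assumptions, not the thesis's actual lookup data.

```python
# Illustrative stroke table: each character maps to a digit-encoded
# stroke sequence (1 horizontal, 2 vertical, 3 left-falling,
# 4 right-falling, 5 other). Entries here are examples only.
STROKES = {
    "大": "134",
    "人": "34",
}

def stroke_ngrams(word, n_min=3, n_max=5):
    """Concatenate the stroke codes of the word's characters, then
    slide windows of length n_min..n_max over the joined sequence."""
    seq = "".join(STROKES[ch] for ch in word)
    grams = []
    for n in range(n_min, min(n_max, len(seq)) + 1):
        for i in range(len(seq) - n + 1):
            grams.append(seq[i:i + n])
    return grams

print(stroke_ngrams("大人"))  # → ['134', '343', '434', '1343', '3434', '13434']
```

Each n-gram would then be treated as a token whose vector is learned by the Skip-gram model, and a word's stroke vector obtained by aggregating its n-gram vectors.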
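The distinction-weighted attention in (2) can be sketched as below. The abstract does not give the thesis's exact distinction formula, so inverse document frequency is used here as a stand-in measure of a word's influence: high-frequency, low-information words such as "you" and "I" receive small weights, which then scale the raw attention scores before softmax normalisation.

```python
import math
from collections import Counter

def distinction(corpus):
    """Approximate each word's distinction by inverse document
    frequency over a corpus of tokenised documents (a stand-in for
    the thesis's own measure)."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    n = len(corpus)
    return {w: math.log(n / df[w]) + 1.0 for w in df}

def weighted_attention(scores, words, dist):
    """Scale raw attention scores by each word's distinction, then
    normalise with a numerically stable softmax."""
    scaled = [s * dist.get(w, 1.0) for s, w in zip(scores, words)]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

docs = [["I", "like", "tea"], ["I", "like", "coffee"], ["tea", "time"]]
d = distinction(docs)
print(weighted_attention([1.0, 1.0, 1.0], ["I", "like", "time"], d))
```

With equal raw scores, the rarer word "time" ends up with a larger attention weight than the ubiquitous "I", which is the intended effect of down-weighting low-distinction words.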
Keywords/Search Tags: text representation, multi-granularity, attention, text classification