
Semantic Enhancement Algorithm Based On Chinese Character Structure

Posted on: 2024-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Huang
Full Text: PDF
GTID: 2568307181453974
Subject: Computer application technology
Abstract/Summary:
Natural language processing enables e-commerce companies to guide shopping more effectively, social platforms to recommend news, government departments to monitor public opinion, and so on. Compared with languages such as English, Chinese text is semantically richer but also more obscure: both the number of characters and the size of the vocabulary exceed those of English, and the pictographic structure and pronunciation of Chinese characters reveal the semantics of the language from additional perspectives. Traditional deep learning models designed for English therefore do not fully capture the content of Chinese text, and their generalisation to Chinese needs improvement. Text classification is a fundamental problem in natural language processing, and studying the Chinese text classification task makes it possible to extract semantic information from Chinese text more effectively and to improve model performance. The main work of this thesis is as follows.

(1) Traditional Chinese text classification techniques pay insufficient attention to the structural and pronunciation characteristics of Chinese characters, and do not make good use of the semantic information carried by the characters themselves. To address this problem, this thesis proposes a multi-granularity text representation method that represents a sentence with three input vectors of different granularities. A Chinese word is mapped into stroke n-grams, and the word vectors are trained with a Skip-gram model to obtain the corresponding stroke vectors; a convolutional embedding of the pinyin is obtained by encoding the pinyin of the Chinese word.

(2) General neural networks offer no uniform, efficient treatment of high-frequency, low-information words such as "you" and "I". For these high-frequency, low-distinction words, this thesis designs a method to compute each word's influence, termed its distinction. Based on the distinction degree, the words with greater influence are identified and used in the computation of the attention mechanism. The feasibility of the proposed distinction-based attention mechanism is verified through comparison experiments against methods such as dot-product attention and self-attention.

(3) Traditional text classification models with pooling layers suffer information loss because of max-pooling and similar operations. This thesis proposes a capsule network model that incorporates the structure of Chinese characters, in which a capsule layer is used to associate text features, and the text vectors of different granularities are classified by the capsule layer.

To address these three problems, namely insufficient extraction of deep semantic information such as character structure and polyphonic characters, the influence of low-distinction words, and data loss in the pooling layer of the neural network, this thesis proposes a semantic enhancement algorithm that fuses the structural information of Chinese characters and validates it on a Chinese text classification task. In the experiments, the model outperforms traditional models such as TextCNN and RNN, and also outperforms models that incorporate Chinese character features, such as RAM and RAFG. The experimental results demonstrate that the proposed semantic enhancement algorithm understands Chinese text more effectively.
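The stroke-level granularity in (1) can be sketched as follows. This is a minimal illustration of mapping a word to stroke n-grams before Skip-gram training, in the style of stroke-based embeddings; the `STROKES` table and the digit encoding (1 horizontal, 2 vertical, 3 left-falling, 4 right-falling, 5 other) are illustrative assumptions, not the thesis's actual lookup data.

```python
# Illustrative stroke table: each character maps to a digit-encoded
# stroke sequence (1 horizontal, 2 vertical, 3 left-falling,
# 4 right-falling, 5 other). Entries here are examples only.
STROKES = {
    "大": "134",
    "人": "34",
}

def stroke_ngrams(word, n_min=3, n_max=5):
    """Concatenate the stroke codes of the word's characters, then
    slide windows of length n_min..n_max over the joined sequence."""
    seq = "".join(STROKES[ch] for ch in word)
    grams = []
    for n in range(n_min, min(n_max, len(seq)) + 1):
        for i in range(len(seq) - n + 1):
            grams.append(seq[i:i + n])
    return grams

print(stroke_ngrams("大人"))  # → ['134', '343', '434', '1343', '3434', '13434']
```

Each n-gram would then be treated as a token whose vector is learned by the Skip-gram model, and a word's stroke vector obtained by aggregating its n-gram vectors.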
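The distinction-weighted attention in (2) can be sketched as below. The abstract does not give the thesis's exact distinction formula, so inverse document frequency is used here as a stand-in measure of a word's influence: high-frequency, low-information words such as "you" and "I" receive small weights, which then scale the raw attention scores before softmax normalisation.

```python
import math
from collections import Counter

def distinction(corpus):
    """Approximate each word's distinction by inverse document
    frequency over a corpus of tokenised documents (a stand-in for
    the thesis's own measure)."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    n = len(corpus)
    return {w: math.log(n / df[w]) + 1.0 for w in df}

def weighted_attention(scores, words, dist):
    """Scale raw attention scores by each word's distinction, then
    normalise with a numerically stable softmax."""
    scaled = [s * dist.get(w, 1.0) for s, w in zip(scores, words)]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

docs = [["I", "like", "tea"], ["I", "like", "coffee"], ["tea", "time"]]
d = distinction(docs)
print(weighted_attention([1.0, 1.0, 1.0], ["I", "like", "time"], d))
```

With equal raw scores, the rarer word "time" ends up with a larger attention weight than the ubiquitous "I", which is the intended effect of down-weighting low-distinction words.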
Keywords/Search Tags: text representation, multi-granularity, attention, text classification