Research On Calculating Method Of Semantic Similarity Of Chinese Short Text Based On Glyph And Meaning

Posted on: 2021-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: P Y Zhang
GTID: 2428330611970877
Subject: Communication and Information System
Abstract/Summary:
Semantic similarity calculation for short text is one of the key technologies in natural language processing. Existing methods for Chinese have three problems. In terms of character embedding, Chinese character glyphs contain rich semantic information and dictionaries record the meaning of each character, yet existing methods integrate neither the semantic information of glyphs nor the meaning knowledge of dictionaries. In terms of short-text semantic similarity calculation, existing methods ignore word position information and the differing semantic contributions of words in a text, so the deep semantic information of the text cannot be learned well. In terms of computing performance, existing semantic similarity models require a large amount of computing resources, run slowly, and place high demands on computing equipment. In response to these problems, the following work was done:

(1) Data sets of Chinese character glyphs and Chinese character meanings were constructed. Using a web crawler, glyph images of 3,587 commonly used Chinese characters were obtained in scripts such as oracle bone script, Jinwen, Xiaozhuan, Lishu, Simplified Chinese, and Traditional Chinese. From the electronic database of the Xinhua Dictionary, 12,867 basic meanings of these 3,587 characters were extracted and processed. These data sets provide data support for character embedding.

(2) GnM2Vec, a character embedding method that combines glyph and meaning, was proposed. The thesis fuses character glyphs with the basic character meanings in the Xinhua Dictionary by constructing a glyph auto-encoder and a character-meaning auto-encoder. A 512-dimensional vector is obtained for each character and serves as the character embedding for the semantic similarity calculation model. The character vectors generated by GnM2Vec were evaluated in three experiments: neighboring character calculation, Chinese named entity recognition, and Chinese word segmentation. In the neighboring character calculation experiment, GnM2Vec gives better results than Word2Vec for both high-frequency and low-frequency characters, which improves the stability of the character vectors. In the Chinese named entity recognition experiment, the F1-score of GnM2Vec on the test set is 0.83% higher than that of Word2Vec. In the Chinese word segmentation experiment, the F1-score of GnM2Vec on the test set is 0.05% higher than that of Word2Vec.

(3) A semantic similarity calculation model for Chinese short text was constructed, based on character vectors and the Transformer. First, the character vectors generated by GnM2Vec represent the two input texts; then two identical Transformer networks capture the deep semantic information in the texts; finally, a series of operations such as multiplication, subtraction, and squaring yields the semantic similarity value of the two texts. The model is compared with a CNN-based model, an LSTM-based model, and an attention-based model. The results show that the F1-score of the proposed model is at least 3% and 1% higher than that of the other models on the transitive test set and the replaceable test set, respectively.

(4) Model compression and acceleration were realized. Knowledge distillation is used to compress and accelerate the Chinese short-text semantic similarity calculation model constructed in this thesis. The results show that the number of parameters of the compressed model is reduced by about 88.11%, the training speed is increased by about 86.82%, and the calculation speed is increased by about 82.38%, while the F1-score on the transitive test set and the replaceable test set is only 2% and 1% lower, respectively, than that of the original model.

(5) A knowledge question answering system for the medical field was implemented. The semantic similarity calculation model constructed in this thesis is used to provide automatic question answering over medical domain knowledge, and the question answering accuracy is evaluated. The results show that the accuracy of the model constructed in this thesis is 7% higher than that of the CNN-based model and 2% higher than that of the LSTM-based model and the attention-based model.
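
The final step of item (3), in which multiplication, subtraction, and squaring turn two encoded texts into one similarity value, can be illustrated with a minimal sketch. The abstract does not say exactly how these operations are combined or what layer follows them, so everything below is an assumption made for illustration: the function names `match_features` and `similarity_score`, the choice to concatenate the element-wise product with the squared element-wise difference, and the logistic output layer with hand-picked weights.

```python
import math
from typing import List, Sequence


def match_features(u: Sequence[float], v: Sequence[float]) -> List[float]:
    """Combine two text vectors into one feature vector.

    Assumed combination (not confirmed by the abstract): the element-wise
    product followed by the squared element-wise difference.
    """
    if len(u) != len(v):
        raise ValueError("both text vectors must have the same dimension")
    product = [a * b for a, b in zip(u, v)]
    squared_diff = [(a - b) ** 2 for a, b in zip(u, v)]
    return product + squared_diff


def similarity_score(
    u: Sequence[float],
    v: Sequence[float],
    weights: Sequence[float],
    bias: float = 0.0,
) -> float:
    """Map the combined features to a value in (0, 1).

    The logistic layer and its weights are hypothetical stand-ins for
    whatever trained output layer the thesis actually uses.
    """
    feats = match_features(u, v)
    if len(weights) != len(feats):
        raise ValueError("weights must have twice the vector dimension")
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    # Numerically stable logistic function for either sign of z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


if __name__ == "__main__":
    # Toy 2-dimensional "text vectors"; real ones would come from the encoders.
    u = [1.0, 2.0]
    v = [3.0, 0.5]
    w = [0.1, -0.2, 0.3, 0.05]  # hypothetical weights, not learned values
    print(match_features(u, v))
    print(similarity_score(u, v, w))
```

Both features are symmetric in their arguments, so under these assumptions the score does not depend on the order of the two texts. In a trained model the inputs would be the outputs of the two Transformer networks and the weights would be learned rather than chosen by hand.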
Keywords/Search Tags: Character Glyph, Character Meaning, Character Embedding, Semantic Similarity Calculation