Research On Semantic Similarity Calculation Of Chinese Short Text

Posted on: 2020-05-08  Degree: Master  Type: Thesis
Country: China  Candidate: F Y Ma  Full Text: PDF
GTID: 2428330590459365  Subject: Communication and Information System
Abstract/Summary:
Semantic similarity calculation for Chinese short text plays an important role in natural language processing. Existing methods have two kinds of problems. At the level of word semantic representation, Chinese characters and words usually have multiple meanings, but the word vectors obtained by current methods cannot contain all the meanings of a word. At the level of the computational model, existing similarity calculation methods cannot capture the dependencies among the words in a text or the contribution of the text's internal structure to its semantics; moreover, they treat each word's vector as unique and cannot choose different word vectors according to context. To address these problems, the following work has been done:

(1) A data set of Chinese character information and a data set of vocabulary information are constructed. (a) Using a web crawler, the pronunciation, paraphrase, five strokes, five elements, basic definitions and detailed definitions of 20,902 Chinese characters in the Modern Chinese Dictionary are obtained. The data set of 3,587 common words contains 23,821 word meaning texts. (b) A web crawler is built to obtain 48,392 basic definitions and 32,708 examples of 56,008 commonly used words from Baidu Chinese. These data sets provide data support for word vector representation and word sense disambiguation.

(2) A character and word sense vector model and a character and word vector model are constructed. The semantic description information of the characters in the Modern Chinese Dictionary is used to obtain the character vectors. A character meaning vector model based on a fully connected autoencoder is constructed, and each character meaning text is mapped into a 256-dimensional character meaning vector. A character vector model based on a fully connected autoencoder is then constructed to further map the 64 character meaning vectors of each character into a 256-dimensional character vector, which provides an initialization vector for the semantic similarity calculation model. The same models also apply to words.

(3) A double-sequence model for Chinese short text semantic similarity calculation based on multi-head self-attention is constructed. Self-attention can take into account that different words in a text contribute differently to the semantics of that text. This thesis constructs a double-sequence model based on multi-head self-attention and compares it with an LSTM-based model and a CNN-based model. The variance and product of the two results are added to the intermediate result to magnify the difference and similarity of the two texts. The three models were tested, and the results show that the model based on multi-head self-attention is superior to the other two models in overall performance. On small data sets (26 data items), the F1 value of the multi-head self-attention model is 32% higher than that of the other two models.

(4) A Chinese short text semantic similarity calculation model based on word sense disambiguation is constructed. A word sense disambiguation model based on Seq2Seq is constructed to select word vectors dynamically according to context. On SemEval-2007 Task #5, the Seq2Seq-based word sense disambiguation model improved disambiguation accuracy by 11.48% compared with the best of the other four word sense disambiguation methods. Cosine similarity calculated with the disambiguated word vectors reaches an accuracy of 72.37%, which is 3.42% higher than the cosine similarity calculation method based on word frequency.

(5) The short text semantic similarity calculation method is evaluated, and an examination system supporting the automatic review of subjective questions is constructed. The multi-head self-attention double-sequence model constructed in this thesis is used to judge subjective questions, and 575 student answers are scored. The Pearson correlation coefficient between the scores obtained by the model and the real scores given by teachers is 0.6541, which is 0.2035 higher than that of the method based on word sense disambiguation.
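
The abstract relies on two standard measures: cosine similarity between text vectors, and the Pearson correlation coefficient between model scores and teacher scores. The following is a minimal sketch of both in plain Python, not the thesis's implementation. The function names, the averaging of word vectors into a text vector, and the toy numbers are all assumptions made for illustration; the abstract does not state how word vectors are pooled.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors.

    Averaging is an assumed pooling step, used only to get one vector per text.
    """
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical 2-dimensional word vectors for two short texts.
text_a = [[1.0, 0.0], [0.0, 1.0]]
text_b = [[1.0, 1.0]]
similarity = cosine_similarity(mean_vector(text_a), mean_vector(text_b))

# Hypothetical model scores and teacher scores for three answers.
model_scores = [1.0, 2.0, 3.0]
teacher_scores = [2.0, 4.0, 6.0]
correlation = pearson(model_scores, teacher_scores)
```

In this toy data the two averaged text vectors point in the same direction and the two score lists are linearly related, so both values come out close to 1. Real word vectors and real grading data would give intermediate values, such as the 0.6541 correlation reported in the abstract.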
Keywords/Search Tags: Chinese Short Text, Semantic Similarity Calculation, Double Sequence, Attention, Word Sense Disambiguation