
The Research On Measuring Text Similarity Based On Word Vector Enhanced Tree Kernel Model

Posted on: 2020-06-15  Degree: Master  Type: Thesis
Country: China  Candidate: L Zhu  Full Text: PDF
GTID: 2428330620951103  Subject: Computer Science and Technology
Abstract/Summary:
With the explosive growth of information, effectively mining useful information from large volumes of data has become an important problem. Text is an important carrier of information, and text processing and analysis has become one of the hotspots of data mining. Text similarity, which measures the degree of matching between a pair of texts, underlies most text-related tasks in natural language processing (NLP), such as information retrieval and question answering, so accurate text similarity calculation is of great significance for text processing. Most existing methods either focus on the semantic similarity of texts or weight and stack multiple types of similarity features (lexical, syntactic, semantic). Unlike these approaches, this thesis starts from the syntactic features of text, incorporates semantic information, and proposes a text similarity calculation method based on a word vector enhanced tree kernel model (VTK). The main work of this thesis is as follows.

First, to fuse the syntactic and semantic features of text, a text similarity calculation model (VTK) based on a word vector enhanced tree kernel is proposed. High-quality word vectors are constructed as the semantic knowledge resource of the method, and a syntactic tree is built for each text by parsing it. Building on the tree kernel method, which counts the common subtrees of two text trees, we define a new rule for subtree matching: the different types of nodes in the text tree (syntactic nodes and word nodes) are treated differently, and the word vector of a word node is used as its label. This integrates semantic information into the kernel and allows the syntactic-semantic features of the text to be extracted and matched automatically. The similarity score between two texts is then obtained from the matching scores between their features.

Second, word vectors of antonyms are inherently highly correlated, which degrades the performance of the algorithm. To address this, a WordNet-based antonym tag filtering method is proposed. We modify the VTK subtree matching rule to check whether a word pair stands in an antonymy relation according to WordNet and to mark word node pairs that do. Antonym filtering is then implemented by setting the feature score to zero whenever an antonym node pair is matched, which improves the accuracy of the similarity judgment.

Finally, this thesis conducts experiments on 19 datasets from a wide variety of sources, provided by the semantic textual similarity (STS) tasks from 2012 to 2015, and evaluates performance using the widely used metric for this task, the Pearson correlation coefficient. The proposed method achieves state-of-the-art performance compared with the baseline text similarity methods, and the experimental results show that it effectively improves the accuracy of sentence similarity judgment. This also indicates that combining syntactic and semantic features is a good choice for modeling short text similarity.
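The matching rule described in the abstract can be illustrated with a small sketch. It is not the thesis implementation: it assumes a Collins-Duffy style subset tree kernel, cosine similarity (clipped at zero) as the soft match between word nodes, and a decay factor `lam`. The toy `VECTORS` table and the `ANTONYMS` set are hypothetical stand-ins for trained word vectors and a WordNet antonym lookup.

```python
import math

# Hypothetical toy embeddings; a real system would load trained word vectors.
VECTORS = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.35],
    "bad": [0.85, 0.15, -0.2],
    "movie": [0.1, 0.9, 0.2],
    "film": [0.15, 0.85, 0.25],
}
# Hypothetical stand-in for a WordNet antonym lookup.
ANTONYMS = {frozenset(("good", "bad"))}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_sim(w1, w2, filter_antonyms=True):
    """Soft match between two word nodes, in [0, 1]."""
    if w1 == w2:
        return 1.0
    if filter_antonyms and frozenset((w1, w2)) in ANTONYMS:
        return 0.0  # antonym pair: feature score forced to zero
    if w1 not in VECTORS or w2 not in VECTORS:
        return 0.0
    return max(0.0, cosine(VECTORS[w1], VECTORS[w2]))


# A tree node is (label, children). A preterminal has a word string as its
# second element; any other node has a tuple of child nodes.
def is_preterminal(node):
    return isinstance(node[1], str)


def nodes(tree):
    yield tree
    if not is_preterminal(tree):
        for child in tree[1]:
            yield from nodes(child)


def delta(n1, n2, lam, filter_antonyms):
    """Weighted count of common subtrees rooted at n1 and n2."""
    if n1[0] != n2[0] or is_preterminal(n1) != is_preterminal(n2):
        return 0.0
    if is_preterminal(n1):
        # Word nodes are compared by vector similarity, not string equality.
        return lam * word_sim(n1[1], n2[1], filter_antonyms)
    # Syntactic nodes must have the same sequence of child labels.
    if [c[0] for c in n1[1]] != [c[0] for c in n2[1]]:
        return 0.0
    score = lam
    for c1, c2 in zip(n1[1], n2[1]):
        score *= 1.0 + delta(c1, c2, lam, filter_antonyms)
    return score


def tree_kernel(t1, t2, lam=0.5, filter_antonyms=True):
    return sum(delta(a, b, lam, filter_antonyms)
               for a in nodes(t1) for b in nodes(t2))


def similarity(t1, t2, lam=0.5, filter_antonyms=True):
    """Kernel value normalized by the self-kernels of both trees."""
    k11 = tree_kernel(t1, t1, lam, filter_antonyms)
    k22 = tree_kernel(t2, t2, lam, filter_antonyms)
    if k11 == 0.0 or k22 == 0.0:
        return 0.0
    return tree_kernel(t1, t2, lam, filter_antonyms) / math.sqrt(k11 * k22)


t_good = ("NP", (("JJ", "good"), ("NN", "movie")))
t_great = ("NP", (("JJ", "great"), ("NN", "film")))
t_bad = ("NP", (("JJ", "bad"), ("NN", "film")))

print(similarity(t_good, t_great))
print(similarity(t_good, t_bad))
print(similarity(t_good, t_bad, filter_antonyms=False))
```

In this toy setup the vectors for "good" and "bad" are deliberately close, mirroring the antonym correlation problem the abstract describes. Without the filter the antonym pair scores almost as high as the synonym pair; with the filter its score drops.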
Keywords/Search Tags: natural language processing, text similarity, syntactic analysis, word vector, tree kernel