The Research On Measuring Text Similarity Based On Word Vector Enhanced Tree Kernel Model

Posted on:2020-06-15

Degree:Master

Type:Thesis

Country:China

Candidate:L Zhu

Full Text:PDF

GTID:2428330620951103

Subject:Computer Science and Technology

Abstract/Summary:

With the explosive growth of information, effectively mining useful information from large volumes of data has become an important problem. Text is an important carrier of information, and the processing and analysis of text has become one of the hotspots of data mining. Text similarity, which measures the degree of matching between a pair of texts, is the basis of most text-related tasks in natural language processing (NLP), such as information retrieval and question answering, so accurate text similarity calculation is of great significance for text processing. Most existing methods either focus on the semantic similarity of texts or weight and stack multiple types of similarity features (such as lexical, syntactic, and semantic features). Unlike these approaches, this thesis starts from the syntactic features of text, incorporates semantic information, and proposes a text similarity calculation method based on a word vector enhanced tree kernel model (VTK). The main work of this thesis is as follows:

First, to fuse the syntactic and semantic features of text, a text similarity calculation model based on a word vector enhanced tree kernel (VTK) is proposed. High-quality word vectors are constructed as the semantic knowledge resource of the method, and a syntactic tree is built for each text by parsing it. Building on the tree kernel method, which counts the common subtrees of two text trees, we define a new rule for subtree matching: the different types of nodes in the text tree (syntactic nodes and word nodes) are treated differently, and the word vector of a word node is used as its label. This integrates semantic information into the kernel and automatically extracts and matches the syntactic-semantic features of the texts. The similarity score between two texts is then obtained from the matching scores between these features.

Second, word vectors inherently assign high similarity to antonyms, which degrades the performance of the algorithm. To address this, a WordNet-based antonym tag filtering method is further proposed. We modify the VTK rules for subtree matching by adding a check for antonymy between word pairs, based on the antonym relations in the WordNet dictionary. Word node pairs that stand in an antonym relation are marked, and their feature score is set to zero during matching, which filters out antonyms and improves the accuracy of the similarity judgment.

Finally, this thesis conducts experiments on 19 datasets from a wide variety of sources, provided by the semantic textual similarity (STS) tasks from 2012 to 2015, and evaluates performance with a widely used metric, the Pearson correlation coefficient. The proposed method achieves state-of-the-art performance compared with the benchmark methods in text similarity, and the experimental results show that it effectively improves the accuracy of sentence similarity judgment. This also indicates that combining syntactic and semantic features is a good choice for modeling short text similarity.
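
The matching rule described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes a Collins-Duffy style subtree kernel with a decay factor, cosine similarity between word vectors as the soft match for word nodes, and a hand-written antonym set standing in for WordNet lookups. The vectors, the antonym pairs, the tree encoding, and all function names are hypothetical.

```python
import math
from itertools import product

# Toy 2-d word vectors (hypothetical values standing in for trained embeddings).
VECTORS = {
    "cat": (0.9, 0.1),
    "dog": (0.8, 0.3),
    "hot": (0.2, 0.9),
    "cold": (0.3, 0.8),
    "is": (0.5, 0.5),
}
# Hypothetical antonym pairs; the thesis takes these from WordNet instead.
ANTONYMS = {frozenset(("hot", "cold"))}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_sim(w1, w2):
    """Soft match for word nodes: 0 for antonyms, else clipped cosine."""
    if w1 == w2:
        return 1.0
    if frozenset((w1, w2)) in ANTONYMS:
        return 0.0  # antonym filtering: the feature score is set to zero
    if w1 in VECTORS and w2 in VECTORS:
        return max(0.0, cosine(VECTORS[w1], VECTORS[w2]))
    return 0.0


# A tree is (label, child, ...); a pre-terminal is (pos_tag, "word").
def is_preterminal(node):
    return len(node) == 2 and isinstance(node[1], str)


def nodes(tree):
    yield tree
    if not is_preterminal(tree):
        for child in tree[1:]:
            yield from nodes(child)


def production(node):
    # Label of the node followed by the labels of its children.
    return (node[0],) + tuple(child[0] for child in node[1:])


def common(n1, n2, decay):
    """Weighted count of common subtrees rooted at n1 and n2."""
    if is_preterminal(n1) or is_preterminal(n2):
        # Word nodes: same POS tag required, then compare word vectors.
        if is_preterminal(n1) and is_preterminal(n2) and n1[0] == n2[0]:
            return decay * word_sim(n1[1], n2[1])
        return 0.0
    # Syntactic nodes: productions must match exactly.
    if production(n1) != production(n2):
        return 0.0
    score = decay
    for c1, c2 in zip(n1[1:], n2[1:]):
        score *= 1.0 + common(c1, c2, decay)
    return score


def kernel(t1, t2, decay=0.5):
    return sum(common(a, b, decay) for a, b in product(nodes(t1), nodes(t2)))


def similarity(t1, t2, decay=0.5):
    """Kernel value normalized by the self-kernels of both trees."""
    denom = math.sqrt(kernel(t1, t1, decay) * kernel(t2, t2, decay))
    return kernel(t1, t2, decay) / denom if denom else 0.0


def sentence(subject, adjective):
    return ("S", ("NP", ("NN", subject)),
            ("VP", ("VB", "is"), ("JJ", adjective)))


if __name__ == "__main__":
    print(similarity(sentence("cat", "hot"), sentence("dog", "hot")))
    print(similarity(sentence("cat", "hot"), sentence("cat", "cold")))
```

In this toy setup the vectors for "hot" and "cold" are nearly parallel, so without the antonym check the second pair would score almost as high as an exact match. With the check, the adjective node contributes nothing and the second pair scores lower than the first.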

Keywords/Search Tags:

Natural Language Processing, text similarity, syntactic analysis, word vector, tree kernel

Related items

1	Research On Semantic Technologies In Natural Language Processing
2	Natural Language Processing-A Study Of Vectorization Of Chinese Words And Short Texts
3	Sentence Similarity Computing Based On Semantic Tree Kernel
4	On Subjective Test's Automatic Scoring System In The Field Of Rail Traffic Signal Based On Ontology And Syntactic Structure Analysis
5	Investigation And Applications Of Conversion Between Syntactic Trees In Natural Language Processing
6	Research On Key Techniques Of Cross-Language Text Similarity Detection Based On Word Vector
7	Research On Text Classification Based On Natural Language Processing And Machine Learning
8	Research And Implementation Of Intelligent QA Enhancement System For Vertical Domain
9	Research On Natural Language Watermarking Based On Syntactic Transformations
10	Emotion Analysis In Natural Language Processing Based On Eye Tracker