Research On Semantic Textual Similarity Model Based On Conceptual Information Content

Posted on: 2019-11-27
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Wu
GTID: 1488306470993499
Subject: Computer application technology
Abstract/Summary:
Given two sentences or snippets of text, semantic textual similarity (STS) denotes the degree of their semantic equivalence. STS benefits a wide range of text-related applications in natural language processing and has both research significance and application value. Existing STS methods are mainly based on word surface features. However, conceptual associations among words are ubiquitous, and the lack of concept-level computation makes it difficult to improve the accuracy of these methods. No valid model yet exists for exact concept-level computation over full texts. In this thesis, building on the conceptual information content (IC) of single nouns proposed by Resnik, we propose several models and methods for calculating semantic similarity at the conceptual level quickly and accurately. The main work and contributions of this dissertation are summarized as follows:

(1) A basic STS model based on conceptual IC. To address the limitation that existing STS methods rely on word surface features, this thesis proposes a concept-level STS model that measures semantic similarity as the proportion of the common IC of two sentences in their total IC. We construct a concept space from the noun concepts in WordNet and their IS-A relations, define the common IC and the total IC of multiple concepts, and apply the inclusion-exclusion principle from combinatorial mathematics to calculate the textual IC. For the first time, the conceptual IC of a single word is extended to that of texts. Experimental results on Li et al.'s dataset show that the proposed model, which uses only the nouns in the texts, outperforms traditional unsupervised models.

(2) A computational method for textual IC based on conceptual information gain. Calculating the textual IC with the inclusion-exclusion principle has prohibitively high time complexity. To address this, the thesis proposes a computational method based on conceptual information gain. The method exploits the characteristics of the concept space and works incrementally: the textual IC is obtained by accumulating the conceptual information gain of each newly added concept, which avoids the repeated computation inherent in the inclusion-exclusion method. A system of theorems is established to derive the formula based on conceptual information gain, and corresponding efficient algorithms are then designed. Algorithm analysis shows that the time complexity drops from higher than O(2^n) to O(n^2). The experimental results also confirm the superior performance of the algorithms, indicating that they can quickly calculate the textual IC of long texts.

(3) A computational method for full textual IC with informational weight integration. In the basic model only nouns take part in the IC calculation, so textual information is lost. To address this and improve computational accuracy in STS, the thesis proposes a computational method for full textual IC with informational weight integration. The model is improved in three respects: 1) derivational links in WordNet are used to associate verbs, adjectives, and adverbs with their corresponding noun concepts, so that all content words are mapped to noun concepts; 2) the conceptual information gain of out-of-vocabulary named entities (OOV NEs) is predicted from the tendency learned from known NEs, which prevents information loss caused by OOV NEs; 3) informational weights are integrated into the conceptual IC so that the model fits human subjective judgments of semantic similarity more accurately. Compared with the basic model without these three improvements, the comprehensive model is significantly more accurate. On the datasets of the SemEval 2013-2016 STS tasks, the comprehensive model outperforms the reported results of the state-of-the-art systems of each year. In the SemEval 2017 STS task, our team, using the comprehensive model, ranked 2nd among all competing teams and 1st on the Track 1 dataset (34 participating teams submitted 81 evaluation systems). The paper describing the comprehensive model was recognized by SemEval as "Best of SemEval 2017".

In summary, this thesis presents an in-depth study of STS models based on conceptual information content. Through extensive experiments and proofs, it provides a concept-level calculation model for STS with marked improvements in both accuracy and computational performance.
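The two ways of computing textual IC described in the abstract can be illustrated with a small sketch. It is not the thesis's implementation. It assumes a hypothetical toy concept space that is a strict tree (WordNet itself allows multiple inheritance), hand-picked IC values that grow from root to leaf, and a common IC defined as the largest IC among the shared subsumers of a set of concepts. The names PARENT, IC, common_ic, total_ic_inclusion_exclusion, total_ic_incremental, and similarity are all made up for this illustration, and the similarity function is only one reading of "the proportion of the common IC of two sentences in the total IC".

```python
from itertools import combinations

# Hypothetical toy taxonomy (child -> parent); a strict tree by assumption.
PARENT = {"entity": None, "animal": "entity", "artifact": "entity",
          "dog": "animal", "cat": "animal", "car": "artifact"}
# Hand-picked IC values (assumption); the root has IC 0, as with -log P = 0.
IC = {"entity": 0.0, "animal": 1.0, "artifact": 1.5,
      "dog": 3.0, "cat": 3.5, "car": 4.0}


def subsumers(concept):
    """Return the concept together with all of its IS-A ancestors."""
    found = set()
    while concept is not None:
        found.add(concept)
        concept = PARENT[concept]
    return found


def common_ic(concepts):
    """IC shared by a group of concepts: the max IC over common subsumers."""
    shared = set.intersection(*(subsumers(c) for c in concepts))
    return max(IC[c] for c in shared)


def total_ic_inclusion_exclusion(concepts):
    """Textual IC by inclusion-exclusion; visits every non-empty subset."""
    concepts = list(dict.fromkeys(concepts))  # drop duplicates, keep order
    total = 0.0
    for k in range(1, len(concepts) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for subset in combinations(concepts, k):
            total += sign * common_ic(subset)
    return total


def total_ic_incremental(concepts):
    """Textual IC by accumulating the gain of each newly added concept.

    In a tree, the part of a new concept already covered is its largest
    pairwise common IC with any earlier concept, so the loop is O(n^2).
    """
    seen, total = [], 0.0
    for c in concepts:
        covered = max((common_ic((c, s)) for s in seen), default=0.0)
        total += IC[c] - covered
        seen.append(c)
    return total


def similarity(nouns_a, nouns_b):
    """Common IC of two texts divided by their total IC (one possible reading)."""
    ic_a = total_ic_incremental(nouns_a)
    ic_b = total_ic_incremental(nouns_b)
    ic_union = total_ic_incremental(list(nouns_a) + list(nouns_b))
    return (ic_a + ic_b - ic_union) / ic_union


if __name__ == "__main__":
    nouns = ["dog", "cat", "car"]
    print(total_ic_inclusion_exclusion(nouns), total_ic_incremental(nouns))
    print(similarity(["dog", "car"], ["cat", "car"]))
```

On this toy tree both functions return the same total, but the first enumerates all 2^n - 1 non-empty subsets while the second makes only pairwise comparisons, which mirrors the complexity reduction the abstract reports. Whether the same shortcut holds unchanged on the full WordNet graph depends on the theorems in the thesis, which the abstract does not spell out.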
Keywords/Search Tags: semantic textual similarity, conceptual information content, inclusion-exclusion principle, conceptual information gain, WordNet, out-of-vocabulary named entity, conceptual information weight