
The Study Of Measures And Applications Of Short Text Semantic Similarity

Posted on: 2015-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: T T Zhu
GTID: 2268330431959087
Subject: Computer application technology
Abstract/Summary:
Text semantic similarity measures the degree of semantic equivalence between two texts. It plays an important role in natural language processing (NLP) and underlies many downstream applications. Previous research has proposed many kinds of similarity features and has shown that combining several kinds of features gives better results than using a single kind. The main work of this thesis is therefore to propose and combine more diverse similarity features, which are expected to capture more complete text information and so improve the performance of short text similarity models.
We first present a sentence-level short text similarity model that combines seven kinds of similarity features: string features, knowledge-based features, corpus-based features, syntactic features, machine translation based features, multi-level text features, and other features. This feature set is also the most complete one currently available. A supervised regression algorithm is then used to build the model. The experimental results show that combining diverse similarity features improves the performance of the short text similarity model.
Previous work has seldom focused on cross-level semantic similarity. The second contribution of this thesis is to extend short text similarity from the sentence level to the cross-level setting, using a recently released benchmark dataset for cross-level text similarity. Specifically, we build four similarity models for four cross-level pairings: paragraph-sentence, sentence-phrase, phrase-word, and word-sense. The experimental results on the corresponding datasets show that model performance decreases as the texts become shorter, from long texts down to single words. A possible reason is that the more information a text contains, the better the model performs, and vice versa. To address the lack of information in phrases and words, we propose a new method that extends their information with the aid of WordNet. The experimental results show that the proposed information extension method improves performance.
To validate the proposed short text similarity model, we applied it to two NLP tasks: paraphrase recognition and textual entailment. The experimental results on paraphrase recognition are good, which indicates that the model is suitable for this task. The results on textual entailment are much worse than expected, but they can still serve as a baseline for the textual entailment task.
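The idea of combining several similarity features and fitting a supervised regression model can be illustrated with a minimal sketch. This is not the thesis's implementation: the three string features, the gradient-descent linear regressor, the function names, and the toy training pairs with their scores are all assumptions chosen only to keep the example self-contained. The model described in the abstract uses seven feature families and an unspecified regression algorithm.

```python
from typing import List, Sequence


def word_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the lower-cased word sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def char_ngram_dice(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over the character n-gram sets of two texts."""
    def grams(s: str) -> set:
        s = s.lower()
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    ga, gb = grams(a), grams(b)
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def length_ratio(a: str, b: str) -> float:
    """Ratio of the shorter to the longer text, measured in words."""
    la, lb = len(a.split()), len(b.split())
    if max(la, lb) == 0:
        return 1.0
    return min(la, lb) / max(la, lb)


def extract_features(a: str, b: str) -> List[float]:
    """Concatenate the individual similarity features into one vector."""
    return [word_jaccard(a, b), char_ngram_dice(a, b), length_ratio(a, b)]


def predict(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear model: the last entry of w is the bias term."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w[-1]


def fit_linear_regression(X: Sequence[Sequence[float]], y: Sequence[float],
                          lr: float = 0.1, epochs: int = 2000) -> List[float]:
    """Fit weights and bias by batch gradient descent on squared error."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


# Toy sentence pairs with made-up similarity scores in [0, 1]
# (illustrative only, not data from the thesis).
TOY_PAIRS = [
    ("a man is playing a guitar", "a man plays a guitar", 0.9),
    ("a dog runs in the park", "a dog is running in a park", 0.8),
    ("the stock market fell today", "a child is eating an apple", 0.0),
    ("two women are cooking", "the sky is blue", 0.1),
]


def train_toy_model() -> List[float]:
    """Extract features for every toy pair and fit the regressor."""
    X = [extract_features(a, b) for a, b, _ in TOY_PAIRS]
    y = [score for _, _, score in TOY_PAIRS]
    return fit_linear_regression(X, y)
```

Calling `train_toy_model()` returns a weight vector, and `predict(weights, extract_features(text_a, text_b))` then scores a new pair. Adding another feature family, such as a knowledge-based or corpus-based measure, only requires appending its value to the vector returned by `extract_features`.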
Keywords/Search Tags: short text semantic similarity, cross level text similarity, similarity features, machine learning, regression algorithm