Research On Text Similarity Algorithm Based On WMD Distance

Posted on:2020-01-03

Degree:Master

Type:Thesis

Country:China

Candidate:X X Xu

GTID:2428330596486196

Subject:Electronics and Communications Engineering

Abstract/Summary:

With the rapid rise of artificial intelligence, the resulting mass of text data places higher demands on natural language processing. Text similarity is a basic task in natural language processing and is widely used in search engines, QA systems, machine translation, text classification, spelling correction and other fields. Traditional text representation relies on the vector space model to carry semantic information. This approach ignores the order of feature words and the contextual semantics, and it suffers from high-dimensional sparsity and low computational efficiency. The WMD (Word Mover's Distance) algorithm uses the semantic information in Word2vec embeddings to capture semantic co-occurrence with high accuracy, and it can mine the semantic correlation between independent words. The work in this thesis is therefore based on the WMD distance algorithm. Two improved algorithms are proposed, one to fully mine the valuable feature items in a text and one to incorporate the linguistic knowledge in a knowledge dictionary and in dependency parsing. The main work is as follows:

1. The WMD distance algorithm weights words by simple word frequency when extracting text features and therefore cannot use semantic information effectively. To address this, the thesis proposes the WMD-JCS (Word Mover's Distance-Joint Character and Sentence) algorithm. First, the improved algorithm replaces the original word-frequency weight with new text features, namely TF-IDF coefficients, part of speech and word position, and adds these features to the algorithm through suitable mathematical formulas. Second, the trained word vectors are used to construct sentence vectors in an unsupervised way, so that the semantic context is taken into account. Finally, keywords are selected by filtering, and both the keyword vectors and the sentence vectors enter the improved distance formula. Experiments show that the improved algorithm effectively improves the accuracy of text similarity compared with the WMD distance algorithm.

2. Building on the WMD-JCS algorithm, a second improved algorithm, WMD-WSA (Word Mover's Distance-Word Sense Analysis), is proposed. Deep learning has poor semantic interpretability, and the WMD-JCS algorithm cannot fuse deep semantic relevance information, so WMD-WSA is based on HowNet and dependency parsing. It first mines the semantic information of the vocabulary from a linguistic perspective to calculate the similarity between words and between sentences. The similarity is then transformed into the transfer cost between words and between sentences, and the distance formula is improved accordingly. Experiments show that the algorithm achieves higher accuracy, recall and F1 value, and further improves the accuracy of text similarity calculation.
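To make the role of word weights and transfer costs concrete, here is a minimal Python sketch. It is not the thesis's WMD-JCS or WMD-WSA formula, which the abstract does not give. It computes the relaxed WMD lower bound, in which each word sends all of its weight to its cheapest counterpart in the other document, rather than solving the full transport problem. The 2-D vectors, the example weights and all function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical 2-D "word vectors" for illustration only; real Word2vec
# vectors are learned from a corpus and have many more dimensions.
VECTORS = {
    "cat": (0.0, 0.0),
    "kitten": (0.0, 1.0),
    "car": (3.0, 5.0),
}


def embedding_cost(word_a, word_b):
    """Transfer cost between two words: Euclidean distance of their vectors."""
    u, v = VECTORS[word_a], VECTORS[word_b]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def normalize(weights):
    """Scale word weights so that they sum to 1 (the mass to be moved)."""
    total = sum(weights.values())
    return {word: value / total for word, value in weights.items()}


def one_sided_cost(source, target, cost):
    """Each source word moves all of its mass to its cheapest target word."""
    return sum(
        mass * min(cost(word, other) for other in target)
        for word, mass in source.items()
    )


def relaxed_wmd(weights_a, weights_b, cost=embedding_cost):
    """Relaxed WMD: the larger of the two one-sided relaxations.

    This is a lower bound on the exact WMD, not the exact distance.
    `weights_a` and `weights_b` map words to non-negative weights, for
    example raw word counts (as in plain WMD) or TF-IDF-style scores.
    """
    a, b = normalize(weights_a), normalize(weights_b)
    return max(one_sided_cost(a, b, cost), one_sided_cost(b, a, cost))


# Assumed example weights: "cat" counts three times as much as "car".
doc_a = {"cat": 3.0, "car": 1.0}
doc_b = {"kitten": 1.0}
print(relaxed_wmd(doc_a, doc_b))
```

Because the weights and the cost function are both parameters, the sketch shows where the two kinds of change described in the abstract would act. A different weighting scheme changes `weights_a` and `weights_b`. A similarity-derived transfer cost replaces `embedding_cost`, for instance a function returning `1 - similarity`, which is an assumed mapping and not one stated in the source.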

Related items

1	Chinese Verb Metaphor Recognition And Application Based On Semantic Knowledge
2	Research On Text Representation Model And Similarity Calculation Algorithm
3	Research Of Comprehensive Weighted Word Semantic Similarity Computation
4	Research And Implement On Chinese Dependency Parsing
5	Research And Implementation Of Text Similarity Computing Based On HowNet Sememe Space
6	Research Of Chinese Word Sense Disambiguation Based On Hownet
7	Word Sense Disambiguation Research Based On Dependency Parsing
8	Research On Chinese Text Similarity Computing Based On Semantic Weighted
9	The Research And Application Of Unsupervised And Supervised Short Text Similarity Measure
10	Research On Mongolian Dependency Parsing Based On The Conversion Of Chinese-Mongolian Dependency Parsing Tree