
Research On Text Similarity Algorithm Based On WMD Distance

Posted on: 2020-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: X X Xu
Full Text: PDF
GTID: 2428330596486196
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the rapid rise of artificial intelligence, the accompanying mass of text data places higher demands on natural language processing. As a basic task in natural language processing, text similarity is widely used in search engines, QA systems, machine translation, text classification, spelling correction and other fields. Traditional text representation uses the vector space model to express semantic information. This approach does not take into account the order of feature words or contextual semantics, and it results in high-dimensional sparse representations and low computational efficiency. The WMD (Word Mover's Distance) algorithm uses the semantic information in Word2vec word vectors to capture semantic co-occurrence with high accuracy, and it can mine the semantic correlation between independent words. The work in this paper is therefore based on the WMD distance algorithm. Two improved algorithms are proposed to fully mine the valuable feature items in text and to incorporate the linguistic knowledge found in a knowledge dictionary and in dependency parsing. The main work of this paper is as follows:

1. The WMD distance algorithm extracts text features with simple word frequency weights and cannot use semantic information effectively. To address this, the paper proposes the WMD-JCS (Word Mover's Distance-Joint Character and Sentence) algorithm. First, the improved algorithm replaces the original word frequency weight with new text features, namely TF-IDF coefficients, part of speech and position in the text, and adds these features to the algorithm through mathematical formulas. Second, the trained word vectors are used to construct sentence vectors in an unsupervised way, so that the semantic context is taken into account. Lastly, key words are selected by filtering, and the keyword vectors and sentence vectors both enter the calculation of the improved distance formula. Experiments show that the improved algorithm effectively improves the accuracy of text similarity compared with the WMD distance algorithm.

2. Building on the WMD-JCS algorithm, a second improved algorithm, WMD-WSA (Word Mover's Distance-Word Sense Analysis), is proposed. Because deep learning has poor semantic interpretability and the WMD-JCS algorithm cannot fuse deep semantic relevance information, WMD-WSA is based on HowNet and dependency parsing. It first mines the semantic information of vocabulary from a linguistic perspective to calculate the similarity between words and sentences. This similarity is then transformed into the transfer cost between words and sentences, and the distance formula is improved accordingly. Experiments show that the algorithm achieves higher accuracy, recall and F1 value, and further improves the accuracy of text similarity calculation.
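To make the idea of swapping word frequency weights for other per-word weights concrete, here is a minimal sketch in Python. It is not the thesis's WMD-JCS or WMD-WSA formula, which the abstract does not give. It computes the relaxed WMD lower bound, in which each word sends all of its mass to its nearest word in the other document, rather than solving the exact transport problem. The function names, the 2-D toy vectors and the weight values are all hypothetical; in practice the vectors would come from a trained Word2vec model and the weights from a scheme such as TF-IDF.

```python
import math


def normalize(weights):
    """Scale non-negative word weights so they sum to 1 (the document's mass)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {word: value / total for word, value in weights.items()}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(weights_a, weights_b, vectors):
    """Relaxed Word Mover's Distance (a lower bound on exact WMD).

    weights_a, weights_b: dicts mapping word -> weight (e.g. TF-IDF scores;
    this is an assumption, any non-negative weighting works here).
    vectors: dict mapping word -> embedding tuple.
    """
    a = normalize(weights_a)
    b = normalize(weights_b)

    def one_way(src, dst):
        # Each source word moves all of its mass to the closest target word.
        return sum(
            mass * min(euclidean(vectors[w], vectors[x]) for x in dst)
            for w, mass in src.items()
        )

    # Taking the larger of the two one-sided relaxations gives the tighter bound.
    return max(one_way(a, b), one_way(b, a))


# Hypothetical 2-D embeddings, for illustration only.
TOY_VECTORS = {
    "king": (1.0, 0.0),
    "queen": (1.0, 1.0),
    "apple": (5.0, 0.0),
}

# Hypothetical weights: "king" is weighted three times as heavily as "apple".
doc_a = {"king": 3.0, "apple": 1.0}
doc_b = {"queen": 2.0}
print(relaxed_wmd(doc_a, doc_b, TOY_VECTORS))
```

Raising the weight of a word that lies close to the other document lowers the distance, and raising the weight of a distant word increases it, which is the lever that a weighting scheme richer than raw word frequency provides. An exact WMD would replace `one_way` with a solver for the underlying transport problem.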
Keywords/Search Tags: text similarity, word mover's distance, weighted coefficient, HowNet, dependency parsing