Research On Text Representation Model And Similarity Calculation Algorithm

Posted on: 2021-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: J Jiang
Full Text: PDF
GTID: 2428330611470907
Subject: Computer application technology
Abstract/Summary:
Text representation and text similarity computation are among the most important tasks in natural language processing, and they provide technical support for subsequent tasks. This thesis studies sentence embedding models and text similarity computation algorithms. The main contents are as follows:

1. To address the insufficient semantic information in sentence embeddings, a sentence representation model based on feature contribution is proposed. Before computing the sentence embedding, the model applies an improved information gain formula that combines intra-class and inter-class word frequency to construct a feature contribution factor. This factor is used to remove feature words that contribute little to the task, yielding a sentence embedding that carries more accurate information. Experimental results show that the proposed sentence embedding model achieves higher accuracy on two basic tasks, text classification and text similarity calculation, which verifies the effectiveness of the model.

2. Most similarity algorithms consider either the semantic information or the structural information of the text, but not both. The thesis proposes a multi-model weighted fusion method for text similarity computation, which aims to improve accuracy by combining the advantages of multiple similarity algorithms. First, building on the Word Mover's Distance algorithm, the thesis constructs multi-feature fusion weights to further mine the semantic and contextual information of the text, and proposes a text similarity algorithm based on these weights. Second, the hierarchical IIG-SIF similarity algorithm is used to exploit the spatial structure information in the text. Finally, a linear-weighted model combines the two similarity results, which improves the accuracy of the text similarity algorithm. Controlled experiments show that the fused algorithm improves on both the Word Mover's Distance algorithm and the IIGSIFSim algorithm, and outperforms the classic algorithms. The method extracts the semantic information of the text, captures the relationship between word order and spatial structure, and improves the accuracy of text similarity computation.
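
The linear-weighted fusion step described in the abstract can be illustrated as a convex combination of two similarity scores. This is only a minimal sketch under stated assumptions, not the thesis's implementation: the function name `fuse_similarity`, the weight `alpha`, and the example scores are hypothetical, and the abstract does not say how the weight is chosen or how each score is normalized.

```python
def fuse_similarity(sim_semantic: float, sim_structural: float,
                    alpha: float = 0.5) -> float:
    """Linearly combine two similarity scores with weight alpha.

    sim_semantic:   score from a semantic method (assumed here to stand in
                    for a Word Mover's Distance based similarity).
    sim_structural: score from a structural method (assumed here to stand in
                    for a sentence-embedding based similarity).
    alpha:          hypothetical mixing weight in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Convex combination: alpha weights the first score, (1 - alpha) the second.
    return alpha * sim_semantic + (1.0 - alpha) * sim_structural


# Hypothetical scores for a single sentence pair.
fused = fuse_similarity(0.8, 0.6, alpha=0.25)
print(fused)
```

Because the weights sum to one, the fused score stays within the range spanned by the two inputs; in practice the weight would presumably be tuned on held-out data.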
Keywords/Search Tags: Text Similarity, Sentence Embedding, Multi-featured Fusion, Word Mover's Distance, Multi-modeled Fusion