Research On Semantic-based Text Similarity Calculation Method

Posted on: 2019-08-20
Degree: Master
Type: Thesis
Country: China
Candidate: R N Li
Full Text: PDF
GTID: 2428330593450293
Subject: Software engineering
Abstract/Summary:
Text is one of the most common forms of interaction between people and the network. With the popularization of the Internet, people have produced vast amounts of data online, and extracting useful information from this data has become a research hotspot. Text similarity is a key technology in data mining. It also plays an important role in information retrieval, question answering systems, document checking, and document classification and clustering, so improving the accuracy of text similarity is of great significance.

This thesis studies the common text similarity calculation methods and summarizes their problems. The common methods fall mainly into literal distance measures and topic-level measures. Literal distance measures compare only the surface form of the text and therefore cannot capture the semantics of the two documents. Topic-level measures mainly include vector space models, latent semantic analysis models, and document topic generation models. Their main problems are that they do not consider the semantic information of words and cannot measure similarity between words. In addition, when the vector space model has too many feature items, it easily runs into sparse matrices, polysemy (one word with several senses), and synonymy (several words with one sense). The dimensionality reduction adopted is a purely mathematical transformation, so its results are hard to interpret. The document topic generation model suits long texts with rich semantic content and performs poorly on short sentences.

The research is divided into two parts: similarity between sentences and similarity between documents. For similarity between sentences, a neural network model for sentence modeling is designed, based on the suitability of LSTM neural networks for sequence data. Training the model lets it learn word order, grammar, and other information in the sentence. Each sentence is then represented by a vector, and the similarity of two sentences is measured by the cosine similarity between their vectors.

For similarity between documents, the sentence is taken as the unit of comparison, because it is the basic unit that expresses a complete meaning and also carries grammatical information. Specifically, a sentence-based document similarity calculation method is proposed, based on word2vec and the Earth Mover's Distance (EMD) algorithm. The method first treats the two documents as two distributions whose feature items are sentences, and then computes the similarity between the two distributions. This similarity is expressed as the minimum total weighted cost of "moving" all the sentences in one document to the other document, where "moving" means distributing the weight of a sentence over the sentences of the other document according to semantic distance. The sentence representations come from the LSTM-based neural network. The calculation is ultimately transformed into solving a linear programming problem with constraints.

Experiments on one sentence-level dataset and three document-level datasets show that, at both the sentence level and the document level, the accuracy of the proposed method is significantly higher than that of traditional methods.
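The sentence-level comparison in the abstract ends with the cosine similarity between two sentence vectors. A minimal sketch of that final step follows, assuming the sentence vectors have already been produced by some encoder. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not taken from the thesis.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 if either vector has zero length, which is a convention
    chosen for this sketch rather than something the thesis specifies.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sentence vectors; a real system would take these from the
# trained sentence model.
sentence_a = [0.2, 0.7, 0.1]
sentence_b = [0.25, 0.6, 0.15]
print(cosine_similarity(sentence_a, sentence_b))
```

Cosine similarity depends only on the direction of the vectors, not their length, so two sentence vectors that differ only by scale are treated as identical.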
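The document-level step in the abstract can be illustrated as a small transportation problem solved by linear programming. The sketch below assumes each document is given as sentence weights that sum to 1, and that a matrix of pairwise sentence distances is already available. The function name `document_emd`, the uniform weights, and the toy distance values are assumptions for illustration, and SciPy's `linprog` stands in for whatever solver the thesis used.

```python
import numpy as np
from scipy.optimize import linprog


def document_emd(weights_a, weights_b, dist):
    """Earth Mover's Distance between two documents.

    weights_a[i] is the weight of sentence i in document A, weights_b[j]
    the weight of sentence j in document B (both assumed to sum to 1),
    and dist[i][j] is the semantic distance between those two sentences.
    """
    weights_a = np.asarray(weights_a, dtype=float)
    weights_b = np.asarray(weights_b, dtype=float)
    dist = np.asarray(dist, dtype=float)
    n, m = dist.shape

    # Unknowns: flow f[i, j] >= 0, flattened row by row into n * m variables.
    a_eq = np.zeros((n + m, n * m))
    for i in range(n):
        # All weight of sentence i in A must be moved out.
        a_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        # Sentence j in B must receive exactly its own weight.
        a_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([weights_a, weights_b])

    # Minimise the total weighted moving cost sum(f[i, j] * dist[i, j]).
    result = linprog(dist.ravel(), A_eq=a_eq, b_eq=b_eq, bounds=(0, None))
    if not result.success:
        raise ValueError(result.message)
    return float(result.fun)


# Two documents with two sentences each and uniform sentence weights.
toy_dist = [[1.0, 2.0],
            [3.0, 1.0]]
print(document_emd([0.5, 0.5], [0.5, 0.5], toy_dist))  # approximately 1.0
```

In the toy example the cheapest plan moves each sentence to its nearest counterpart, giving a cost of 0.5 × 1 + 0.5 × 1. A smaller distance means more similar documents, so a similarity score would be some decreasing function of this value.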
Keywords/Search Tags: text similarity, LSTM, semantics, word2vec, linear programming