
Research On Document Distance Calculation Based On Word Embedding And Its Application

Posted on: 2018-10-13
Degree: Master
Type: Thesis
Country: China
Candidate: J Yan
Full Text: PDF
GTID: 2348330518487200
Subject: Computer application technology
Abstract/Summary:
The Internet has become part of everyday life and makes it very convenient for people to access information for living, working, and studying. At the same time, a large amount of text data is produced as people browse the Internet. How can important information be extracted effectively and promptly from this massive and heterogeneous text data? This problem calls for techniques from Natural Language Processing (NLP), and document distance is the cornerstone of many of them.

Document distance (or document similarity) has always been a research hotspot in NLP and plays an important, fundamental role in document-level NLP applications such as document classification and document clustering. Traditional models based on bag-of-words are simple and effective for calculating document distance, but they do not consider the latent semantic distance between different words, so they have significant limitations. Recently, a novel distance function, Word Mover's Distance (WMD), was proposed on top of word2vec; it integrates the semantic distance between different words. To some extent, WMD eases the shortcomings of traditional methods and calculates document distance more accurately, which in turn benefits the related downstream tasks. However, WMD uses only raw term frequency as a word's saliency in a document: it neither takes term specificity into account nor dampens large term frequencies. In addition, it does not substantially distinguish the similarity of the most similar term to a given term from that of the next most similar terms. Motivated by these issues, two aspects of document distance are explored, and the main contributions of this thesis are as follows.

First, we present two improved methods based on WMD for calculating document distance. The first weights words by their specificity and then applies a logarithm to dampen TF and TF-IDF, respectively. The second uses the Sigmoid function to transform the distance between word vectors in order to achieve better discrimination. Experiments on five SemEval data sets show that both proposed methods improve over WMD in terms of correlation coefficient.

Second, we apply the proposed methods to document classification and cross-media information linking. In document classification, we integrate our improved document distance methods into the KNN classification algorithm. Experiments conducted on eight public data sets show that, compared with WMD, our proposed methods reduce the test error rate on almost all data sets. In cross-media information linking, we propose a linear regression model combining the improved document distance with the posting time of social media. A direct comparison on the authoritative data set also indicates that our proposed linking models are at least comparable to the state-of-the-art approach, Weighted Textual Matrix Factorization (WTMF), in terms of the evaluation measure ATOP.

In a word, the experimental results above show that our proposed document distance methods are feasible and effective in real applications.
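The two adjustments described above (log-dampened TF/TF-IDF word weights and a sigmoid transform of word-vector distances) can be illustrated with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the thesis's implementation: it assumes pretrained word vectors and IDF values are supplied as plain dictionaries, the sigmoid parameters `alpha` and `shift` are hypothetical, and the full optimal-transport step of WMD is replaced by the cheaper relaxed-WMD lower bound so the example stays self-contained.

```python
# Sketch of a WMD-style document distance with (1) log-dampened TF-IDF word
# weights and (2) sigmoid-transformed word-vector distances.
# Assumptions: word vectors and IDF values are given as plain dicts; the
# relaxed-WMD lower bound stands in for the full optimal-transport solve.
import numpy as np


def log_tfidf_weights(tokens, idf):
    """Log-dampened TF-IDF weights, normalized to sum to 1 (the word 'mass')."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    raw = {t: (1.0 + np.log(c)) * idf.get(t, 1.0) for t, c in counts.items()}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}


def sigmoid_distance(u, v, alpha=1.0, shift=0.0):
    """Euclidean distance between word vectors passed through a sigmoid,
    so that nearby and merely somewhat-near words are better separated."""
    d = np.linalg.norm(u - v)
    return 1.0 / (1.0 + np.exp(-alpha * (d - shift)))


def relaxed_wmd(tokens_a, tokens_b, vectors, idf):
    """Relaxed WMD: each word's mass flows to its nearest word in the other
    document; the max over both directions lower-bounds the true WMD."""
    wa = log_tfidf_weights([t for t in tokens_a if t in vectors], idf)
    wb = log_tfidf_weights([t for t in tokens_b if t in vectors], idf)

    def one_direction(src, dst):
        cost = 0.0
        for t, mass in src.items():
            cost += mass * min(sigmoid_distance(vectors[t], vectors[u]) for u in dst)
        return cost

    return max(one_direction(wa, wb), one_direction(wb, wa))


if __name__ == "__main__":
    # Toy vocabulary with random vectors, purely for demonstration.
    rng = np.random.default_rng(0)
    vocab = ["obama", "president", "press", "media", "banana"]
    vectors = {w: rng.normal(size=50) for w in vocab}
    idf = {w: 1.0 for w in vocab}
    d = relaxed_wmd(["obama", "press"], ["president", "media"], vectors, idf)
    print(f"document distance: {d:.4f}")
```

In an application such as KNN document classification, a distance of this kind would simply replace the default metric when ranking training documents against a test document.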
Keywords/Search Tags: Word Embedding, Document Distance, WMD, Text Classification