Research And Implementation Of Document Similarity Based On Word2vec

Posted on: 2017-12-06
Degree: Master
Type: Thesis
Country: China
Candidate: D J Wu
GTID: 2348330488974137
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, and especially with the spread of Web 3.0, more and more information is being encoded and stored online. To classify and search this growing volume of documents, we must store and index them efficiently, and classification in turn requires computing the similarity between documents. One option is to have people compare documents by hand, which yields reliable results. Given the number of documents and the rate at which it grows, however, human effort alone cannot keep up, so we need software that can do this job accurately and quickly. Document similarity underlies many applications, including document clustering, search engines, and paper similarity checking, so it plays an important role in all of them. Improving the accuracy of document similarity makes it easier to classify large amounts of information and brings machine judgments closer to human ones. The traditional algorithm performed well in the past, but it no longer does because of the growth of information. It has two disadvantages. First, it cannot recognize two words that have similar meanings but different spellings. Second, it treats all words in a text as equally important, which is not correct. Researchers around the world have worked on these problems and achieved many satisfying results over the years. One algorithm that works well is Word2vec, which represents each word as a vector; the similarity between two words is then computed as the cosine similarity between their vectors. Incorporating Word2vec makes it possible to recognize words with similar meanings but different spellings. In addition, this thesis introduces the concept of word frequency, which distinguishes words of different importance in the text.
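The idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the three-dimensional vectors in `TOY_VECTORS` are invented for illustration (real Word2vec vectors are learned from a corpus and have far more dimensions), the function names are hypothetical, and weighting each word by its relative frequency in the document is only one assumed reading of "word frequency".

```python
import math
from collections import Counter

# Hypothetical toy word vectors, invented for illustration only.
# Real Word2vec vectors would be learned from a large corpus.
TOY_VECTORS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.9],
    "fast": [0.5, 0.8, 0.1],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity with a zero vector is defined as 0 here
    return dot / (norm_u * norm_v)


def document_vector(tokens, vectors):
    """Frequency-weighted average of the vectors of known words.

    Assumption: a word's weight is its relative frequency in the document.
    Words without a vector are skipped.
    """
    counts = Counter(t for t in tokens if t in vectors)
    total = sum(counts.values())
    dim = len(next(iter(vectors.values())))
    doc = [0.0] * dim
    if total == 0:
        return doc
    for word, count in counts.items():
        weight = count / total
        for i, x in enumerate(vectors[word]):
            doc[i] += weight * x
    return doc


def document_similarity(tokens_a, tokens_b, vectors):
    """Cosine similarity between two frequency-weighted document vectors."""
    return cosine_similarity(
        document_vector(tokens_a, vectors),
        document_vector(tokens_b, vectors),
    )
```

With these toy vectors, `cosine_similarity(TOY_VECTORS["car"], TOY_VECTORS["automobile"])` is much higher than the score for "car" and "banana", which is how differently spelled words with similar meanings can still be matched. A word repeated in a document pulls the document vector toward its own vector, which is the sense in which frequency signals importance in this sketch.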
Keywords/Search Tags: Document Similarity, VSM, Distributed Representation, Word2vec, Information Retrieval