
Research On Chinese Text Similarity Detection Technology Based On Word Weight Analysis

Posted on: 2022-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Chen
GTID: 2518306326966119
Subject: Master of Engineering
Abstract/Summary:
In the information age, the volume of information on the Internet keeps growing rapidly as more and more users take part in producing and disseminating it. Much of this information, textual data in particular, shares similar or even identical core content, because documents are widely copied, modified, reformatted, and rewritten with synonyms as they spread. The resulting waste of time in computation, file storage, and retrieval has a serious negative effect on the quality and dissemination of Internet information. To reduce the waste caused by duplicate documents, an efficient and accurate technique for retrieving similar documents is needed. Simhash is a common technique for similar document detection that maps a text to a low-dimensional token; the similarity of two documents can then be judged by comparing their Simhash tokens. However, Simhash was originally designed to detect near-duplicates for web crawling, and it does not take into account the semantic information carried by each word when representing a text. As a result, the original Simhash token represents the text with limited accuracy. To improve the accuracy of similar text detection, this thesis studies and improves both the word weight calculation strategy and the similar text detection algorithm. The main contributions are as follows:

(1) This thesis proposes a word similarity algorithm that integrates HowNet and Cilin, to address the incomplete information available to Chinese word similarity algorithms that rely on a single knowledge base. The word similarity algorithm is intended for use in the word weight calculation strategy. It builds on the existing strong word similarity algorithms based on HowNet IC and Cilin IC and combines the two dynamically according to the distribution of words, making full use of the hierarchical structure information in HowNet and Cilin to overcome the limitations of existing methods.

(2) To address the poor accuracy with which traditional TF-IDF reflects the importance of words in a text, this thesis proposes an improved word weight calculation method. Traditional TF-IDF considers only the frequency of a word in the text and in the document set, and neglects the characteristics of the word itself. Based on an analysis of the expression habits of Chinese users and the semantic information contained in words, the thesis proposes an algorithm that integrates the length, part of speech, position, and title matching of each word.

(3) To address the shortcomings of Simhash in similar text detection, a multi-feature word weight algorithm is applied to improve the generation of the Simhash token. The method uses the new word weight calculation to generate a Simhash token that maps each text in the document set to a low-dimensional token, and then compares the Simhash tokens of different texts to produce the final similar document detection result. Compared with traditional Simhash, the method improves precision and recall in Chinese text similarity detection tasks.
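The weighted Simhash pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not give the weight formula, so `multi_feature_weights` is a hypothetical stand-in that scales term frequency by word length and title matching only (part of speech and position are omitted). The 64-bit fingerprint width, the MD5-based token hash, and the `title_boost` and `length_boost` parameters are likewise assumptions made for illustration.

```python
import hashlib
from collections import Counter

HASH_BITS = 64  # assumed fingerprint width; the abstract does not specify one


def token_hash(token):
    """Stable 64-bit hash of a token (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(weights):
    """Fold weighted token hashes into a single fingerprint.

    `weights` maps each token to its weight. For every bit position the
    weight is added when the token's hash has that bit set and subtracted
    otherwise; the fingerprint bit is 1 where the total is positive.
    """
    totals = [0.0] * HASH_BITS
    for token, weight in weights.items():
        h = token_hash(token)
        for bit in range(HASH_BITS):
            if (h >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def multi_feature_weights(tokens, title_tokens, title_boost=2.0, length_boost=0.1):
    """Hypothetical multi-feature weight: term frequency scaled by
    word length and by whether the word also appears in the title."""
    counts = Counter(tokens)
    title = set(title_tokens)
    weights = {}
    for token, count in counts.items():
        tf = count / len(tokens)
        length_factor = 1.0 + length_boost * len(token)
        title_factor = title_boost if token in title else 1.0
        weights[token] = tf * length_factor * title_factor
    return weights
```

Given two pre-segmented documents, one would compute `simhash(multi_feature_weights(tokens, title_tokens))` for each and treat them as similar when `hamming_distance` between the two fingerprints falls below a chosen threshold. Word segmentation and the threshold are outside the scope of this sketch.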
Keywords/Search Tags:word similarity algorithm, word weight, TF-IDF, Simhash, Text Similarity