
Research And Improvement Of Text Similarity Detection Based On Simhash

Posted on: 2019-12-07
Degree: Master
Type: Thesis
Country: China
Candidate: X X Wang
Full Text: PDF
GTID: 2438330563957628
Subject: Electronics and Communications Engineering
Abstract/Summary:
The internet is changing fast, and a large amount of information enters our lives every day. With the appearance of search engines, users can find what they want in the vast cyberspace using short descriptions. When search engines collect information from the internet, recognizing duplicate or similar web data effectively is a very important function of web data collection. Similar-text search is also an indispensable part of data mining and knowledge discovery, and the growing demand for intellectual property protection requires it to detect plagiarism as well. To search for similar texts efficiently and accurately, this thesis studies several similar-text search methods, analyzes their characteristics, and chooses the Simhash algorithm to achieve the research goal of rapid similar-text search over large amounts of text data. We improve feature extraction for similar texts: using available word segmentation tools for Chinese word segmentation, we generate Simhash fingerprints with the paragraph as the unit, ignoring stop words and weighting terms with the TF-IDF algorithm, to build the relationship between texts and digital fingerprints. For similar-fingerprint search, we split each fingerprint into segments and build an inverted index on them to speed up lookup. We then run Simhash experiments on a large text database to test the efficiency of similar-text search, recording the recall and precision of each experiment. We find that the search for similar texts is quick. From these experiments and the analysis of their results, we also find some shortcomings when the method is applied to short texts or to texts whose position is uncertain. To address these problems, we propose a definition of locality short text and design a recognition method for it to find potentially similar text. We improve the edit distance calculation method to measure the degree of short-text similarity. Finally, we improve our similar-text search method by combining Simhash with locality short text recognition. The results balance recall and precision and show that the method performs well, which means the proposed algorithm has practical application value in the field of similar-text search.
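The fingerprinting and segmented lookup described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the 64-bit fingerprint length, the use of MD5 as the per-token hash, the split into 4 segments, the Hamming distance threshold of 3, and all names (`token_hash`, `simhash`, `SegmentIndex`) are assumptions made for illustration. The term weights stand in for TF-IDF scores and are supplied by hand here; word segmentation and stop-word removal are assumed to have happened already.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

BITS = 64       # fingerprint length (assumed)
SEGMENTS = 4    # number of equal-width segments per fingerprint (assumed)


def token_hash(token: str) -> int:
    """Map a token to a 64-bit integer (MD5 is an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(weighted_tokens: Dict[str, float]) -> int:
    """Build a fingerprint from tokens and their weights (e.g. TF-IDF)."""
    counts = [0.0] * BITS
    for token, weight in weighted_tokens.items():
        h = token_hash(token)
        for i in range(BITS):
            # Add the weight where the token's bit is 1, subtract it where 0.
            counts[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(counts):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def segments(fingerprint: int) -> List[Tuple[int, int]]:
    """Split a fingerprint into (segment position, segment value) keys."""
    width = BITS // SEGMENTS
    mask = (1 << width) - 1
    return [(i, (fingerprint >> (i * width)) & mask) for i in range(SEGMENTS)]


class SegmentIndex:
    """Inverted index from fingerprint segments to document ids."""

    def __init__(self) -> None:
        self.buckets = defaultdict(set)
        self.fingerprints: Dict[str, int] = {}

    def add(self, doc_id: str, fingerprint: int) -> None:
        self.fingerprints[doc_id] = fingerprint
        for key in segments(fingerprint):
            self.buckets[key].add(doc_id)

    def query(self, fingerprint: int, max_distance: int = 3) -> List[str]:
        # Only documents sharing at least one identical segment are compared.
        candidates = set()
        for key in segments(fingerprint):
            candidates |= self.buckets.get(key, set())
        return sorted(
            doc_id for doc_id in candidates
            if hamming(fingerprint, self.fingerprints[doc_id]) <= max_distance
        )


# Usage with made-up weights for two nearly identical paragraphs.
index = SegmentIndex()
index.add("p1", simhash({"simhash": 2.0, "fingerprint": 1.5, "text": 0.7}))
matches = index.query(
    simhash({"simhash": 2.0, "fingerprint": 1.5, "text": 0.7, "the": 0.1})
)
print(matches)
```

The segment lookup rests on the pigeonhole principle: if two 64-bit fingerprints differ in at most 3 bits and each is cut into 4 segments, at least one segment must be identical, so no match within the threshold is missed while most unrelated fingerprints are never compared.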
Keywords/Search Tags:Simhash, similar text, quick search, locality short text, edit distance