The Research On Document Copy Detection Technology Based On Chinese Character Component Histogram

Posted on: 2016-11-10
Degree: Master
Type: Thesis
Country: China
Candidate: L Q Jiang
GTID: 2308330470977064
Subject: Computer application technology
Abstract/Summary:
Text copy detection is an application of text similarity. It plays an important role in web page deduplication, intellectual property protection, search engines, digital libraries, and similar areas. For Chinese documents, however, research on copy detection started relatively late, and the complexity of Chinese text makes it harder to implement, so the field merits further study. This thesis first discusses in detail two types of copy detection algorithms, those based on character matching and those based on word-frequency statistics, and summarizes the characteristics and disadvantages of existing algorithms. Most text feature vector representations are high-dimensional and sparse, which makes similarity calculation complicated and resource utilization low. To address this, the thesis proposes a new text copy detection model based on the Chinese character component histogram. The main work is as follows:

1) Extracting the Chinese character component histogram as the text fingerprint feature is proposed. First, the Chinese characters in a text are split into components according to the structure of the characters and the Mathematical Expression of Chinese Characters. Then the frequency of each component is calculated to construct the component histogram, with the component number as the abscissa and the probability of each component as the ordinate. Finally, the component histogram is taken as the fingerprint feature of the text.

2) The matching distance between component histograms is used as the decision criterion of the copy detection system. The thesis designs four distance formulas for histogram matching, and the experiments select the Bhattacharyya distance as the most appropriate measure of histogram similarity.

3) A substantial amount of data was collected to simulate and implement the algorithm. The experimental data consists of 400 entry documents. According to the results, the copy detection algorithm based on the Chinese character component histogram achieves good precision, recall, and F1. Furthermore, comparison experiments show that the new method has advantages in time complexity and space complexity and also achieves a better F1.
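
The pipeline described in items 1) and 2) can be illustrated with a minimal Python sketch. It is not the thesis implementation. The decomposition table `COMPONENTS` is a hypothetical toy stand-in for the component splitting derived from the Mathematical Expression of Chinese Characters, the function names are invented for illustration, and the distance uses one common bounded form of the Bhattacharyya distance, sqrt(1 - BC) with BC = sum_i sqrt(p_i * q_i). The abstract does not state which form the thesis uses.

```python
import math
from collections import Counter

# Hypothetical toy decomposition table (assumption). The thesis derives
# components from character structure and the Mathematical Expression of
# Chinese Characters, which is not reproduced here.
COMPONENTS = {
    "好": ["女", "子"],
    "明": ["日", "月"],
    "林": ["木", "木"],
    "森": ["木", "木", "木"],
    "休": ["亻", "木"],
}


def component_histogram(text):
    """Return {component: relative frequency} for the characters in text.

    Characters missing from the toy table are counted as a single
    component (an assumption made only for this sketch).
    """
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        counts.update(COMPONENTS.get(ch, [ch]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {comp: n / total for comp, n in counts.items()}


def bhattacharyya_distance(p, q):
    """Bounded Bhattacharyya-type distance sqrt(1 - BC) between histograms.

    BC is the Bhattacharyya coefficient, summed over shared components.
    The result is near 0 for identical histograms and 1 for disjoint ones.
    """
    bc = sum(math.sqrt(p[c] * q[c]) for c in p.keys() & q.keys())
    # Clamp to guard against rounding pushing BC slightly above 1.
    return math.sqrt(max(0.0, 1.0 - bc))


if __name__ == "__main__":
    doc_a = component_histogram("森林 休")
    doc_b = component_histogram("林 休")
    doc_c = component_histogram("好明")
    print(bhattacharyya_distance(doc_a, doc_b))
    print(bhattacharyya_distance(doc_a, doc_c))
```

In a detector built this way, a pair of documents would be flagged as a suspected copy when the distance falls below some threshold. The abstract gives no threshold, so any value would have to be tuned on labeled data.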
Keywords/Search Tags:Chinese text, Similarity, Copy detection, Components histogram, Histogram distance