Research Of Copy Detection For Chinese Text

Posted on: 2010-04-19  Degree: Master  Type: Thesis
Country: China  Candidate: X K Lu  Full Text: PDF
GTID: 2178330338975831  Subject: Computer software and theory
Abstract/Summary:
In the information society, the rapid development of computer, communication, and network technology has made the network an important way to obtain information. It is widely expected that online media will soon replace print media as the main information resource. With the explosive growth of internet information, how to quickly obtain the required information has become an important problem.

To address this problem, search engine technology appeared and is developing quickly. However, current search engine technology is not satisfactory, because a large number of duplicate results appear among the returned pages. These duplicates arise because web sites reproduce one another's content, which not only makes information retrieval harder for users but also wastes storage space. Detecting duplicate pages is therefore a meaningful task: it avoids redundant storage and makes information retrieval faster and more accurate. In addition, in the internet e-commerce environment, digital products can easily be illegally copied and spread, which hinders the healthy development of e-commerce. Copy detection technology can, to a certain extent, help to resolve these issues. Currently, many problems in copy detection for Chinese text have no satisfactory solution.

This thesis first presents a brief review of text copy detection technology together with a study of the related technologies. Chinese text processing includes chunk segmentation, feature extraction, text similarity measurement, and text copy detection. Traditional copy detection algorithms are analyzed in terms of their performance, advantages, and disadvantages. The thesis then focuses on Chinese text copy detection algorithms and proposes two improved algorithms.

Although traditional n-gram-based methods for Chinese text copy detection avoid the Chinese word segmentation step, their text feature extraction is not satisfactory. The proposed n-gram-based text copy detection method combines the n-gram method with a sliding window technique to extract a small number of text features, which allows text similarity to be calculated more accurately and thereby improves the efficiency of the algorithm. Experiments show that the proposed method is effective, with relatively promising recall and precision rates.

This thesis also presents a sentence-comparison-based text copy detection method. The method adopts a sentence-document multi-index storage structure, so that all documents containing a query sentence can be found from that sentence during document copy detection. The algorithm first divides the text into its original sentences, then performs word segmentation on every sentence and extracts the noun sequence of each sentence as that sentence's feature. The next step is to construct a sentence-document multi-layer index structure from the noun sequences. Finally, the similarity between texts is calculated using the longest common subsequence algorithm. From the calculated similarity value between two texts, we can determine whether duplication exists between them and to what degree.

Finally, a manually labeled test corpus is used to test the two improved text copy detection methods. The test results are analyzed in terms of precision and recall rates, which are used to evaluate the two proposed methods. The results show that both improved methods in this thesis have promising copy detection effects.
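The two ideas named in the abstract, n-gram features selected with a sliding window and sentence comparison with the longest common subsequence, can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The abstract does not say how the sliding window selects features, so the sketch assumes a winnowing-style rule (keep the smallest n-gram hash in each window). The Jaccard overlap, the normalisation of the LCS length by the longer sequence, the parameter defaults `n=3` and `w=4`, and all function names are likewise assumptions made for illustration.

```python
import zlib


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams (no word segmentation needed)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def window_fingerprints(text, n=3, w=4):
    """Keep one n-gram hash per sliding window of w consecutive n-grams.

    Assumption: the minimum hash in each window is kept (winnowing-style
    selection); the abstract does not state the actual selection rule.
    """
    hashes = [zlib.crc32(g.encode("utf-8")) for g in char_ngrams(text, n)]
    if not hashes:
        return set()
    if len(hashes) <= w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed similarity measure."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def lcs_length(xs, ys):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(ys) + 1)
    for x in xs:
        curr = [0]
        for j, y in enumerate(ys, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(xs, ys):
    """LCS length divided by the longer sequence length (assumed normalisation)."""
    if not xs and not ys:
        return 1.0
    return lcs_length(xs, ys) / max(len(xs), len(ys))
```

For the first method, two documents would be compared with `jaccard(window_fingerprints(a), window_fingerprints(b))`. For the second, `lcs_similarity` would be applied to the noun sequences of two sentences, for example `lcs_similarity(["文本", "复制", "技术"], ["文本", "技术"])`. Producing those noun sequences requires a Chinese word segmenter and part-of-speech tagger, which this sketch leaves out.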
Keywords/Search Tags: copy detection, sliding window, sentence comparison, multi-index, Chinese information processing