Chinese Text Plagiarism Detection Algorithm Based On The Double Feature Extraction

Posted on: 2014-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: Q Xu
Full Text: PDF
GTID: 2248330398482551
Subject: Computer software and theory
Abstract/Summary:
In recent decades, with the rapid development of information technology and networks, the way people access information has shifted from physical media to web documents. This development has brought convenience, but it has also had negative effects. Compared with traditional documents, electronic documents are much easier to copy illegally, and text plagiarism now appears in many areas, such as academia and business. The problem has become very serious. Research on text plagiarism detection is therefore of great significance for maintaining normal teaching order, protecting intellectual property, and curbing the spread of plagiarism. Among current plagiarism detection systems, the more effective ones include Siff, COPS, and the China HowNet detection system, but they share a common problem: detection accuracy is not high.

The main idea of Chinese text plagiarism detection is as follows. First, the text is preprocessed, which includes removing information that is irrelevant to detection. Second, text features are extracted. Finally, the similarity between the text under test and the texts in the source library is calculated. If the similarity value is higher than a predetermined threshold, the text under test is suspected of plagiarism. Text preprocessing and feature extraction are the important and difficult parts of plagiarism detection research. This thesis focuses on these two aspects and carries out research in the following three areas:

1. Text preprocessing. At present, most Chinese text plagiarism detection methods apply only simple text processing and do not consider the characteristics of single characters and words in Chinese text. As a result, feature extraction is not comprehensive and detection accuracy is not high. To address this problem, this thesis puts forward a whole-word merging preprocessing method: after word segmentation, adjacent words are merged according to the semantic relations between each word and its neighbors, and the resulting units, each carrying a complete meaning, form the preprocessing result. Experiments show that merging whole words reduces the number of subsequent computations and provides a better basis for feature extraction, thereby improving detection accuracy.

2. Text feature extraction. Feature extraction selects text blocks that represent the characteristics of the text. The selected blocks should capture, as far as possible, the content, semantic, and structural information of the text so that detection accuracy is high. With existing extraction methods, however, the extracted features are incomplete and too numerous, which leads to more computations and higher time complexity. To address these problems, we propose applying double feature extraction to the preprocessed text in order to improve feature accuracy and control feature length. The method mainly uses digital fingerprints to represent text information: the whole text is converted into a set of digital fingerprints, the frequency of each fingerprint is counted, and fingerprint sets are matched using a statistical similarity calculation. Experiments show that the extracted features accurately represent the characteristics of a text and that their length is moderate.

3. Chinese text plagiarism detection based on double feature extraction. We use the proposed whole-word merging method to preprocess the text and the double feature extraction method to extract its features, and on this basis implement a Chinese text plagiarism detection method based on double feature extraction. Experiments show that the accuracy and recall rate of the detection method are clearly improved.
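
The fingerprint-frequency idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not specify the fingerprint function, the unit being fingerprinted, the similarity measure, or the threshold. The sketch assumes character 3-grams in place of merged whole words, an MD5-derived 64-bit integer as the fingerprint, cosine similarity as the statistical measure, and a threshold of 0.6. The names `fingerprints`, `cosine_similarity`, and `is_suspected` are hypothetical.

```python
import hashlib
import math
from collections import Counter


def fingerprints(text: str, k: int = 3) -> Counter:
    """Count how often each k-gram fingerprint occurs in the text."""
    # Assumption: stripping whitespace stands in for real preprocessing
    # (word segmentation and whole-word merging are not modelled here).
    cleaned = "".join(text.split())
    counts = Counter()
    for i in range(len(cleaned) - k + 1):
        gram = cleaned[i:i + k]
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        # Assumption: the first 8 bytes of the MD5 digest form the fingerprint.
        counts[int.from_bytes(digest[:8], "big")] += 1
    return counts


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two fingerprint-frequency vectors."""
    dot = sum(freq * b[fp] for fp, freq in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_suspected(text: str, source: str, threshold: float = 0.6, k: int = 3) -> bool:
    """Flag the text when its similarity to a source exceeds the threshold."""
    similarity = cosine_similarity(fingerprints(text, k), fingerprints(source, k))
    return similarity > threshold
```

In a full system, `is_suspected` would be applied against every text in the source library, and the threshold would be tuned experimentally rather than fixed in advance.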
Keywords/Search Tags: text plagiarism detection, text preprocessing, second feature extraction, text similarity, fingerprints