Copy Detection Method For Chinese Documents Based On Fingerprint And Semantic Knowledge Representation

Posted on: 2011-08-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Li
Full Text: PDF
GTID: 1118360302994396
Subject: Computer application technology

Abstract/Summary:
Copy detection for natural language documents is an important topic in information processing. It is a powerful tool for protecting authors' intellectual property and for improving the efficiency of information retrieval. Document copy detection judges whether a given document plagiarizes the content of other documents in a database. Plagiarism can take several forms, such as duplicating part or all of a document's content, or using different words or sentences to express the same meaning as the text of previous documents in the database. Building on the theory behind existing document copy detection systems, this dissertation studies a copy detection method for Chinese documents. The proposed method uses fingerprints and semantic knowledge representation to find the overlaps between Chinese documents automatically.

Firstly, the functions, merits, and defects of existing systems are analyzed, and a fingerprint-based copy detection method for Chinese documents is proposed. In accordance with the properties of document copy detection, a detection granularity parameter and a noise granularity parameter are defined. A hash function maps the two sequences of chunks, from which noise has been eliminated, to sets of corresponding values. A window-based algorithm is proposed to extract fingerprints from the sequence of values, and the overlap between a query document and the documents in the database is calculated by a defined formula. The overlap of the query document is used to judge whether it is a copy.

Secondly, an unsupervised-learning method for tagging the word senses of Chinese full words is proposed. The correct senses of monosemes and classified polysemes are tagged using the HowNet dictionary definitions and the part of speech of the word. Based on the actual application, an improved word sense disambiguation method is used to tag the most appropriate senses of the other kinds of polysemes in a particular context. The existing EM (Expectation Maximization) algorithm is computationally expensive and converges slowly. To address these problems, mutual information theory based on the Z-test is used to select features, and a statistical learning algorithm is presented to estimate the initial parameter values.

Thirdly, an unsupervised Chinese parser based on probabilistic context-free grammar is proposed. To address the limitations of probabilistic context-free grammar, context information is used, and a new probability estimation function for syntactic structures is introduced that combines the co-occurrence information of parts of speech and syntactic categories. The parsing algorithm is described, and the Inside-Outside algorithm is used to obtain the probabilities of the syntactic rules and of the structure co-occurrences from raw text. This avoids the need to build large-scale Chinese treebanks.

Finally, in order to derive the meaning of a sentence from its syntactic structure and the senses of the substantives in the sentence, a frame-based semantic knowledge representation method is proposed. The representation can express the meaning of words, phrases, and sentences, and clearly shows the hierarchy among the semantic units. Complicated document copy patterns, such as single-word synonym substitution, changes of voice or part of speech, and the breaking up of long sentences, are found by the copy detection method based on semantic knowledge representation, and a corresponding overlap measure is presented.

The experiments determine the optimal values of the defined parameters and validate the correctness and effectiveness of the proposed methods. The fingerprint-based copy detection method uses the idea of string matching to find the overlaps between Chinese documents. By analyzing and matching the meanings of the sentences in a document, the copy detection method based on semantic knowledge representation finds the overlaps between Chinese documents at the semantic level of natural language processing.
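
To make the fingerprint-based step more concrete, the following is a minimal sketch of a window-based fingerprint selection and overlap measure. It is not the dissertation's algorithm: the abstract does not give the definitions of the detection granularity, the noise granularity, or the overlap formula. The sketch assumes a winnowing-style selection (keep the minimum hash in each sliding window), treats non-alphanumeric characters as the "noise" to remove, uses a hypothetical chunk length `k` and window size `w`, and measures overlap as the fraction of the query's fingerprints that also occur in the candidate. All function names are illustrative.

```python
import hashlib


def chunk_hashes(text, k):
    """Hash every k-character chunk of the text after noise removal.

    Assumption: "noise" means whitespace and punctuation, so only
    alphanumeric characters (including Chinese characters) are kept.
    """
    cleaned = "".join(ch for ch in text if ch.isalnum())
    hashes = []
    for i in range(len(cleaned) - k + 1):
        digest = hashlib.md5(cleaned[i:i + k].encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))
    return hashes


def select_fingerprints(hashes, w):
    """Winnowing-style selection: keep the minimum hash of each window of w hashes."""
    if not hashes:
        return set()
    if len(hashes) <= w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def overlap(query, candidate, k=5, w=4):
    """Fraction of the query's fingerprints that also appear in the candidate.

    Assumption: this ratio stands in for the dissertation's overlap formula,
    which the abstract does not state.
    """
    fq = select_fingerprints(chunk_hashes(query, k), w)
    fc = select_fingerprints(chunk_hashes(candidate, k), w)
    return len(fq & fc) / len(fq) if fq else 0.0
```

With this selection rule, any run of at least `w + k - 1` identical characters (after noise removal) shared by two documents produces at least one common fingerprint, so a nonzero overlap flags the pair for closer inspection. A threshold on the overlap value would then decide whether the query counts as a copy; the choice of threshold, like `k` and `w`, would have to be tuned experimentally.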
Keywords/Search Tags:Information Processing, Document Copy Detection, Fingerprint, Semantic Knowledge Representation, Semantic Analysis, Disambiguation, Expectation Maximization Algorithm