Copy Detection Method For Chinese Documents Based On Fingerprint And Semantic Knowledge Representation

Posted on:2011-08-14

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X Li

GTID:1118360302994396

Subject:Computer application technology

Abstract/Summary:

Copy detection for natural language documents is an important topic in the information processing field, and it is a powerful tool for protecting authors' intellectual property and improving the efficiency of information retrieval. Document copy detection judges whether a given document plagiarizes the content of other documents in a database. Plagiarism can occur in several ways, such as duplicating part or all of a document's content, or using different words or sentences to express the same meaning as the text of previous documents in the database. Building on the theory of existing document copy detection systems, this dissertation studies a copy detection method for Chinese documents. The proposed method uses fingerprints and semantic knowledge representation to find the overlaps between Chinese documents automatically.

Firstly, the functions, merits, and defects of the existing systems are analyzed, and a fingerprint-based copy detection method for Chinese documents is proposed. According to the properties of document copy detection, a detection granularity parameter and a noise granularity parameter are defined. A hash function maps two sequences of chunks, from which noise has been eliminated, to sets of corresponding values. A window-based algorithm is proposed to extract fingerprints from the sequence of values, and the overlap between a query document and the documents in the database is calculated by a defined formula. The overlap of the query document is used to judge whether it is a copy.

Secondly, a word sense tagging method for Chinese full words based on unsupervised learning is proposed. The correct senses of monosemes and classified polysemes are tagged using the dictionary definitions of HowNet and the part of speech of the word. Based on the actual application, an improved word sense disambiguation method is used to tag the most appropriate senses of the other kinds of polysemes in a particular context. The existing EM (Expectation Maximization) algorithm is computationally expensive and converges slowly. To address these problems, mutual information based on the Z-test is used to select features, and a statistical learning algorithm is presented to estimate initial parameter values.

Thirdly, an unsupervised Chinese parser based on probabilistic context-free grammar is proposed. To address the limitations of probabilistic context-free grammar, context information is used, and a new syntactic structure probability estimation function that combines the co-occurrence information of part of speech and syntactic category is introduced. The parsing algorithm is described, and the Inside-Outside algorithm is used to obtain the probabilities of the syntactic rules and the structure co-occurrences from raw corpora. This avoids the problem of building large-scale Chinese treebanks.

Finally, in order to derive the meaning of a sentence from its syntactic structure and the senses of the substantives it contains, a frame-based semantic knowledge representation method is proposed. The representation can express the meaning of words, phrases, and sentences, and clearly shows the hierarchies between the semantic units. Complicated document copy patterns, such as single-word synonym substitution, changes of voice, changes of part of speech, and the breaking of long sentences, are found by the copy detection method based on semantic knowledge representation, and the corresponding overlap measure is presented.

The experiments confirm the optimal values of the defined parameters and validate the correctness and validity of the proposed methods. The fingerprint-based copy detection method uses the idea of string matching to find the overlaps between Chinese documents. By analyzing and matching the meanings of the sentences in a document, the copy detection method based on semantic knowledge representation finds the overlaps between Chinese documents at the semantic level of natural language processing.
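
The fingerprint-based stage described in the abstract (noise removal, hashing of chunks, window-based fingerprint selection, overlap scoring) can be illustrated with a small Python sketch. The abstract does not give the dissertation's actual definitions, so everything here is an assumption: the sketch substitutes the well-known winnowing scheme for the window-based algorithm, treats "noise" as punctuation and whitespace, uses the hypothetical parameter k (chunk length in characters) in place of the detection granularity and w as the window size, and scores overlap as the fraction of the query's fingerprints that also appear in the candidate document.

```python
import hashlib
import re


def strip_noise(text):
    # Assumption: "noise" is anything that is not a letter, digit, or CJK
    # character (punctuation, whitespace, underscores).
    return re.sub(r"[\W_]+", "", text)


def chunk_hashes(text, k):
    # Hash every overlapping k-character chunk to a 64-bit integer.
    # k is a hypothetical stand-in for the detection granularity parameter.
    return [
        int.from_bytes(hashlib.md5(text[i:i + k].encode("utf-8")).digest()[:8], "big")
        for i in range(len(text) - k + 1)
    ]


def winnow(hashes, w):
    # Winnowing: keep the minimum hash of every window of w consecutive
    # hashes. Positions are discarded; only the set of values is kept.
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def fingerprints(text, k=4, w=3):
    return winnow(chunk_hashes(strip_noise(text), k), w)


def overlap(query, candidate, k=4, w=3):
    # Assumed overlap measure: share of the query's fingerprints that also
    # occur in the candidate (a containment score between 0.0 and 1.0).
    q = fingerprints(query, k, w)
    c = fingerprints(candidate, k, w)
    return len(q & c) / len(q) if q else 0.0
```

With this sketch, two texts that differ only in punctuation score 1.0, because they are identical after noise removal, and any shared run of at least w + k - 1 characters contributes at least one common fingerprint. A real system would compare the score against a threshold tuned experimentally, as the abstract indicates for its own parameters.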

Keywords/Search Tags:

Information Processing, Document Copy Detection, Fingerprint, Semantic Knowledge Representation, Semantic Analysis, Disambiguation, Expectation Maximization Algorithm

Related items

1	The Representation Of Chinese Semantic Knowledge And Its Application In The Chinese-English MT System
2	Research On Technologies Of Knowledge Graph Representation Learning Based On Semantic Analysis
3	Based On Semantic Document Information Resources Retrieval System Design And Realization
4	Research On Rough Classification Of Academic Papers Based On Topic And Semantic Fingerprint Fusion
5	Research On Chinese Word Sense Disambiguation Based On Semantic Analysis
6	The Semantic Information Automatic Generation
7	Research On Ontology Representation Of Educational Information Processing Oriented Semantic Web
8	Research On Semantic Processing Technology Based Information Retrieval Model
9	Investigation Of Categorical Semantic Information Processing In The Brain And Natural Language Processing Models
10	Research And Implementation Of Abnormal Data Processing Technology For Web Semantic Tables