Research On Text Document Information Hiding

Posted on:2008-03-30

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Z X Dai

Full Text:PDF

GTID:1118360272966634

Subject:Computer software and theory

Abstract/Summary:

Text documents, a widely used medium for storing and conveying information, are being actively studied for use in covert communication, copyright protection, and content authentication. Information hiding in text is very challenging for two reasons: text lacks the redundancy that image, audio, and video media offer through the limits of the human auditory system (HAS) and human visual system (HVS), and natural language processing still lacks strong theory and practical automatic techniques for understanding, transforming, and generating text content.
The dissertation first surveys the concepts, models, and applications of information hiding and the state of research in China and abroad. A natural language sentence is a sequence of words, each with its own Part of Speech (noun, verb, and so on), so a sentence can be converted into a string of Part of Speech tags. Treating this Part of Speech tagging string as a transform domain of the text document, the dissertation proposes several new watermarking methods in that domain.
In general, the number of Parts of Speech in any natural language is finite. If a suitable partial order is defined on the set of Part of Speech tags, the elements of a tagging string can be compared, and the string can be mapped to 0 or 1 according to the parity (odd or even) of its inverse order, that is, its number of inversions. The research shows that the binary sequences obtained from randomly selected tagging strings have good autocorrelation and cross-correlation properties. The dissertation therefore proposes a new method that hides information in the parity of the inverse order, and it proves how that parity behaves when tags are exchanged, added, or deleted.
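The parity mapping described above can be illustrated with a minimal Python sketch. The tag set, the ranking in TAG_RANK (a total order used here for simplicity), and the function names are assumptions made for illustration; the abstract does not give the dissertation's actual partial order or embedding rules.

```python
from itertools import combinations

# Hypothetical ranking of Part of Speech tags (an assumption for this sketch).
TAG_RANK = {"DET": 0, "ADJ": 1, "NOUN": 2, "VERB": 3, "ADV": 4, "PREP": 5}


def inversion_count(tags):
    """Count pairs (i, j) with i < j whose tags appear in descending rank."""
    ranks = [TAG_RANK[t] for t in tags]
    return sum(1 for a, b in combinations(ranks, 2) if a > b)


def parity_bit(tags):
    """Map a tagging string to one bit: the parity of its inversion count."""
    return inversion_count(tags) % 2


def swap_adjacent(tags, i):
    """Return a copy of tags with positions i and i + 1 exchanged."""
    swapped = list(tags)
    swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
    return swapped


# "The dog quickly ran" versus "The dog ran quickly".
before = ["DET", "NOUN", "ADV", "VERB"]
after = swap_adjacent(before, 2)
print(parity_bit(before), parity_bit(after))
```

Exchanging two adjacent tags of different rank changes the inversion count by exactly one, so the bit carried by the sentence flips. This is the kind of tag-level edit that would then guide a rewrite of the sentence itself.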
To hide information, the tagging string is first modified as required, and the natural language sentence is then rewritten under the guidance of the modified tagging string. The feasibility of the modification is thus guaranteed in theory, and the blindness of modifying natural language sentences directly is avoided. The method also resists synonym-substitution attacks.
A tagging string is in effect a sentence pattern, and because the sentence patterns of natural language are stable, the tagging strings of a text document exhibit statistical regularity. On this basis an information hiding scheme built on information entropy is proposed: the probability distribution of the tagging strings is changed so that its entropy matches the hidden information. Because the watermark function is real-valued and its range is limited only by computational precision, the watermark capacity can be greatly increased. The algorithm resists synonym substitution and sentence reordering, and so avoids the synchronization problem during information extraction. A security analysis indicates that the scheme increases the difficulty an adversary faces in locating the watermark. Finding a probability distribution with a given entropy, however, requires solving a nonlinear equation in several variables. The dissertation presents a method that transforms a nonlinear equation in n variables into at most (n-1) nonlinear equations in one variable. 
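As a rough illustration of the single-variable equations involved, the Python sketch below finds one distribution with a prescribed entropy by bisection on a single mixing parameter. It is not the dissertation's (n-1)-equation reduction, whose details the abstract does not give. The one-parameter family (a blend of the uniform distribution and a point mass), the function names, and the tolerance are all assumptions.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a probability vector p."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def distribution_with_entropy(n, target, tol=1e-12):
    """Return one n-point distribution whose entropy is close to target.

    Assumed family: p(t) = (1 - t) * uniform + t * point mass on outcome 0.
    Entropy is concave and maximal at the uniform distribution, so it
    decreases from log2(n) to 0 as t goes from 0 to 1 and bisection applies.
    """
    if n < 2 or not 0 <= target <= math.log2(n):
        raise ValueError("target entropy must lie in [0, log2(n)]")

    def mix(t):
        return [(1 - t) / n + (t if i == 0 else 0.0) for i in range(n)]

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if entropy(mix(mid)) > target:
            lo = mid  # still too much entropy: move toward the point mass
        else:
            hi = mid
    return mix((lo + hi) / 2)


p = distribution_with_entropy(4, 1.5)
print(p, entropy(p))
```

Many distributions share the same entropy, so a real scheme would need extra constraints to pick one that the cover text can actually realize. The sketch only shows that each individual root-finding step is a standard one-variable problem.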
The correctness of this transformation is proved and an error estimate is given.
The cover-text generation technique proposed by Wayner and others constructs grammar rules at the level of natural language, so that the language defined by the rules is a subset of the natural language. The rules must therefore meet very demanding requirements if the cover text is to remain semantically coherent and arouse no suspicion in an outside reader, and the process is very difficult to automate. This dissertation proposes two information hiding methods that encode tagging strings, one based on a Huffman tree and one based on a Part of Speech grammar. Both the secret message and the cover text are mapped into the value range of the tagging strings, and a matching technique then computes the positions in the cover text of the sentences that carry the secret message. These positions serve as the secret key. Because the sender and the receiver share the same set of tagging strings and the same mapping function, the receiver can use the secret key to recover the tagging strings from the cover text and decode them to read the secret message. The dissertation gives a formula for the hiding capacity. Because the secret message is hidden in the sentence patterns of the text rather than in the sentences themselves, the approach allows the cover text to be chosen arbitrarily and so avoids the problem of syntactic conformity. Grammar rules over Part of Speech tags are easy to construct and can be parsed with YACC, so the process is readily automated.
Centroid detection, which is used to extract the message in line-shift coding, produces noticeable detection errors on short text lines. 
Although Low has pointed out that centroid detection fails on short text lines because their centroid noise has a large variance, he has not so far proposed an improvement. This dissertation improves on conventional centroid detection by simulating an extension of the original text line and combining information from the reproduced and original text line profiles to construct a reproduced centroid array for the simulated line. Line-shift coding and centroid detection were implemented in MATLAB, and the experimental results show that the new approach can halve the probability of detection error on short text lines compared with conventional centroid detection. It follows that line-shift coding is no longer constrained by text line length: both long and short lines can carry the watermark, so the watermark capacity increases.
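For readers unfamiliar with the baseline, the Python sketch below shows a simplified form of conventional centroid detection on a horizontal projection profile. It does not implement the dissertation's improvement, and it is not the MATLAB code used in the experiments. It assumes three text lines that were equally spaced before embedding, unshifted lines above and below the marked line, and a noise-free synthetic profile; all names are hypothetical.

```python
def find_lines(profile):
    """Return (start, end) row ranges of consecutive rows that contain ink."""
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y
        elif ink == 0 and start is not None:
            lines.append((start, y))
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines


def centroid(profile, start, end):
    """Ink-weighted mean row index of profile[start:end]."""
    mass = sum(profile[start:end])
    return sum(y * profile[y] for y in range(start, end)) / mass


def detect_shift(profile):
    """Classify the middle of three text lines as shifted up, down, or none."""
    lines = find_lines(profile)
    if len(lines) != 3:
        raise ValueError("expected exactly three text lines")
    upper, middle, lower = (centroid(profile, s, e) for s, e in lines)
    gap_above, gap_below = middle - upper, lower - middle
    if gap_above < gap_below:
        return "up"
    if gap_above > gap_below:
        return "down"
    return "none"


def make_profile(starts, rows=24, shape=(1, 3, 3, 1)):
    """Build a synthetic profile: ink count per pixel row for the given lines."""
    profile = [0] * rows
    for s in starts:
        for k, weight in enumerate(shape):
            profile[s + k] = weight
    return profile


for starts in ([2, 10, 18], [2, 9, 18], [2, 11, 18]):
    print(starts, detect_shift(make_profile(starts)))
```

On a scanned page the profile of a short line contains few ink pixels, so noise moves its centroid more. That is the failure case the dissertation's simulated line extension is meant to address.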

Keywords/Search Tags:

Information hiding in the text document, Natural language processing, Part of speech tagging, Inverse order, Entropy, Centroid detection, Part of speech tagging string coding

Related items

1	Chinese Word Found Its Part Of Speech Tagging
2	Study Of Kazak Part-of-Speech Tagging Based Upon HMM
3	Study Of Chinese POS Tagging Based On Maximum Entropy
4	Research On Parallel Corpora-based Unsupervised Part-of-speech Tagging For Chinese
5	Research On Lao Language Part-of-speech Tagging With Multiple Features
6	Research On Part-of-Speech Tagging Algorithms Of Mathematical Corpus Based On Deep Learning
7	A Research On Lao Language Part-of-speech Tagging With Multi-feature Fusion
8	Research On Kirghiz Basic Part-of-Speech Tagging Based On HMM
9	Part-of-speech Effect And Affect In Search That In Chinese Literature Of Science And Technology
10	Research On Text Classification Method Based On Part Of Speech Tagging LDA Model