
Research On Key Technologies Of Recovery Of Corrupted OpenXML Compound Documents

Posted on: 2018-05-30 | Degree: Master | Type: Thesis
Country: China | Candidate: D Y Yang | Full Text: PDF
GTID: 2348330563451332 | Subject: Information and Communication Engineering
Abstract/Summary:
With the spread of electronic office work, compound documents play an increasingly important role in daily life. During transmission and storage, compound documents often cannot be opened because of high bit error rates, so their useful content cannot be retrieved, which seriously hampers work efficiency. It is therefore important to study how to make the useful information in a corrupted compound document visible to users. Under one-way communication in particular, the unequal status of the two communicating sides makes the recovery of corrupted compound documents even more significant. This thesis takes OpenXML compound documents as its research object. Starting from practical problems encountered in document recovery, it discusses the related algorithms and applications. The main contributions are as follows:
1. A recovery model for OpenXML compound documents based on recombination of key components is established. The model is derived by exploring the source organization structure of OpenXML compound documents, analyzing the document format, summarizing the potential protocol redundancy, and evaluating how the location of bit errors affects document recovery.
2. A recovery method for OpenXML compound documents based on recombination of key components is proposed. Experiments show that when an OpenXML compound document is corrupted, damage to components unrelated to the document's content can also prevent the document from being opened. To solve this problem, a recovery method that exploits the robustness of the documents is proposed. Its core idea is that an OpenXML compound document can be reconstructed from a few key XML files and relationship files, which maximizes the carried information recovered from the corrupted document. Simulation experiments show that the method can restore the document content to a certain extent and extract the useful information from it.
3. Two error-tolerant delimitation algorithms for the valid data fields are proposed:
(1) Because approximate pattern string matching has much higher computational complexity than exact pattern string matching, an error-tolerant delimitation algorithm based on a double string-matching mechanism is proposed to delimit the valid data fields of the document. The algorithm first converts the bit stream into a character stream, then obtains the starting positions of the valid data fields through a double mechanism that combines the two kinds of string matching, and finally uses the length field to complete the delimitation of the valid data fields. Simulation results show that at high bit error rates (10^-6 to 10^-4) the method achieves good delimitation results while maintaining high delimitation efficiency.
(2) Error-tolerant delimitation of the File Source Data is prone to mistakes during recovery. To solve this problem, an error-tolerant delimitation algorithm for the File Source Data of OpenXML compound documents based on multiple constraints is proposed. After analyzing and classifying the document's protocol redundancy, the thesis transforms the delimitation of the File Source Data into an optimal estimation of the initial position sequence. Once a rough delimitation has been obtained by appropriately relaxing the matching conditions, a cost function is constructed to filter the observed data. By exploiting the constraint relations in the redundant information, the method effectively eliminates false alarms. Simulation results indicate that the algorithm significantly reduces the delimitation error rate and overcomes the sensitivity of conventional delimitation methods to bit errors.
4. A recovery software module for corrupted docx documents is designed and implemented with Windows MFC. The module can add noise to a document, collect statistics on the extraction rate of the document's carried content, and recover both simulated and actually corrupted documents. It also provides a visual interface for recovering corrupted docx documents. Tests on simulated and real documents show good recovery results.
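The first delimitation algorithm in the abstract (the double string-matching mechanism combined with a length field) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. It assumes that the valid data fields are ZIP local file entries (OpenXML packages are ZIP archives), that each field starts with the `PK\x03\x04` signature, and that the sizes stored in the local header play the role of the length field. The function names and the `max_bit_errors` threshold are hypothetical, chosen only for illustration.

```python
import struct
from typing import List, Optional, Tuple

SIGNATURE = b"PK\x03\x04"  # ZIP local file header signature
HEADER_LEN = 30            # fixed-size part of a ZIP local file header


def bit_distance(a: bytes, b: bytes) -> int:
    """Count differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def field_end(data: bytes, start: int) -> Optional[int]:
    """Use the header's length fields to find where the entry ends.

    Assumes sizes are stored in the local header (no data descriptor)
    and that the length fields themselves are not corrupted.
    """
    if start + HEADER_LEN > len(data):
        return None
    (comp_size,) = struct.unpack_from("<I", data, start + 18)
    name_len, extra_len = struct.unpack_from("<HH", data, start + 26)
    end = start + HEADER_LEN + name_len + extra_len + comp_size
    return end if end <= len(data) else None


def delimit_fields(data: bytes, max_bit_errors: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) spans: exact matching first, approximate in the gaps."""
    spans: List[Tuple[int, int]] = []

    # Stage 1: cheap exact matching of the signature.
    pos = data.find(SIGNATURE)
    while pos != -1:
        end = field_end(data, pos)
        if end is not None:
            spans.append((pos, end))
            pos = data.find(SIGNATURE, end)
        else:
            pos = data.find(SIGNATURE, pos + 1)

    # Collect the regions that stage 1 left undelimited.
    gaps: List[Tuple[int, int]] = []
    cursor = 0
    for start, end in spans:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = end
    if cursor < len(data):
        gaps.append((cursor, len(data)))

    # Stage 2: costlier approximate matching, restricted to those gaps.
    n = len(SIGNATURE)
    for gap_start, gap_end in gaps:
        i = gap_start
        while i + n <= gap_end:
            if bit_distance(data[i:i + n], SIGNATURE) <= max_bit_errors:
                end = field_end(data, i)
                if end is not None and end <= gap_end:
                    spans.append((i, end))
                    i = end
                    continue
            i += 1
    return sorted(spans)
```

Restricting the approximate search to the gaps left by exact matching is one way to keep its higher cost under control. The sketch does not handle corrupted length fields, which the multiple-constraint algorithm described in the abstract is meant to address.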
Keywords/Search Tags:OpenXML Compound Document, Recovery, High Bit Error, Key Component, Protocol Redundancy, Error-Tolerant Delimitation