Research on Key Technologies of Content Extraction and Filtering for Formatted Documents

Posted on: 2013-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: L R Liu
GTID: 2248330377458797
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the network has become a convenient way to transfer information and files. As harmful information and illegal files increase, file content filtering has become an effective way to guarantee information security during network transmission. File content filtering consists of two parts: content extraction and content filtering. Traditional extraction methods have one limitation: they can extract text from a formatted file only after the file has been transferred completely. These methods therefore cannot meet the need for real-time extraction during transmission; for instance, they cannot extract text from the part of a file that has already been transmitted. Traditional content filters use multi-pattern matching algorithms, which cannot meet the processing requirements of complex matching rules. Because a single word can express only a simple semantic meaning, several keywords sometimes need to be combined into one rule to achieve accurate matching results.
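To illustrate what combining keywords into one rule can look like, here is a minimal Python sketch. It is not the thesis's algorithm: the rule format (disjunctive normal form), the function names, and the sample keywords are all assumptions made for illustration, and the naive substring scan stands in for a real multi-pattern matcher.

```python
def rule_keywords(rule):
    """Collect every keyword mentioned anywhere in a rule."""
    return {kw for clause in rule for kw, _ in clause}


def find_keywords(text, keywords):
    """Return the keywords that occur in text (naive scan, for illustration only)."""
    return {kw for kw in keywords if kw in text}


def rule_matches(rule, text):
    """Evaluate a Boolean rule against text.

    A rule is a list of clauses joined by OR. Each clause is a list of
    (keyword, required) pairs joined by AND: required=True means the keyword
    must be present, required=False means it must be absent.
    """
    found = find_keywords(text, rule_keywords(rule))
    return any(
        all((kw in found) == required for kw, required in clause)
        for clause in rule
    )


# Hypothetical rule: (confidential AND salary) OR (password AND NOT example)
RULE = [
    [("confidential", True), ("salary", True)],
    [("password", True), ("example", False)],
]
```

A single keyword such as "password" would flag both real credentials and harmless documentation; the second clause above only fires when "example" is absent, which is the kind of precision a rule adds over a lone keyword.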
To obtain accurate and efficient file content extraction and filtering during network transmission, and building on a study of content extraction and filtering work at home and abroad, this thesis makes the following contributions.
First, the file formats of Office 2007 documents and PDF files were studied and their document structures were analyzed. On the basis of this work, the thesis puts forward a real-time content extraction method for files in network transmission. The method relies on subdivision decompression of the part of the file that has arrived, and then extracts text from the decompressed result by locating tagged words. The whole extraction process requires the support of four techniques: subdivision decompression, subdivision cache, tagged word search, and text extraction.
Second, the matching algorithm for Boolean expressions was studied. The original matching algorithm has to sort the intervals of all keywords and then traverse all keywords. Using the idea of breadth-first search, intervals that do not need to be checked are removed, which improves the performance of the final expression matching. In addition, the original interval algorithm is extended so that it can handle sets of Boolean expressions that contain prefixes. The original labeling algorithm treats the same keyword appearing in different expressions as different keywords, which wastes search time. The labeling algorithm is therefore modified so that keywords reuse intervals as far as possible. This reduces the number of intervals used to find the matching result, improves processing performance, and reduces space consumption.
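The real-time extraction idea described above can be sketched in Python as follows. This is a hedged illustration, not the thesis's implementation: the class name `StreamingTextExtractor` and its cache policy are hypothetical, the input is assumed to be the raw deflate data of one Office 2007 XML part (ZIP header parsing is omitted), and the tagged word is assumed to be the `<w:t>` text element of WordprocessingML.

```python
import codecs
import re
import zlib

# A complete text run: <w:t ...>text</w:t>
TEXT_RUN = re.compile(r"<w:t(?:\s[^>]*)?>(.*?)</w:t>", re.S)


class StreamingTextExtractor:
    """Decompress arriving chunks and extract text as soon as a run is complete."""

    def __init__(self):
        self._inflater = zlib.decompressobj(-15)  # raw deflate, as used by ZIP entries
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
        self._cache = ""  # decompressed text not yet consumed

    def feed(self, chunk):
        """Accept one compressed chunk; return the text runs completed so far."""
        self._cache += self._decoder.decode(self._inflater.decompress(chunk))
        runs, last_end = [], 0
        for match in TEXT_RUN.finditer(self._cache):
            runs.append(match.group(1))
            last_end = match.end()
        tail = self._cache[last_end:]
        start = tail.rfind("<w:t")
        # Keep a possibly unfinished run, or a few characters that may be
        # the beginning of a tag split across two chunks.
        self._cache = tail[start:] if start != -1 else tail[-3:]
        return runs


def extract_in_chunks(compressed, chunk_size):
    """Simulate network arrival by feeding fixed-size pieces of the data."""
    extractor = StreamingTextExtractor()
    text = []
    for i in range(0, len(compressed), chunk_size):
        text.extend(extractor.feed(compressed[i:i + chunk_size]))
    return text
```

The small cache plays the role of holding data that straddles two chunks, so text becomes available before the transfer finishes. A PDF stream compressed with FlateDecode could be handled along similar lines with a zlib-wrapped decompressor and different tags.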
Keywords/Search Tags: Formatted file, content extraction, Boolean expression, matching algorithm