Implementation Research On Web Information Extraction System Based On Network Packets

Posted on: 2008-06-04
Degree: Master
Type: Thesis
Country: China
Candidate: S Deng
Full Text: PDF
GTID: 2178360272969127
Subject: Computer system architecture
Abstract/Summary:
The control and management of network information is an important component of national security. It is of great practical significance for protecting national development and social stability, safeguarding national sovereignty, and maintaining the normal order of public network information.

The technologies related to Web information extraction are introduced in two parts. The first is network packet capture, with a focus on Libpcap and Winpcap. The second is HTTP (HyperText Transfer Protocol), including the composition of the protocol, its related parameters, the MIME types, and the compression methods that HTTP uses.

Based on the requirements analysis and the application environment, the WIES (Web Information Extraction System) is defined, including its design principles and ideas. The general framework and the basic functions of each functional module are proposed. After analyzing the defects of the Winpcap packet capture method, four optimizations are given: moving the application to the kernel level, bypassing the core protocol stack, decreasing hardware interrupts, and multi-copy of packets. Multi-copy is selected as the measure to implement.

A common framework is designed that integrates Web information extraction for both HTTP/1.0 and HTTP/1.1. The extraction method for HTTP/1.0 is designed first. Then, focusing on persistent connections, chunked encoding, and data compression, a set of methods is designed so that WIES can support HTTP/1.1. Experiments show that WIES can recover Web information transferred over HTTP/1.0 and HTTP/1.1 effectively and stably.
Keywords/Search Tags:Web Information Extraction, Network Packet Capture, Persistent Connection, Message Compression Method, Chunk Encoding
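
The abstract names chunked encoding and data compression as two of the HTTP/1.1 features WIES has to handle when recovering a page from captured traffic. The thesis's own implementation is not shown here, so the following is only a minimal Python sketch of the general idea, assuming the TCP stream has already been reassembled and the response headers parsed into a dictionary with lower-case keys. The function names `decode_chunked`, `decode_content`, and `extract_body` are hypothetical and not taken from the thesis.

```python
import gzip
import zlib


def decode_chunked(body: bytes) -> bytes:
    """Reassemble a chunked (Transfer-Encoding: chunked) body into one payload."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The size line is hexadecimal and may carry extensions after ';'.
        size = int(body[pos:line_end].split(b";", 1)[0].strip(), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; any trailer headers are ignored in this sketch
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and the CRLF that follows it
    return bytes(out)


def decode_content(payload: bytes, content_encoding: str) -> bytes:
    """Undo the Content-Encoding applied to the payload, if any."""
    enc = content_encoding.strip().lower()
    if enc in ("gzip", "x-gzip"):
        return gzip.decompress(payload)
    if enc == "deflate":
        # Assumes the zlib-wrapped form; raw deflate streams would need wbits=-15.
        return zlib.decompress(payload)
    return payload  # "identity" or an encoding this sketch does not handle


def extract_body(raw_body: bytes, headers: dict) -> bytes:
    """Hypothetical helper: recover the page bytes from one HTTP response body."""
    if headers.get("transfer-encoding", "").lower() == "chunked":
        raw_body = decode_chunked(raw_body)
    return decode_content(raw_body, headers.get("content-encoding", "identity"))
```

The order matters: the transfer coding (chunking) is removed first, and only then is the content coding (compression) undone, because the chunks carry the already-compressed bytes. On a persistent connection, the terminating zero-size chunk is also what marks where one response body ends and the next response begins.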