
The Technology Of Web Information Extraction Based On HTML Parser

Posted on: 2008-12-24  Degree: Master  Type: Thesis
Country: China  Candidate: L L Wang  Full Text: PDF
GTID: 2178360215482415  Subject: Software engineering
Abstract/Summary:
The rapid growth of Web content increases the need for automatic tools that help locate specific information, such as titles, links, email addresses, and pictures, among massive information sources. Web pages written in HTML are suitable for browsing once rendered by a browser such as Internet Explorer, but not for machine processing as a means of data exchange. Web information extraction is the process of extracting information of interest from Web documents. The technology is mainly used in meta-search and information agents. First, this paper introduces the background and history of information extraction, and analyzes its system architecture, taxonomy, key technologies, and evaluation measures. Second, it describes the composition of a Web page, the principle of HTML Parser, and the relevant background on regular expressions. Third, it proposes a system model for topic-based (focused) Web information extraction, which selectively gathers pages related to a predefined topic and extracts information from them; the model is introduced and the implementation principles of its functional modules are analyzed. Finally, taking the extraction of email addresses from websites as an example, a design scheme based on HTML Parser and regular expressions is proposed. The principle and key techniques of email extraction are presented, and the extraction algorithm is given. The URL extraction module, the email extraction module, and the storage module are described in detail. The extraction results are stored in a database for later data retrieval.
Keywords/Search Tags: Web information extraction, regular expression, HTML Parser package, topic information
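
The three modules named in the abstract (URL extraction, email extraction, storage) can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes Python's standard `html.parser` as a stand-in for the Java HTML Parser package, SQLite as the database, and a deliberately simplified email pattern. The names `PageScanner`, `extract`, `store`, and the `email` table are hypothetical.

```python
import re
import sqlite3
from html.parser import HTMLParser

# Simplified email pattern (an assumption; the abstract does not give the regex used).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


class PageScanner(HTMLParser):
    """Collects hyperlink targets and text content from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def extract(html):
    """Return (urls, emails) found in an HTML string."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()  # flush any buffered text
    urls, mailto = [], []
    for href in scanner.links:
        if href.lower().startswith("mailto:"):
            # Keep the address, drop any "?subject=..." suffix.
            mailto.append(href[len("mailto:"):].split("?")[0])
        else:
            urls.append(href)  # candidate pages for further crawling
    found = EMAIL_RE.findall(" ".join(scanner.text_parts)) + mailto
    # Normalize case, validate, and de-duplicate.
    emails = sorted({e.lower() for e in found if EMAIL_RE.fullmatch(e)})
    return urls, emails


def store(conn, page_url, emails):
    """Save extracted addresses; repeated (page, address) pairs are ignored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS email "
        "(page TEXT, address TEXT, UNIQUE(page, address))"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO email VALUES (?, ?)",
        [(page_url, e) for e in emails],
    )
    conn.commit()
```

In a focused crawler, the URLs returned by `extract` would be filtered against the predefined topic before being queued for the next fetch; that filtering step is omitted here.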