Web Information Extraction Based on HTML Pages

Posted on: 2007-06-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Yuan
Full Text: PDF
GTID: 2208360185456135
Subject: Software engineering
Abstract/Summary:
Web information extraction is the process of extracting information of interest from Web documents. The technology is mainly used in meta-search and information agents. This thesis introduces the background and history of information extraction, and analyzes its system architecture, taxonomy, key technologies, and evaluation measures. It then introduces a filtering method based on domain knowledge. The filtering system has two parts. The first part uses rules provided by domain experts: a large number of web pages are scored by rule matching, and the pages that belong to the particular domain are selected. The second part performs URL clustering on the pages that passed the first part, yielding the set of web pages to be used for information extraction. The thesis next proposes a method for extracting topic information from template-generated web pages. Its main characteristics are: 1) the topic information is extracted directly, without first removing web page noise; 2) for the large number of web pages produced from the same template, once a template has been generated by machine learning, the topic information can be extracted directly, without analyzing each page individually; 3) news web pages are used as an example to show how the method is applied in practice. Finally, the thesis proposes a system model for topic-based (focused) Web information extraction, which selectively searches for pages related to a predefined set of topics and extracts information from them. The thesis presents this extraction system model and analyzes the implementation principles of its functional modules.
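The two-part filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the expert rules are expressed or how URLs are clustered, so everything here is an assumption made for illustration: the rules are keyword weights (`RULES`), a page is kept when its score reaches a hypothetical `THRESHOLD`, and URLs are grouped by the shape of their path with digit runs replaced by a placeholder. The function names and sample URLs are likewise hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical expert rules: keyword -> weight (not taken from the thesis).
RULES = {"football": 2.0, "match": 1.0, "league": 1.5}
# Hypothetical cut-off; pages scoring below it are treated as off-domain.
THRESHOLD = 3.0


def score_page(text, rules=RULES):
    """Sum the weights of the rule keywords present in the text (each counted once)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sum(weight for keyword, weight in rules.items() if keyword in words)


def filter_pages(pages, threshold=THRESHOLD):
    """Part 1: keep the URLs of pages whose rule score reaches the threshold."""
    return [url for url, text in pages.items() if score_page(text) >= threshold]


def url_pattern(url):
    """Reduce a URL to host + path shape, replacing digit runs with '{n}'."""
    parsed = urlparse(url)
    return parsed.netloc + re.sub(r"\d+", "{n}", parsed.path)


def cluster_urls(urls):
    """Part 2: group the filtered URLs that share the same path shape."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)


pages = {
    "http://news.example.com/sports/2007/101.html": "The football league match ended in a draw.",
    "http://news.example.com/sports/2007/102.html": "Football league standings after the weekend.",
    "http://news.example.com/about.html": "Contact the editors of this site.",
}
kept = filter_pages(pages)
clusters = cluster_urls(kept)
for pattern, members in clusters.items():
    print(pattern, len(members))
```

Grouping by path shape is only one plausible reading of "URL clustering"; it is used here because pages generated from the same template often share a URL layout, which fits the template-based extraction step that follows in the abstract.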
Keywords/Search Tags: Web information extraction, HTML structure tree, topic information