Research On The Technology Of Web Information Extraction

Posted on:2009-02-08

Degree:Master

Type:Thesis

Country:China

Candidate:X D Wang

GTID:2178360245988854

Subject:Computer application technology

Abstract/Summary:

With the rapid development and popularization of the Internet, more and more people obtain information from the Web. As a huge information source, the Web can be regarded as an enormous database containing a wide range of valuable information. The goal of Web information extraction is to extract the information that people are interested in and to make the extracted information more structured and more semantic. The technology originated from traditional information extraction, but it differs considerably from it, because traditional information extraction works on plain-text documents.

At present, a large amount of Web information is stored in the databases of websites, and the way this information is displayed on a web page shares common characteristics: the main part of the page is made up of several information blocks, and each information block contains several data items. Such pages are called data-rich web pages. Studying how to extract information from them is both meaningful and valuable.

This thesis focuses on a methodology for extracting information from such data-rich web pages, applying XML-related technologies to the problem. The solution proceeds as follows: first, obtain the web page; second, normalize the HTML document into a well-formed XML document using DOM; third, treat the information layout as a two-dimensional table; next, produce extraction rules, based on XPath representations of rows and columns, through interaction with the user; and finally, extract the information using these rules and describe the extraction results as an XML document.

A prototype system has been implemented that can extract information from data-rich web pages and news web pages. The experimental results show that the approach is practicable.
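The steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's prototype: the sample page, the class `HtmlToXml`, the helpers `normalize`, `extract`, and `to_xml`, and the hand-written row and column XPath expressions are all assumptions made for illustration. In particular, the XPath rules here stand in for rules that the thesis produces through interaction with the user, and the normalizer only handles mildly sloppy HTML (void tags and elements left open inside a closed parent).

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HtmlToXml(HTMLParser):
    """Hypothetical normalizer: builds a well-formed element tree from HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attrib = {name: value or "" for name, value in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, which also
        # implicitly closes anything left open inside it.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def normalize(html):
    """Step 2: turn the HTML text into an XML element tree."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return parser.root


def extract(root, row_xpath, column_xpaths):
    """Steps 3-5: rows are information blocks, columns are data items."""
    records = []
    for row in root.findall(row_xpath):
        record = {}
        for name, col_xpath in column_xpaths.items():
            cell = row.find(col_xpath)
            record[name] = "".join(cell.itertext()).strip() if cell is not None else ""
        records.append(record)
    return records


def to_xml(records):
    """Describe the extraction results as an XML document."""
    results = ET.Element("results")
    for record in records:
        item = ET.SubElement(results, "item")
        for name, value in record.items():
            ET.SubElement(item, name).text = value
    return ET.tostring(results, encoding="unicode")


# Assumed sample of a data-rich page: two information blocks, two data items each.
PAGE = """
<html><body>
<div class="list">
  <div class="block"><h3>XML in Practice</h3><span class="price">30.00</span><br></div>
  <div class="block"><h3>Learning XPath</h3><span class="price">25.50</span><br></div>
</div>
</body></html>
"""

# Assumed extraction rule: one XPath for rows, one relative XPath per column.
ROW_XPATH = ".//div[@class='list']/div"
COLUMN_XPATHS = {"title": "h3", "price": "span[@class='price']"}

if __name__ == "__main__":
    print(to_xml(extract(normalize(PAGE), ROW_XPATH, COLUMN_XPATHS)))
```

Separating the row expression from the per-column expressions mirrors the two-dimensional table view: the row XPath selects each information block, and each column XPath is evaluated relative to that block. The standard library's `xml.etree.ElementTree` supports only a subset of XPath, so a fuller implementation would likely need a complete XPath engine.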

Keywords/Search Tags:

Web information extraction, DOM, XML, XPath

Related items

1	Research On Web Information Extraction Techniques
2	Design And Implementation Of Accurate Web Information Extraction System
3	Semi-structured Web Information Extraction Technology And Its Application
4	Semi-structured In The Xml-based Web Information Extraction
5	Data Extraction Technology Research Based On The Location Of Web Information
6	Research Of Web Information Extraction Based On XML
7	Research And Application On The Technology Of Web Information Extraction Based On The HTML
8	Based The Multidimensional Semantics Internet Drug Information Extraction Research Applications
9	Research On The Technology Of Web Information Extraction
10	Research Of User-defined Requirements’WEB Information Extraction Based On XML