Research On The Technology Of Web Information Extraction

Posted on: 2009-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: X D Wang
GTID: 2178360245988854
Subject: Computer application technology
Abstract/Summary:
With the rapid development and spread of the Internet, more and more people obtain information from the Web. As a huge information source, the Web can be regarded as an enormous database containing a wide variety of valuable information. The goal of Web information extraction is to extract the information people are interested in and to make the extracted information more structured and semantically richer. The technology originated from traditional information extraction, but it differs considerably from it, because traditional information extraction works on plain-text documents.

At present, a large amount of Web information is stored in the databases of websites, and the way this information is displayed on web pages shares a common characteristic: the main part of the page is made up of several information blocks, and each information block contains several data items. Such pages are called data-rich web pages. Studying how to extract information from them is worthwhile and valuable.

This thesis focuses on a methodology for extracting information from the data-rich web pages described above, applying XML-related technologies to the problem. The solution consists of four steps: first, obtain the web page; second, normalize the HTML document into a well-formed XML document using DOM; third, treat the information layout as a two-dimensional table and, through interaction with the user, produce extraction rules based on XPath expressions that represent its rows and columns; finally, extract the information using the extraction rules and describe the extraction results in an XML document.

A prototype system has been implemented that can extract information from data-rich web pages and news web pages. The experimental results show that the approach is practicable.
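The normalize-then-extract steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's prototype: it swaps in the standard library's `html.parser` and `xml.etree.ElementTree` (which supports only a subset of XPath) in place of a full DOM and XPath engine, and the sample page, the rule names, and the `Normalizer` and `extract` helpers are all hypothetical. The row and column XPath rules are hard-coded here, whereas the thesis derives them through interaction with the user.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML (partial list, for illustration).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class Normalizer(HTMLParser):
    """Turns loosely written HTML into a well-formed element tree."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.stack = []  # names of the elements that are currently open

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)  # close void elements immediately
        else:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: ignore it
        while self.stack:  # close unclosed children, then the tag itself
            top = self.stack.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.stack:  # drop text that falls outside the root element
            self.builder.data(data)

    def finish(self):
        self.close()
        while self.stack:  # close anything still left open
            self.builder.end(self.stack.pop())
        return self.builder.close()


def normalize(html):
    parser = Normalizer()
    parser.feed(html)
    return parser.finish()


def extract(root, row_xpath, column_xpaths):
    """Treats the page as a table: one row rule, one rule per column."""
    results = ET.Element("results")
    for row in root.findall(row_xpath):
        record = ET.SubElement(results, "record")
        for name, col_xpath in column_xpaths.items():
            node = row.find(col_xpath)
            field = ET.SubElement(record, name)
            field.text = "".join(node.itertext()).strip() if node is not None else ""
    return results


# Hypothetical data-rich page: unquoted attributes, an unclosed <span>, a bare <br>.
PAGE = """<html><body>
<div class=item><span class=title>Book A</span><span class=price>10<br></div>
<div class=item><span class=title>Book B</span><span class=price>12</span></div>
</body></html>"""

ROW_RULE = ".//div[@class='item']"                 # each information block
COLUMN_RULES = {"title": ".//span[@class='title']",  # each data item
                "price": ".//span[@class='price']"}

result = extract(normalize(PAGE), ROW_RULE, COLUMN_RULES)
print(ET.tostring(result, encoding="unicode"))
```

Keeping the row rule separate from the column rules mirrors the two-dimensional table view: the row expression selects each information block, and the column expressions are evaluated relative to that block.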
Keywords/Search Tags: Web information extraction, DOM, XML, XPath