Vision-based Web Structural Information Extraction

Posted on: 2009-09-29
Degree: Master
Type: Thesis
Country: China
Candidate: K Zhu
GTID: 2178360242483064
Subject: Computer application technology
Abstract/Summary:
The web is perhaps the single largest data source in the world, and more and more organizations release their data through the Internet. Vertical search engines, also known as domain-specific search engines, send their spiders to crawl a particular type of information from web sites into a refined database, then integrate and post-process that information and make it available for users to query. Web information extraction is the foundation of a vertical search engine and the kernel module of the search engine's back end. Developing extraction programs by hand may be simple, but the approach has well-known shortcomings: the programs are difficult to maintain because web sites change constantly, and every new data source requires another program, which wastes resources.

This paper presents a vision-based web structural information extraction technique that makes use of both the structural information of HTML pages and their visual information. It consists of two steps: (1) identifying the individual data records in a page, and (2) aligning and extracting data items from the identified data records. In the first step, visual information helps to filter out most of the noise in the web page, which speeds up the HTML-structure-based algorithm and also makes it more accurate. In the second step, an improved tree alignment algorithm, which is efficient and robust, is used to align attributes. When multiple trees are aligned, the introduction of a seed tree reduces the amount of computation, which improves performance when the algorithm is applied to large web pages. The experiments show that the extraction method is highly automated, needs almost no manual intervention, and is both efficient and accurate.
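
The abstract does not spell out the improved tree alignment algorithm, so the following is only a minimal sketch of the general idea, using the classic simple tree matching recurrence as a stand-in rather than the thesis's own method. The Node class, the function names, and the rule that the seed tree is the largest record are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A simplified DOM node: a tag name plus ordered children (hypothetical type)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving matching.

    Classic simple tree matching: roots must share a tag, and children are
    matched with a dynamic program similar to longest common subsequence.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j] = best matching between the first i children of a
    # and the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = dp[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], paired)
    return dp[m][n] + 1  # +1 for the matched roots


def tree_size(node: Node) -> int:
    """Count the nodes in a tree."""
    return 1 + sum(tree_size(child) for child in node.children)


def pick_seed(records: List[Node]) -> Node:
    """Assumption: use the largest record tree as the seed."""
    return max(records, key=tree_size)


def match_against_seed(records: List[Node]) -> List[int]:
    """Score every record against the seed only, instead of all pairs."""
    seed = pick_seed(records)
    return [simple_tree_matching(seed, record) for record in records]
```

Comparing each record only with the seed takes one matching per record rather than one per pair of records, which is one plausible reading of how a seed tree reduces computation on large pages.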
Keywords/Search Tags:Vertical search, information extraction, vision-based