
Study Of Web Data Extraction Based On Webpage Structure

Posted on: 2010-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: H C Zhu
Full Text: PDF
GTID: 2178360278457599
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, data on the web has grown without restriction, and users can no longer find the data they need quickly and accurately among the mass of web data. How to obtain this data quickly and accurately is an urgent problem, and web data extraction has become an active research topic. By analyzing the structure of the data obtained from a particular website or web page and defining specific extraction rules, the information of interest can be extracted and saved to a database or other structured files, where it can be queried with SQL or an XML query language or provided to other applications.

This thesis introduces research on web data extraction and the web data extraction model. A prototype system was designed in Java and used to extract data from HTML. Because it did not take the arrangement structure of HTML documents into account, the system could not meet the extraction requirements. After analyzing the arrangement structure of HTML, a method that uses XSLT files for mapping was found to produce well-formed results for specific web pages. However, this method does not generalize well and places strict requirements on the structure of the web pages. Finally, this thesis proposes a web data extraction method for specific content, which uses a parsing algorithm combined with DOM to select specific nodes and then maps them with XSLT files. The method generalizes to a certain extent, and it is analyzed on specific content (news websites). The experimental results show that the method is feasible to a certain degree.
Keywords/Search Tags:Web Data Extraction, Arrangement Structure of HTML, XSLT, DOM
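
The final approach in the abstract (select specific nodes through DOM, then map them with an XSLT file) might look roughly like the Java sketch below. It is an illustration only, not the thesis's implementation: the class name `NewsExtractor`, the sample markup, the `class="news"` selection rule, and the stylesheet are all hypothetical, and the input is assumed to be well-formed XHTML (real-world HTML would first need a clean-up step that is not shown).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NewsExtractor {

    // Hypothetical mapping stylesheet: turns the selected <div> into a <news> record.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/div\">"
        + "<news>"
        + "<title><xsl:value-of select=\"h1\"/></title>"
        + "<body><xsl:value-of select=\"p\"/></body>"
        + "</news>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String extract(String xhtml) throws Exception {
        // Step 1: parse the (assumed well-formed) page into a DOM tree.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document page = builder.parse(new InputSource(new StringReader(xhtml)));

        // Step 2: select the content node. The rule used here (first <div>
        // with class="news") is an assumption made for this example.
        NodeList divs = page.getElementsByTagName("div");
        Element content = null;
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if ("news".equals(div.getAttribute("class"))) {
                content = div;
                break;
            }
        }
        if (content == null) {
            throw new IllegalArgumentException("no <div class=\"news\"> found");
        }

        // Copy the selected node into its own document so the stylesheet
        // sees only the selected content, not navigation or other page parts.
        Document fragment = builder.newDocument();
        fragment.appendChild(fragment.importNode(content, true));

        // Step 3: map the selected node to the output format with XSLT.
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(fragment), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page =
            "<html><body>"
            + "<div class=\"nav\">Home</div>"
            + "<div class=\"news\"><h1>Headline</h1><p>Story text.</p></div>"
            + "</body></html>";
        System.out.println(extract(page));
    }
}
```

Keeping the node selection in Java and the output mapping in the stylesheet means that, in a sketch like this, a page layout change affects only the selection rule, while a change to the target format affects only the XSLT file.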