Study Of Web Data Extraction Based On Webpage Structure

Posted on:2010-04-11

Degree:Master

Type:Thesis

Country:China

Candidate:H C Zhu

Full Text:PDF

GTID:2178360278457599

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, data on the web spreads without restriction, and users cannot quickly and accurately find the data they need in the mass of web data. How to obtain this data quickly and accurately is an urgent problem, and Web data extraction has become an active research topic. By analyzing the structure of data obtained from a particular website or web page and defining specific extraction rules, we can extract the information of interest and save it to a database or other formatted files, where it can be queried with SQL or an XML query language or supplied to other applications.

This thesis introduces research on Web data extraction and the Web data extraction model. A prototype system was designed in Java to extract data from HTML. Because it did not take the arrangement structure of HTML documents into account, the system could not meet the extraction requirements. After analyzing the arrangement structure of HTML, a method that maps pages with XSLT files was found to produce well-formed results for specific web pages, but the method does not generalize well and places strict requirements on page structure. Finally, this thesis proposes a Web data extraction method for specific content, which uses a parsing algorithm combined with DOM to select specific nodes and then maps them with XSLT files. The method achieves a degree of generality. It was applied to one kind of content (news websites), and the experimental results show that it is feasible to a certain degree.
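The abstract does not give the thesis's parsing algorithm or stylesheets, so the following is only a minimal Java sketch of the general idea it describes: parse a page into a DOM, select one content node, and map that node to structured output with XSLT. Everything specific here is an assumption made for illustration: the class name `NewsExtractor`, the `class="news"` marker used to select the node, the stylesheet, and the requirement that the input is already well-formed XHTML (real-world HTML would first need cleaning into well-formed markup).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NewsExtractor {
    // Hypothetical stylesheet: maps the selected <div> to a small XML record.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/div'>"
      + "<news><title><xsl:value-of select='h1'/></title>"
      + "<body><xsl:value-of select='p'/></body></news>"
      + "</xsl:template></xsl:stylesheet>";

    // Node selection step: return the first <div> whose class attribute
    // equals cls, or null if the page has no such node.
    static Element selectNode(Document doc, String cls) {
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element e = (Element) divs.item(i);
            if (cls.equals(e.getAttribute("class"))) {
                return e;
            }
        }
        return null;
    }

    // Parse well-formed XHTML, select the content node, and apply the XSLT mapping.
    // Returns an empty string when no matching node is found.
    static String extract(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document page = builder.parse(new InputSource(new StringReader(xhtml)));

            Element block = selectNode(page, "news");
            if (block == null) {
                return "";
            }

            // Copy the selected node into its own document so that the
            // stylesheet sees it as the document element ('/div').
            Document fragment = builder.newDocument();
            fragment.appendChild(fragment.importNode(block, true));

            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(fragment), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("extraction failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div class=\"nav\">Menu</div>"
            + "<div class=\"news\"><h1>Title</h1><p>Body text.</p></div>"
            + "</body></html>";
        System.out.println(extract(page));
    }
}
```

In this sketch the node selection carries the page-specific knowledge, while the stylesheet only has to describe the shape of the one selected block. That separation is one plausible reading of why combining DOM selection with XSLT mapping would be less sensitive to overall page layout than mapping whole pages with XSLT alone.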

Keywords/Search Tags:

Web Data Extraction, Arrangement Structure of HTML, XSLT, DOM

Related items

1	Collecting Technology, Based On The Ontology Web Non-normative Knowledge Processing
2	Based On The Html Pages Of Web Information Extraction
3	The Research On Web Information Extraction Based On HMM
4	Research And Design, Based On Xml And Xslt, Web Information Extraction
5	Data Extraction And Integration In HTML Tables
6	Research On The HTML And PDF Information Extraction Technology Based XML
7	Research Of Web Information Extraction Based On Table Structure
8	The Research On Key Technologies Of Intelligent Course Arrangement And The Development Of System In Police College
9	Smart Client And Office System Integration Of Applied Research
10	Research And Application On The Technology Of Web Information Extraction Based On The HTML