Research And Implementation Of A Web Information Extraction System Based On Semantic Structure Of The Website

Posted on:2008-09-06

Degree:Master

Type:Thesis

Country:China

Candidate:C L Wang

Full Text:PDF

GTID:2178360212974252

Subject:Computer application technology

Abstract/Summary:

The Internet is an extremely large information repository whose volume of data is growing at an exponential rate, which provides users with a valuable source of information. However, information on the Internet is massive, heterogeneous, variable and non-semantic, which makes it difficult to retrieve relevant data quickly and accurately from the tremendous number of web pages. Robust, flexible and automatic tools that help users retrieve information from the Internet effectively have therefore become a necessity.

This thesis presents a novel mechanism for automatic web information extraction based on the semantic structure of the website. It extracts information using a semantically meaningful logical view of the website, so that the computer can understand the meaning of the information to a certain extent, which makes the extraction process more efficient.

The thesis designs a web information extraction system based on the semantic structure of the website. The system consists of three main components: a website spider, a website semantic structure generator, and a web information extractor. The website spider searches the target website, records the link relations needed to generate the website directed graph, and downloads the pages from which relevant information is extracted. The website semantic structure generator translates the website directed graph (the physical structure of the website) into the website semantic structure, based on the web page classification that the website designer has made according to his understanding of the page content; that is, it produces a category relationship chart in accordance with the semantic classification of the website. The web information extractor extracts relevant information based on this classification.

A website spider was implemented in the thesis. The website spider can traverse websites, download web pages and generate website directed graphs. Several key implementation issues are discussed in detail. The thesis also proposes a web page classification based on the semantic meaning of the website. Once the website directed graph has been constructed, a topology that reflects the semantic meaning of the website is generated from the web page semantic classification and the website directed graph. Once the semantic structure of the website has been constructed, web information extraction can be performed based on this structure. The thesis presents a web...
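The first two stages described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual algorithms, so everything here is an assumption made for illustration: the function names `build_site_graph` and `category_graph`, the `fetch` callback, the `page_category` mapping (standing in for the designer's page classification), and the `example.test` pages. The spider is shown as a plain breadth-first traversal over in-memory pages rather than real HTTP downloads.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href values of <a> tags found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def build_site_graph(start_url, fetch):
    """Breadth-first crawl of one site (hypothetical helper).

    Returns a directed graph as {page_url: set of same-site URLs it links to}.
    `fetch` is any callable that maps a URL to its HTML text.
    """
    site = urlparse(start_url).netloc
    graph = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in graph:  # already visited
            continue
        parser = LinkParser()
        parser.feed(fetch(url))
        targets = set()
        for href in parser.links:
            target = urljoin(url, href).split("#")[0]  # resolve, drop fragment
            if urlparse(target).netloc == site:  # stay inside the website
                targets.add(target)
        graph[url] = targets
        queue.extend(t for t in targets if t not in graph)
    return graph


def category_graph(graph, page_category):
    """Lift page-level links to category-level edges (hypothetical helper).

    `page_category` maps each page URL to a category label; links between
    pages of the same category are ignored.
    """
    edges = set()
    for source, targets in graph.items():
        for target in targets:
            a, b = page_category[source], page_category[target]
            if a != b:
                edges.add((a, b))
    return edges


# Made-up pages standing in for a downloaded website.
PAGES = {
    "http://example.test/": '<a href="/news/">News</a> <a href="/about.html">About</a>',
    "http://example.test/news/": '<a href="a1.html">Story</a> <a href="/">Home</a> '
                                 '<a href="http://other.test/">External</a>',
    "http://example.test/news/a1.html": '<a href="/news/">Back</a>',
    "http://example.test/about.html": '<a href="/">Home</a>',
}
# Assumed classification of each page, as a site designer might assign it.
CATEGORIES = {
    "http://example.test/": "home",
    "http://example.test/news/": "news",
    "http://example.test/news/a1.html": "news",
    "http://example.test/about.html": "about",
}

graph = build_site_graph("http://example.test/", lambda url: PAGES.get(url, ""))
edges = category_graph(graph, CATEGORIES)
```

Here `graph` plays the role of the website directed graph (the physical structure), and `edges` is a simple stand-in for the category relationship chart. A real extractor would then pick the pages to process by category rather than by URL.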

Keywords/Search Tags:

web information extraction, web page classification, tag tree, web page noise reduction, spider

Related items

1	Research And Implementation On Key Technology Of Web Text Collection And Analysis
2	Research On Mining Structure Of WEB Page For Information Extraction
3	The Study And Implementation On The Key Problems Of Intelligent Search Engine Technology
4	Research On Web Page Classification And Information Collection
5	Research And Implementation Of WEB Page Body Information Extraction Based On DOM Tree
6	Research And Implementation Of Chinese Web-page Classification Based On Web Data-mining
7	Design And Implementation Of DOM-based Noise Reduction System
8	Research On Internet Public Opinion Information Extraction And Classification
9	A Research On Statistic-based Classification Of Chinese News Web Page
10	Research On Web Article Automatic Extraction Method Based On Page Segmentation