Research And Implementation Of Automatic Information Extraction From Web Pages

Posted on:2010-01-07

Degree:Master

Type:Thesis

Country:China

Candidate:J Zhang

GTID:2178360275951564

Subject:Computer application technology

Abstract/Summary:

With the rapid development of Internet technology, the Web has become a vast, distributed, and shared information resource. Most web pages today are written in HTML. Because HTML pages are semi-structured, they are easy for people to browse but difficult for applications to process and use. Web information extraction technology emerged to make web data more usable: it wraps web resources, extracts semi-structured data, and supports applications that consume web data. Research on web information extraction has therefore attracted much attention in recent years.

Wrapper technology performs information extraction from web pages and falls into three phases: manual, semi-automatic, and automatic. Wrappers from the manual and semi-automatic phases have two difficulties. First, they require users to master the relevant professional knowledge. Second, they are not easy to maintain. Based on a study of existing information extraction technology, we propose a tree-structure-based web data extraction algorithm. The main contributions of this dissertation are as follows:

1. An algorithm for acquiring similar web pages is proposed and implemented. It analyzes the structure of web pages to determine their type and then handles each type differently. The method has high accuracy.
2. A tree-structure-based web data extraction algorithm is proposed. By comparing different pages, it obtains the final wrapper tree and identifies the data pattern. Once semantic information is added to the wrapper tree, web data can be extracted correctly. Compared with existing web data extraction algorithms, this algorithm is a considerable improvement.
3. A general web information extraction system is designed and implemented. With this system, users can obtain the information they are interested in from HTML pages, and the system is general and flexible.

The approach presented in this dissertation addresses the problem of web information extraction more effectively and achieves higher precision.
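The abstract does not give the details of the algorithms, so the following is only a minimal sketch of the general idea behind comparing the tree structure of similar pages, not the dissertation's actual method. It assumes that two pages generated from the same template share tag paths, and that text which differs at the same tag path is data while identical text is template. The names `TagPathCollector`, `structural_similarity`, and `find_data_slots`, the Jaccard measure, and the sample pages are all illustrative assumptions.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagPathCollector(HTMLParser):
    """Records a (tag path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.items = []   # collected (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def text_nodes(html):
    """Return the (tag path, text) pairs of one page."""
    parser = TagPathCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def structural_similarity(page_a, page_b):
    """Jaccard similarity of the two pages' tag-path sets (assumed measure)."""
    paths_a = {path for path, _ in text_nodes(page_a)}
    paths_b = {path for path, _ in text_nodes(page_b)}
    if not paths_a and not paths_b:
        return 1.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)


def find_data_slots(page_a, page_b):
    """Aligned text nodes with the same path but different text count as data."""
    slots = []
    for (path_a, text_a), (path_b, text_b) in zip(text_nodes(page_a),
                                                  text_nodes(page_b)):
        if path_a == path_b and text_a != text_b:
            slots.append((path_a, text_a, text_b))
    return slots


# Two hypothetical pages produced by the same template.
page_a = ("<html><body><h1>Book</h1>"
          "<div><span>Title:</span><b>Dune</b></div>"
          "<div><span>Price:</span><b>9.99</b></div></body></html>")
page_b = ("<html><body><h1>Book</h1>"
          "<div><span>Title:</span><b>Emma</b></div>"
          "<div><span>Price:</span><b>7.50</b></div></body></html>")

if __name__ == "__main__":
    print(structural_similarity(page_a, page_b))
    for slot in find_data_slots(page_a, page_b):
        print(slot)
```

In this sketch the labels "Title:" and "Price:" are identical on both pages and are treated as template, while the text under `html/body/div/b` differs and is reported as data. Aligning text nodes by position is a simplification that only works when both pages have the same number of text nodes in the same order; a real wrapper would need proper tree alignment to handle repeated or optional regions.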

Keywords/Search Tags:

Web information extraction, DOM tree, Wrapper

Related items

1	Research And Implementation Of Page Object Extraction Model For Vertical Search Engine
2	Research Of A Suffix Tree Based Automatic Wrapper Generation Method
3	Research On Automatic And Efficient Technologies For Web Information Extraction
4	Research And Implementation Of Automatic Information Extraction From Web Pages
5	Web Page Attribute Extraction Method Research
6	Algorithm Research For Text Information Extraction Based On Wrapper Model
7	A Web News Extraction Method Based On Filtering Noise Wrapper
8	Web Information Extraction Technology Applied Research, Competitive Intelligence Platform In The Enterprise
9	Research On Web Information Extraction Based On Script Code And Local Data Matching
10	Research Of Data Extraction Technology Based On Tag Tree From List Pages