Study On Information Autonomous Extraction Technology Of Web Pages

Posted on:2007-08-22

Degree:Master

Type:Thesis

Country:China

Candidate:Y Miao

Full Text:PDF

GTID:2178360212958956

Subject:Computer technology

Abstract/Summary:


With the development of the Internet, the World Wide Web has become a huge, rapidly growing distributed information space. Because the Internet is open and heterogeneous, quickly obtaining what users need from the WWW is becoming more difficult. How to find the needed information quickly and accurately among many information resources has become a hard problem for Internet users. As a comparatively new research field, information extraction (IE) can be treated as a problem of natural language understanding.

Information extraction produces valuable information by analyzing the content and structure of the original documents and extracting meaningful facts. IE can help users find and browse useful information in web texts.

Data on the WWW lack structure and norms, which reduces the efficiency of finding information. In order to locate the right information resources on the web and to increase the efficiency of IE, we analyze hypertext files and obtain their features at three levels: text content, document structure, and text format.

The core of any IE system is its extraction model, which is used to extract related data from pages. At present, many researchers are studying methods for obtaining such models automatically and have made some progress. IE methods can be classified into inductive methods based on hierarchical structure and multi-record methods based on a conceptual model.

The features of web information must be fully considered when studying web IE so that information can be extracted more effectively. The basic task of IE is to analyze the structure and content of web documents, so analyzing web documents is the premise of IE.

This paper extracts page information on the basis of the W3C Document Object Model (DOM).

Autonomous extraction from web pages is intelligent extraction built on IE. Users can customize the information they want, i.e. to visit related pages...
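
The three feature levels named in the abstract (text content, document structure, text format) can be illustrated with a minimal Python sketch over a W3C-style DOM. This is not the thesis's method, which the abstract does not detail. The function name `extract_features`, the `FORMAT_TAGS` set, and the record layout are hypothetical. The sketch also assumes well-formed XHTML input, because the standard-library `xml.dom.minidom` parser rejects malformed markup.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

# Assumption: these element names count as text-format markup in this sketch.
FORMAT_TAGS = {"b", "strong", "i", "em", "u", "font"}


def extract_features(xhtml):
    """Return one record per non-empty text node of a well-formed XHTML string.

    Each record holds the three feature levels from the abstract:
    text content, structural path in the DOM tree, and enclosing format tags.
    """
    doc = minidom.parseString(xhtml)
    records = []

    def walk(node, path, formats):
        if node.nodeType == Node.TEXT_NODE:
            text = node.data.strip()
            if text:
                records.append({
                    "text": text,
                    "path": "/".join(path),
                    "formats": list(formats),
                })
        elif node.nodeType == Node.ELEMENT_NODE:
            name = node.tagName.lower()
            # Format elements describe presentation; all others describe structure.
            if name in FORMAT_TAGS:
                formats = formats + [name]
            else:
                path = path + [name]
            for child in node.childNodes:
                walk(child, path, formats)

    walk(doc.documentElement, [], [])
    return records


if __name__ == "__main__":
    page = "<html><body><h1>News</h1><p>Price: <b>42</b></p></body></html>"
    for record in extract_features(page):
        print(record)
```

Separating format tags from the structural path is one possible design choice: it lets two records with the same path but different emphasis be compared directly. Real web pages are rarely well-formed, so a tolerant HTML parser would be needed in practice.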

Keywords/Search Tags:

Semi-structured Data, Information Extraction, Wrapper, Document Object Model


Related items

1	Research On Semantic Information Extraction For Semi-structured Documents
2	Research On Keyword Extraction And Structured List Data Extraction
3	Research Of A Suffix Tree Based Automatic Wrapper Generation Method
4	Technology For Domain-oriented Automatic Information Extraction From Semi-structured Web
5	Research And Application Of Extraction Method Of Semi-structured Text Information
6	Research And Implementation On Chinese Web Pages-Oriented Information Extraction Technologies
7	Research On Feature Extraction Method Of Semi-structured Document
8	Research Of Schema Extraction Algorithm Of Semi-structured Data Based On OEM Model
9	Research And Implementation Of Page Object Extraction Model For Vertical Search Engine
10	Information Extraction For Semi-structured Chinese Resume