Study On Information Autonomous Extraction Technology Of Web Pages

Posted on: 2007-08-22; Degree: Master; Type: Thesis
Country: China; Candidate: Y Miao; Full Text: PDF
GTID: 2178360212958956; Subject: Computer technology
Abstract/Summary:
With the development of the Internet, the World Wide Web has become a huge, rapidly growing distributed information space. Because of the Internet's openness and heterogeneity, quickly obtaining what users need on the WWW is becoming more difficult. How to find the needed information quickly and accurately among many information resources has become a difficult problem for Internet users. As a comparatively new research field, information extraction (IE) can be treated as a natural language understanding problem.

IE produces valuable information by analyzing the content and structure of the original documents and extracting meaningful facts. It can help users find and browse useful information in web texts.

Data on the WWW lacks structure and standards, which reduces the efficiency of finding information. To locate the right information resources on the web accurately and to make IE more efficient, we analyze hypertext files and obtain their features at three levels: text content, document structure, and text format.

The core of any IE system is its extraction model, which is used to extract relevant data from pages. At present, many researchers are studying methods for obtaining such models automatically and have made some progress. IE methods can be classified into inductive IE methods based on hierarchical structure and multi-record IE methods based on conceptual models.

The features of web information must be fully considered when studying IE on the web so that information can be extracted more effectively. The basic task of IE is to analyze the structure and content of web documents, so analyzing web documents is the prerequisite for IE.

This thesis extracts page information on the basis of the W3C Document Object Model (DOM).

Autonomous extraction of web pages is intelligent extraction built on IE. Users can customize information, i.e. to visit related pages...
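
The three feature levels named in the abstract (text content, document structure, text format) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it builds a simplified DOM-like tree with the standard-library `html.parser` module rather than a full W3C DOM, and the names `Node`, `TreeBuilder`, `extract_features`, `FORMAT_TAGS`, and `VOID_TAGS` are hypothetical, as is the choice of which tags count as formatting cues.

```python
from html.parser import HTMLParser

# Assumption: tags treated as "text format" cues in this sketch.
FORMAT_TAGS = {"b", "strong", "i", "em", "u", "font", "h1", "h2", "h3"}
# Elements that never have a closing tag, so they must not become the open node.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One element in a simplified DOM-like tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this element

    def ancestors_and_self(self):
        node = self
        while node is not None:
            yield node
            node = node.parent

    def path(self):
        # Structural feature: tag path from the root down to this node.
        tags = [n.tag for n in self.ancestors_and_self()]
        return "/".join(reversed(tags))


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def extract_features(html):
    """Return one record per text-bearing node: content, structure, format."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    features = []

    def walk(node):
        text = node.text.strip()
        if text:
            formats = sorted(
                n.tag for n in node.ancestors_and_self() if n.tag in FORMAT_TAGS
            )
            features.append({"text": text, "path": node.path(), "formats": formats})
        for child in node.children:
            walk(child)

    walk(builder.root)
    return features


if __name__ == "__main__":
    page = "<html><body><h1>News</h1><p>Price: <b>42</b></p></body></html>"
    for record in extract_features(page):
        print(record)
```

Each record pairs a piece of text with its tag path and the formatting tags that enclose it, which is the kind of per-node evidence an extraction model could be built on.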
Keywords/Search Tags: Semi-structured Data, Information Extraction, Wrapper, Document Object Model