
Research And Implementation Of Automatic Information Extraction From Web Pages

Posted on: 2010-01-07
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhang
GTID: 2178360275951564
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, the Web has become a vast, distributed and shared information resource. Most web pages today are written in HTML. Because HTML pages are semi-structured, they are easy for people to browse but difficult for applications to process and reuse. Web information extraction technology emerged to make web data more usable: it wraps web resources, extracts semi-structured data, and supplies that data to applications. For this reason, Web information extraction has attracted much attention from researchers in recent years.

Wrapper technology carries out information extraction from web pages and comprises three phases: manual, semi-automatic and automatic. Manual and semi-automatic wrappers have two main difficulties. First, they require users to have the relevant professional knowledge. Second, they are not easy to maintain. Based on a study of existing information extraction technology, we propose a tree-structure-based web data extraction algorithm. The main contributions of this dissertation are as follows:

1. An algorithm for acquiring similar web pages is proposed and developed. We analyze the structure of web pages to determine their type, and then handle each type differently. The method has high accuracy.
2. A tree-structure-based web data extraction algorithm is proposed. By comparing different pages, we obtain the final wrapper tree and determine the data pattern. Once semantic information has been added to the wrapper tree, web data can be extracted correctly. Compared with existing web data extraction algorithms, the proposed algorithm shows considerable improvement.
3. A general web information extraction system is designed and developed. With this system, users can obtain the information they are interested in from HTML pages, and the system is both general and flexible.

The approach to Web information extraction presented in this dissertation better solves the problem of web information extraction and achieves higher precision.
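The idea behind the second contribution, comparing similar pages to obtain a wrapper tree, can be illustrated with a minimal Python sketch. This is not the dissertation's algorithm: the names `Node`, `TreeBuilder`, `parse`, `build_wrapper` and `extract` are hypothetical, the pages are assumed to be well-formed HTML with identical tag structure and no void elements such as `<br>`, and the semantic labelling step is not modelled. Text that is identical across the two sample pages is treated as template; text that differs is treated as a data slot.

```python
from html.parser import HTMLParser


class Node:
    """A tag node, a '#text' node, or a '#slot' placeholder in a wrapper tree."""

    def __init__(self, tag, text=""):
        self.tag = tag
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a minimal tag tree; assumes properly nested HTML without void tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def build_wrapper(a, b):
    """Align two trees position by position; differing text becomes a data slot."""
    if a.tag != b.tag or len(a.children) != len(b.children):
        raise ValueError("pages do not share the same structure")
    if a.tag == "#text":
        return Node("#text", a.text) if a.text == b.text else Node("#slot")
    wrapper = Node(a.tag)
    wrapper.children = [build_wrapper(x, y) for x, y in zip(a.children, b.children)]
    return wrapper


def extract(wrapper, page):
    """Collect the text found at each slot position of the wrapper."""
    if wrapper.tag == "#slot":
        return [page.text]
    values = []
    for w, p in zip(wrapper.children, page.children):
        values.extend(extract(w, p))
    return values


# Two sample pages induce the wrapper; a third page is then extracted with it.
page_a = "<html><body><h1>Books</h1><p>Title:</p><p>Dune</p></body></html>"
page_b = "<html><body><h1>Books</h1><p>Title:</p><p>Emma</p></body></html>"
page_c = "<html><body><h1>Books</h1><p>Title:</p><p>Ulysses</p></body></html>"

wrapper = build_wrapper(parse(page_a), parse(page_b))
print(extract(wrapper, parse(page_c)))  # ['Ulysses']
```

A practical system would need tolerant tree alignment (for example, handling repeated or optional subtrees) rather than the strict position-by-position match assumed here.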
Keywords: Web information extraction, DOM tree, Wrapper