The Research And Implementation Of A Web Information Extraction System Based On Grid

Posted on: 2007-11-01
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Gong
Full Text: PDF
GTID: 2178360185478169
Subject: Computer application technology
Abstract/Summary:
Data in web documents are usually wrapped in a large number of HTML tags, which makes them hard for application systems to use directly. Web information extraction must handle massive volumes of data and relies on many complex algorithms, so it runs inefficiently on an ordinary platform. A Grid offers a distributed parallel environment, and its service-oriented open architecture improves application flexibility and code reuse. Researching and developing an automatic web information extraction application based on Grid is therefore an important task.

The thesis introduces techniques related to web information extraction and analyses the algorithm adopted by RoadRunner, an excellent automatic web information extraction system. It then presents related work on Grid, focusing on the characteristics of Grid applications. The following sections address two problems: how to extract web information automatically, and how to implement the extraction on a Grid platform. In the first part, the thesis uses effective heuristic rules to automatically obtain a set of similar pages, and proposes two algorithms: one for two-stage cleaning of noisy web information and one for automatically deducing extraction rules. In the second part, the thesis analyses the parts of the web extraction application that can be parallelized, gives the corresponding Grid application model and programming model, introduces how to install and configure the Grid platform, describes the detailed steps of developing and deploying a set of services, and...
Keywords/Search Tags: Web Information Extraction, similar web pages, Web noisy information processing, Grid application, GT4
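
To illustrate why a set of similar pages helps automatic extraction, here is a minimal Python sketch. It is not the thesis's algorithm and not RoadRunner's; it only shows the general idea that when two pages share a template, the text that differs between them is likely to be data. The names `Tokenizer`, `tokenize`, and `varying_fields` are hypothetical, and the alignment uses the standard library's `difflib.SequenceMatcher` as a stand-in for a real matching procedure.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Flattens an HTML page into a list of (kind, value) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored in this sketch; only the tag name is kept.
        self.tokens.append(("tag", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text nodes
            self.tokens.append(("text", text))


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def varying_fields(page_a, page_b):
    """Return pairs of text values that differ between two similar pages.

    Tokens that match in both pages are treated as template; text tokens
    inside a mismatching region are treated as candidate data fields.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    fields = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            texts_a = [value for kind, value in a[i1:i2] if kind == "text"]
            texts_b = [value for kind, value in b[j1:j2] if kind == "text"]
            fields.extend(zip(texts_a, texts_b))
    return fields
```

Calling `varying_fields` on two pages generated from the same template returns the differing text values in pairs, while shared labels and tags are skipped. A real system would also have to handle optional and repeated elements, which this sketch ignores.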