The Design And Implementation Of A Multi-user Based Web Information Extraction System

Posted on: 2011-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: S Li
Full Text: PDF
GTID: 2248330395455551
Subject: Computer technology
Abstract/Summary:
The rapid growth of Web information poses a major challenge for extracting and using that information effectively. For enterprise customers in particular, extracting useful information from the pages of a large number of sites is an urgent problem. This project is a multi-user web information extraction system developed for such enterprise users. It provides a visual interface for configuring extraction rules and uses a C/S architecture to run users' extraction tasks on a remote server, which reduces the burden of configuring extraction rules and of running and maintaining tasks.
The system consists of a client and server-side components; the server side comprises an extraction server and a central server. Users set up extraction projects in the client and upload them to the central server. The central server assigns the tasks to the extraction server, and when extraction finishes, the extraction server sends the results to the user by email, FTP, or similar channels.
The system combines C/S and B/S architectures and is developed mainly in Python. The client uses JavaScript and DOM technology to implement a visual interface for configuring extraction projects, and it communicates with the server over the XML-RPC protocol. The extraction server tidies pages with html5lib and extracts page information with XPath, so it can handle a wide variety of pages, including some non-standard pages and HTML5 pages. The central server uses multiple processes and a database to distribute and manage tasks, and it uses the Django framework to provide a web-based back-end management module through which the administrator manages user accounts and extraction tasks.
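
The XPath-based extraction step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the page, the `RULE` layout, and the `extract` function are hypothetical, and the page is assumed to be already tidied into well-formed markup (the thesis uses html5lib for that step; it is omitted here so the example needs only the Python standard library, whose `xml.etree.ElementTree` supports a limited XPath subset).

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already tidied into well-formed markup.
PAGE = """
<html>
  <body>
    <div class="product">
      <h2>Widget</h2>
      <span class="price">9.99</span>
    </div>
    <div class="product">
      <h2>Gadget</h2>
      <span class="price">19.50</span>
    </div>
  </body>
</html>
"""

# Hypothetical extraction rule: one XPath that selects each record,
# plus one relative XPath per field inside that record.
RULE = {
    "record": ".//div[@class='product']",
    "fields": {"name": "h2", "price": "span[@class='price']"},
}


def extract(page, rule):
    """Apply an extraction rule to a page and return one dict per record."""
    root = ET.fromstring(page)
    rows = []
    for node in root.findall(rule["record"]):
        row = {}
        for field, path in rule["fields"].items():
            # findtext returns None when the path matches nothing.
            row[field] = (node.findtext(path) or "").strip()
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in extract(PAGE, RULE):
        print(row)
```

Keeping the rule as plain data (a record path plus field paths) is what would let a visual client generate it without the user writing any script; the exact rule format used by the system is not given in the abstract.
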
In addition, a Cacti server has been installed and configured on the server to monitor the system and help ensure its stability and reliability.
Extraction rules are configured entirely through the visual interface, with no scripts to write, so the system is easy to operate. Extraction tasks run on the remote server, which provides a stable, continuous extraction service without maintenance by the user. The system has completed testing and commissioning, and trial runs were successful. Compared with similar software, it is easier to operate and lower in cost, and it is suitable for large-scale market promotion.
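
The abstract states that the client uploads extraction projects to the central server over XML-RPC. The following is a rough sketch of that exchange using Python's standard `xmlrpc` modules. The `CentralServer` class, the `submit_project` method name, and the project fields are assumptions made for illustration; the abstract does not describe the actual remote interface.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class CentralServer:
    """Hypothetical stand-in for the central server's task store."""

    def __init__(self):
        self.tasks = []

    def submit_project(self, user, project):
        # Record the uploaded project and return a simple task id.
        self.tasks.append({"user": user, "project": project})
        return len(self.tasks)


def start_server(central):
    """Serve the central server on a free local port in a background thread."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(central.submit_project, "submit_project")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def upload(url, user, project):
    """Client side: upload one extraction project and return its task id."""
    with ServerProxy(url) as proxy:
        return proxy.submit_project(user, project)


def demo():
    central = CentralServer()
    server = start_server(central)
    host, port = server.server_address
    project = {
        "start_url": "http://example.com/list",  # hypothetical target page
        "record": ".//div[@class='product']",    # hypothetical XPath rule
        "deliver_by": "email",
    }
    try:
        task_id = upload(f"http://{host}:{port}/", "alice", project)
    finally:
        server.shutdown()
        server.server_close()
    return task_id, central.tasks


if __name__ == "__main__":
    print(demo())
```

In the real system the central server would persist tasks in a database and hand them to extraction servers; the in-memory list here only stands in for that step.
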
Keywords/Search Tags: Web Information Extraction, multi-user, XPath