The Design And Implementation Of A Multi-user Based Web Information Extraction System

Posted on:2011-10-12

Degree:Master

Type:Thesis

Country:China

Candidate:S Li

Full Text:PDF

GTID:2248330395455551

Subject:Computer technology

Abstract/Summary:

The rapid growth of Web information poses a major challenge for information extraction and effective use. For enterprise customers in particular, extracting useful information from the pages of a large number of sites is an urgent problem. This project is a multi-user Web information extraction system developed for such enterprise users. It provides a visual interface for configuring extraction rules and uses a C/S architecture to run users' extraction tasks on a remote server, reducing the users' burden of configuring rules and of running and maintaining tasks.

The system consists of client and server-side components; the server side comprises the extraction server and the central server. Users set up extraction projects in the client and upload them to the central server, which assigns the tasks to the extraction server. When extraction finishes, the extraction server sends the results to the user by email, FTP, or similar means.

The architecture combines C/S and B/S structures and is developed mainly in Python. The client uses JavaScript and DOM technology to implement a visual interface for configuring extraction projects, and communicates with the server over the XML-RPC protocol. The extraction server tidies pages with html5lib and extracts page information with XPath, so it can handle a wide variety of pages, including some non-standard and HTML5 pages. The central server uses multiple processes and a database to distribute and manage tasks, and uses the Django framework to provide a web-based back-end management module through which the administrator manages user accounts and extraction tasks. In addition, a Cacti server has been installed and configured on the server to monitor the system and help ensure its stability and reliability.

Configuration of extraction rules is entirely visual, requires no scripting, and is easy to operate. Extraction tasks run on the remote server, which provides a stable, continuous extraction service without maintenance by the user. The system has completed testing and commissioning, and trials were successful. Compared with similar software, it is easier to operate and lower in cost, and is suitable for large-scale marketing.
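The client-to-central-server upload described in the abstract can be illustrated with a minimal sketch using only Python's standard `xmlrpc` modules. This is not the thesis's implementation: the class `CentralServer`, the methods `submit_project` and `task_status`, the project fields, and the example URL are all hypothetical names chosen for illustration, and the in-memory dictionary stands in for the database the abstract mentions.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class CentralServer:
    """Hypothetical central server: stores uploaded projects as pending tasks."""

    def __init__(self):
        self._tasks = {}      # in-memory stand-in for the task database
        self._next_id = 1

    def submit_project(self, project):
        """Register an uploaded extraction project and return its task id."""
        task_id = self._next_id
        self._next_id += 1
        self._tasks[task_id] = {"project": project, "status": "pending"}
        return task_id

    def task_status(self, task_id):
        """Return the status of a task, or "unknown" if the id is not registered."""
        task = self._tasks.get(task_id)
        return task["status"] if task else "unknown"


def start_server(host="127.0.0.1", port=0):
    """Start the XML-RPC server in a background thread (port 0 = any free port)."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_instance(CentralServer())
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def demo():
    """Upload one project as a client would, then query its status."""
    server = start_server()
    host, port = server.server_address
    client = ServerProxy(f"http://{host}:{port}/")
    # Assumed project layout: a name, a start URL, and field -> XPath rules.
    project = {
        "name": "news-titles",
        "start_url": "http://example.com/news",
        "rules": {"title": "//h1/text()"},
    }
    task_id = client.submit_project(project)
    status = client.task_status(task_id)
    server.shutdown()
    server.server_close()
    return task_id, status
```

Calling `demo()` uploads one project and returns its task id and status. In a fuller system the central server would persist tasks and hand them to extraction-server processes; that part is left out of this sketch.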

Keywords/Search Tags:

Web Information Extraction, multi-user, XPATH

Related items

1	Research Of User-defined Requirements' Web Information Extraction Based On XML
2	Research And Application Of Internet Drug Information Extraction Based On Multidimensional Semantics
3	Research And Implementation Of Web Information Extraction Based On XML
4	Research On Web Information Extraction Techniques
5	Design And Implementation Of Accurate Web Information Extraction System
6	Semi-structured Web Information Extraction Technology And Its Application
7	The Study Of Semi-supervised Web Data Extraction Rule Induction Based On User Interaction
8	Semi-structured In The Xml-based Web Information Extraction
9	Data Extraction Technology Research Based On The Location Of Web Information
10	Research Of Web Information Extraction Based On XML