| With the rapid development of the Internet, the Web has become one of the most important knowledge repositories, and extracting and making full use of this information and knowledge is a valuable and promising research field. The goal of Web Information Extraction (Web IE) is to locate and identify information of interest on heterogeneous web sites and to organize the extracted information in a homogeneous, structured format. The inherent features of web sites, namely their huge number, varied formats, and frequent updates, pose problems of complexity, adaptability, and scalability for Web IE. Based on an analysis of the features of semi-structured documents, this thesis proposes an architecture for Web IE that uses a BP (back-propagation) neural network. The system uses XML as the representation model for web pages and a BP neural network to learn extraction rules. It comprises several knowledge repositories and three modules (web page preprocessing, rule learning, and information extraction), and it describes web pages from four aspects: semantic content display, logical structure, rule generation, and extraction results. The thesis concentrates on the rule learning method based on the BP neural network, in which a rule definition combines three features of a web page: path, left/right boundary, and semantics. The neural network takes the tag elements of the filtered DOM tree as input and the extraction results as output, and it is trained with the BP learning algorithm. After training, a rule learning algorithm generates simple, robust rules that the information extraction module can use. Experiments indicate that the system can learn rules for the domain of interest and that it has good adaptability and scalability. |