Design And Implementation Of A High Adaptability Web Information Extraction Mechanism

Posted on:2018-06-21

Degree:Master

Type:Thesis

Country:China

Candidate:S Y Li

Full Text:PDF

GTID:2348330518494470

Subject:Software engineering

Abstract/Summary:

With the rapid development of the Internet, the volume of data on it has grown explosively, and Web data has become the most promising and valuable source of information; big data analysis and cloud computing have risen as a result. Much current research and many applications, such as data integration and analysis systems, recommender systems, and data mining systems, are built on massive Web data. However, a Web page contains not only important data but also a great deal of noise. Extracting the required information from the massive, complex Web efficiently and accurately, so that it can be mined in depth for its potential value, has therefore become a meaningful and practical research topic.

The core of Web information extraction is the process of extracting the data points contained in semi-structured web pages scattered across the Internet and converting them into a form with clearer structure and semantics for deeper mining and use.

The main contents of this thesis are as follows. First, it introduces the concepts and principles of information extraction, analyzes and compares current extraction methods and technologies, and studies the characteristics of common information extraction approaches. Second, it designs an integrated, unified web information extraction mechanism based on rule configuration that combines the advantages of HTML-structure-based and template-based methods and can adapt quickly to extraction tasks in different domains. In designing the web information extraction system, the information extraction model is clarified, and key issues such as the definition and packaging of the rule system, information collection, information extraction, and automatic navigation are described in detail. Rule packaging, information collection, and information extraction are designed as independent modules, and an element information type library is also built into the system; finally, the system stores the extracted structured data in the corresponding local library according to the user's needs. On this basis, the system is implemented using a combination of Java and a Chrome extension.

Finally, experimental results and a system analysis are given. The results show that the web information extraction system designed in this thesis can meet the demands of all types of sites, and that its independent modular design allows it to achieve a high level of extraction efficiency and accuracy.
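The idea of rule-configured extraction described in the abstract can be illustrated with a minimal Java sketch. It is not the thesis's implementation: the class name `RuleExtractorSketch`, the method `extract`, the sample page, and the use of XPath expressions as the rule format are all assumptions made for illustration. It also assumes the page is well-formed XHTML so that the JDK's built-in XML parser can build the DOM; real-world HTML would normally need a more tolerant parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: extraction rules are kept separate from the extraction code,
// so a new site or domain only needs a new rule map, not new logic.
class RuleExtractorSketch {

    /**
     * Applies each rule (field name -> XPath expression) to the page's DOM and
     * returns one structured record. Assumes the input is well-formed XHTML.
     */
    static Map<String, String> extract(String xhtml, Map<String, String> rules) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            Map<String, String> record = new LinkedHashMap<>();
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                // evaluate(...) returns the text of the first matching node, or "" if none matches
                String value = xpath.evaluate(rule.getValue(), dom);
                record.put(rule.getKey(), value.trim());
            }
            return record;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not apply rules to page", e);
        }
    }

    public static void main(String[] args) {
        // Sample page: the navigation bar stands in for "noise" that no rule selects.
        String page = "<html><body>"
                + "<div id=\"nav\">Home | About</div>"
                + "<h1 class=\"title\">Sample product</h1>"
                + "<span class=\"price\">19.90</span>"
                + "</body></html>";

        // Rule configuration for one hypothetical site template.
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("title", "//h1[@class='title']");
        rules.put("price", "//span[@class='price']");

        System.out.println(extract(page, rules));
    }
}
```

Keeping the rule map outside the `extract` method mirrors, in a very reduced form, the separation between rule packaging and extraction that the abstract describes; storage in a local library and automatic navigation are left out of this sketch.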

Keywords/Search Tags:

Web Information Extraction, XML, DOM, Rule Template

Related items

1	Design And Implementation Of A High Adaptability Web Information Extraction Mechanism
2	Technology Research, The Concept Of Tree-based Web Information Extraction
3	The Study Of Rule Induction For Automatic WEB Data Extraction
4	Semi-structured Web Information Extraction Technology And Its Application
5	Web Text Of The Rule-based Information Extraction Technology Research
6	Information Extraction Algorithm Based On The Template Matching In Traffic Standards
7	Design And Implementation Of Web Information Extraction Rules
8	Research And Application Of Automatic Data Extraction From Template-generated Web Pages
9	The Design And Implementation Of Web Information Extraction System
10	Research On Web-based Full-station Data Information Extraction Based On Template