Design And Implementation Of A High Adaptability Web Information Extraction Mechanism

Posted on: 2018-06-21
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Li
Full Text: PDF
GTID: 2348330518494470
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, the volume of online data has grown explosively, and Web data has become the most promising and valuable source of information; big data analysis and cloud computing have risen as a result. Many current research areas and applications, such as data integration, analysis and integration systems, recommender systems, and data mining systems, are built on massive Web data. However, a Web page contains not only important data but also a large amount of noise. Extracting the required information from the massive and complex Web efficiently and accurately, so that it can be mined in depth for its potential value, has therefore become a meaningful and practical research topic. The core of Web information extraction is to extract the data points contained in semi-structured Web pages scattered across the Internet and to convert them into a form with clearer structure and semantics for in-depth mining and use.

The main contents of this thesis are as follows. First, it introduces the concepts and principles of information extraction, analyzes and compares current extraction methods and technologies, and studies the characteristics of common information extraction approaches. Second, it designs an integrated, unified Web information extraction mechanism based on rule configuration, which combines the advantages of HTML-structure-based and template-based methods and can quickly adapt to extraction tasks in different domains. In the design of the Web information extraction system, the information extraction model is clarified, and key issues such as the definition and packaging of the rule system, information collection, information extraction, and automatic navigation are expounded in detail. Rule packaging, information collection, and information extraction are designed as independent modules, and an element information type library is also designed in the system. Finally, the system stores the extracted structured data in the corresponding local library according to users' needs.

On this basis, the system is implemented using a combination of Java and a Chrome extension. The thesis concludes with experimental results and an analysis of the system. The results show that the Web information extraction system designed here can meet the extraction demands of all types of sites, and that its independent modular design allows it to achieve high extraction efficiency and accuracy.
Keywords/Search Tags: Web Information Extraction, XML, DOM, Rule Template
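
The rule-configuration idea summarized in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation, which uses Java and a Chrome extension; Python's standard library is used here only for brevity. The page fragment, the rule layout (`record`, `fields`, `path`, `attr`, `type`), and the function names `extract` and `store` are all hypothetical, and the "local library" is assumed to be a local database, represented by an in-memory SQLite table.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical page fragment; it must be well-formed XML for ElementTree.
PAGE = """
<html><body>
  <div class="ad">Buy now!</div>
  <ul id="books">
    <li class="item"><a href="/b/1">Web Mining</a><span class="price">35.0</span></li>
    <li class="item"><a href="/b/2">DOM Basics</a><span class="price">20.5</span></li>
  </ul>
</body></html>
"""

# Hypothetical rule template: one record per node matching "record";
# each field is located by a path relative to that node and carries a
# simple element type ("text" or "number").
RULE = {
    "table": "books",
    "record": ".//li[@class='item']",
    "fields": {
        "title": {"path": "a", "type": "text"},
        "link": {"path": "a", "attr": "href", "type": "text"},
        "price": {"path": "span[@class='price']", "type": "number"},
    },
}


def extract(page, rule):
    """Apply the rule to the page's DOM and return one dict per record."""
    root = ET.fromstring(page)
    rows = []
    for node in root.findall(rule["record"]):
        row = {}
        for name, spec in rule["fields"].items():
            target = node.find(spec["path"])
            if target is None:
                row[name] = None  # field missing on this record
                continue
            if "attr" in spec:
                raw = target.get(spec["attr"])
            else:
                raw = (target.text or "").strip()
            if spec["type"] == "number" and raw is not None:
                row[name] = float(raw)
            else:
                row[name] = raw
        rows.append(row)
    return rows


def store(rows, rule, conn):
    """Write extracted rows into a table named by the rule."""
    cols = list(rule["fields"])
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {rule['table']} ({', '.join(cols)})"
    )
    conn.executemany(
        f"INSERT INTO {rule['table']} VALUES ({', '.join('?' for _ in cols)})",
        [tuple(r[c] for c in cols) for r in rows],
    )


rows = extract(PAGE, RULE)
conn = sqlite3.connect(":memory:")
store(rows, RULE, conn)
```

Because the extraction logic reads everything site-specific from `RULE`, adapting the sketch to another page would mean writing a new rule rather than new code, which is the kind of adaptability the abstract attributes to rule configuration. The advertisement `div` is skipped simply because no rule path selects it.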