
Research Of Extraction WEB Information Based On Semantic Markup

Posted on: 2013-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: J Xia
Full Text: PDF
GTID: 2248330371958506
Subject: Computer application technology
Abstract/Summary:
Network attack monitoring and analysis systems, public opinion analysis systems, and email monitoring systems provide strong support for discovering, handling, and assessing hazardous network events. Web information extraction technology is the foundation of all of these systems.

This thesis surveys the main current approaches to Web information extraction (wrapper-based, DOM-tree-based, visual-feature-based, and statistical) and analyzes the strengths and weaknesses of each. It then proposes a new semi-automated Web information extraction method tailored to specific kinds of Web pages.

Wrapper-based extraction is the most popular approach in the information extraction field. Building on the wrapper idea, this thesis proposes a method that uses semantic tags to locate the position of the target information in a Web page and applies the repeating principle to the structure of the comment section to extract it. This method reduces the number of DOM tree operations and alleviates the problem that a correct wrapper cannot be generated for a previously unseen website.

DOM-tree-based extraction is another widely used approach. Building on it, this thesis validates the extracted information in reverse, within the scope of the corresponding DOM subtree. This compensates for the inability of pattern matching to verify the correctness of the extracted data, without reducing extraction efficiency.

Experiments indicate that the proposed methods meet practical requirements to a reasonable extent and are more efficient than commonly used Web information extraction algorithms. When a Web page contains two or more comments, the semantic-tag-based method performs particularly well in terms of accuracy and recall.
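
The abstract does not give the algorithm in detail, so the following is only a minimal illustrative sketch of the general idea of locating a region by a semantic marker and then extracting the repeated child structure inside it. It is not the thesis's implementation. The marker string `comment-list`, the sample page, the use of `class`/`id` attributes as "semantic tags", and all function names are assumptions made for illustration; only the Python standard library is used.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A very small DOM node: tag, attributes, and ordered children/text."""

    def __init__(self, tag, attrs, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.items = []  # child Nodes and text strings, in document order

    @property
    def children(self):
        return [i for i in self.items if isinstance(i, Node)]

    def text(self):
        parts = [i.text() if isinstance(i, Node) else i for i in self.items]
        return " ".join(p.strip() for p in parts if p.strip())


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML; tolerant of unclosed tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.items.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.items.append(data)


def find_by_marker(node, marker):
    """Depth-first search for the first element whose class or id contains marker."""
    for child in node.children:
        label = f"{child.attrs.get('class') or ''} {child.attrs.get('id') or ''}"
        if marker in label:
            return child
        found = find_by_marker(child, marker)
        if found is not None:
            return found
    return None


def extract_repeated(container, min_repeats=2):
    """Return the text of the most frequent (tag, class) child pattern,
    provided it repeats at least min_repeats times (assumed threshold)."""
    def signature(n):
        return (n.tag, n.attrs.get("class") or "")

    counts = Counter(signature(c) for c in container.children)
    if not counts:
        return []
    best, freq = counts.most_common(1)[0]
    if freq < min_repeats:
        return []
    return [c.text() for c in container.children if signature(c) == best]


def extract_comments(html, marker="comment-list"):
    builder = TreeBuilder()
    builder.feed(html)
    container = find_by_marker(builder.root, marker)
    return extract_repeated(container) if container is not None else []


# Hypothetical sample page, invented for this sketch.
PAGE = """
<div id="content">
  <h1>Article title</h1>
  <ul class="comment-list">
    <li class="comment"><b>alice</b> Nice post.</li>
    <li class="comment"><b>bob</b> I disagree.</li>
    <li class="ad">Buy now</li>
  </ul>
</div>
"""

if __name__ == "__main__":
    print(extract_comments(PAGE))
```

Restricting the search to the subtree under the located marker is what keeps the number of DOM operations small in this sketch, and the `min_repeats=2` default mirrors the abstract's observation that the approach works best when a page has at least two comments.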
Keywords/Search Tags: semantic markup, Web information extraction, repeating principle, public sentiment analysis