
Research Of Automatic Metadata Extraction From Template Web Pages

Posted on: 2011-11-19  Degree: Master  Type: Thesis
Country: China  Candidate: X Zhou  Full Text: PDF
GTID: 2178360305468307  Subject: Computer software and theory
Abstract/Summary:
Human society has progressed from an agricultural society through an industrial society to an information society, and is now moving toward an intelligent society. It can be said that human society is currently in the transition phase from the information society to the intelligent society. At this stage, information remains the mainstream and the foundation: people's learning, living and working depend to a large extent on Internet information resources. These resources come in various forms, including text, audio, video and graphics. However, the computer's ability to understand these carriers is still limited: text analysis is relatively mature, while speech processing, graphics and image recognition, and video recognition are still at an early stage. Moreover, collecting information from the vast information ocean is not feasible by human effort alone; it also requires the fast processing power of computers. Therefore, the main way for people to gather information is to extract textual information from Web pages with the help of computers. Information extraction technology greatly assists this process, and to a large extent it shifts people's role from mechanically copying information to being the decision-makers who define extraction rules. However, in a professional, service-oriented information gathering system, writing parsing rules manually is time-consuming and tedious because of the huge number of source Web sites.
How can people and machines complement each other, exploiting both human decision-making capacity and the machine's fast processing capability, to achieve higher accuracy and efficiency in information extraction? This is the main subject of this paper. The paper presents a framework for automatic metadata (also called subjective or thematic information) extraction from template Web pages. In this framework, automatic metadata extraction comprises three modules: the extraction rule generation module, the metadata extraction module, and the monitoring and automatic feedback module. The first of these is the most important and forms the foundation of the whole framework. The implementation of the algorithms involved in the extraction rule generation module is discussed in detail. The generation of extraction rules is divided into three phases: the document pre-processing stage, the theme block positioning stage, and the precise positioning of thematic information. In the Web document pre-processing stage, HTMLParser first transforms the Web document into a DOM tree, and then the independent node filtering algorithm (IDNFA) and the invalid node filtering algorithm (IVNFA) filter out noise. The second stage positions the subject blocks and is divided into two sub-phases. The first sub-phase locates the dynamic region blocks: since the theme region is a dynamic region block, a DOM tree matching algorithm first computes the maximum match value between the corresponding DOM trees of two template documents, separates repeated sub-trees from non-recurring sub-trees, and then locates the dynamic blocks.
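The DOM tree matching step above can be sketched as follows. The `Node` representation, the one-point-per-matching-tag score, and the `threshold` parameter are illustrative assumptions, not the thesis's exact maximum-match definition:

```python
# Sketch of top-down DOM tree matching between two pages from the same
# template. Subtrees that match strongly across both pages are the
# repeated (template) structure; children whose subtrees match poorly
# are candidates for the dynamic (theme) blocks. The scoring scheme
# (1 point per matched tag pair) is an assumption for illustration.

class Node:
    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = children or []

def tree_match(a, b):
    """Size of a simple top-down matching of two DOM trees:
    roots must share a tag, children are matched pairwise in order."""
    if a.tag != b.tag:
        return 0
    score = 1  # the matched root pair
    for ca, cb in zip(a.children, b.children):
        score += tree_match(ca, cb)
    return score

def dynamic_children(a, b, threshold=0):
    """Children whose subtrees match at or below the threshold across
    the two template pages are flagged as dynamic-block candidates."""
    return [ca for ca, cb in zip(a.children, b.children)
            if tree_match(ca, cb) <= threshold]
```

For two pages generated from one template, navigation and footer subtrees match almost perfectly, while the article-body subtree differs, so a low match score marks the dynamic region.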
The second sub-phase filters out non-thematic link blocks: it first locates the repeated regions of the DOM tree, and then filters non-thematic link blocks based on statistics of the numbers of links and non-links. In the precise positioning stage, heuristic rules are developed by analyzing various features of the information; the subject nodes are located according to these heuristic rules, and the path of each subject node in the DOM tree is then used as the extraction rule. Finally, the paper shows the main interface of the automatic information extraction system and gives extraction results for two news sites. In addition, the extraction recall rate (ERR, or R-Measure) and the extraction accuracy rate (EAR, or P-Measure) are used to indicate the performance of the system. The results show that the system can effectively replace manual work in automatically extracting information from Web pages and achieves good results, which also demonstrates its practical application value.
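The link/non-link statistic used to filter non-thematic link blocks can be sketched with Python's standard-library `html.parser`; the 0.5 link-density threshold is an assumed value for illustration, not one taken from the thesis:

```python
# Sketch of link-density filtering: a block whose visible text is
# dominated by anchor (<a>) text is treated as a navigation/link
# block rather than thematic content. The 0.5 threshold is assumed.

from html.parser import HTMLParser

class LinkDensityParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_link = 0      # nesting depth inside <a> tags
        self.link_chars = 0   # characters of anchor text
        self.text_chars = 0   # characters of non-anchor text

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if self.in_link:
            self.link_chars += n
        else:
            self.text_chars += n

def is_link_block(fragment, threshold=0.5):
    """True if the HTML fragment's text is mostly anchor text."""
    p = LinkDensityParser()
    p.feed(fragment)
    total = p.link_chars + p.text_chars
    return total > 0 and p.link_chars / total > threshold
```

A navigation list is almost all anchor text and is filtered out, while an article paragraph containing a single source link passes through.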
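The two reported measures correspond to the standard recall and precision definitions; a minimal computation, assuming item-level counts of correct, relevant, and extracted items, looks like this:

```python
# Minimal sketch of the two evaluation measures named in the abstract,
# assuming item-level counts: extraction recall rate (ERR / R-Measure)
# and extraction accuracy rate (EAR / P-Measure).

def err(extracted_correct, total_relevant):
    """Recall: fraction of relevant items that were extracted."""
    return extracted_correct / total_relevant

def ear(extracted_correct, total_extracted):
    """Precision: fraction of extracted items that are correct."""
    return extracted_correct / total_extracted
```

For example, if 95 of 100 relevant items are extracted correctly and 98 items are extracted in total, ERR = 0.95 and EAR = 95/98.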
Keywords/Search Tags:Web pages, subjective metadata, automatic information extraction, heuristic rule