Web Page Attribute Extraction Method Research

Posted on: 2013-01-08
Degree: Master
Type: Thesis
Country: China
Candidate: Q S Deng
Full Text: PDF
GTID: 2218330371453155
Subject: Computer application technology
Abstract/Summary:
The rapid growth of Web information means that the Web now contains an increasing variety of semi-structured information. However, most information available on the Web exists as semi-structured HTML pages, which applications cannot access and use directly. Techniques for automatically extracting semi-structured data from HTML pages have therefore become a research hotspot. Researchers have studied Web information extraction extensively, and many extraction techniques based on different principles have emerged.

Driven by practical needs, this thesis studies in depth the problem of identifying the authors of news pages and the problem of Wrapper failure detection, and attempts to solve both. Its main contributions are the following three:

1. A mechanism for identifying the authors of news pages. Drawing on methods for recognizing Chinese personal names in plain text, the mechanism combines the characteristics of Chinese names, the contextual features of news authors, and the structure of the Web page, and applies mutual information theory to identify Chinese authors in news pages.

2. A Wrapper failure detection mechanism. Information extraction based on Wrapper induction is commonly used in real-world Web applications. Starting from practical application requirements and drawing on existing research, the thesis proposes a Wrapper failure detection mechanism that meets those requirements. The mechanism computes several feature values over the result set of extracted attributes to estimate the probability that the Wrapper has failed, which provides the necessary basis for subsequent automatic Wrapper maintenance.

3. Extraction components for practical use. To meet practical needs, components for extracting the author, source, and other attributes of news pages were developed on the basis of algorithms such as the Chinese author identification mechanism. These components provide important basic data for downstream public opinion analysis. They have been used in real projects and have achieved good results.
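The feature-based check in contribution 2 can be illustrated with a minimal Python sketch. The abstract does not say which features or which probability model the thesis uses, so everything here is an assumption made for illustration: the three features (length, digit ratio, markup count), the z-score threshold, and the function names `features`, `fit`, `is_anomalous`, and `failure_score` are all hypothetical. The returned score is simply the fraction of extracted values that look unlike known-good ones, which is a rough stand-in for a failure probability rather than the thesis's actual estimate.

```python
from statistics import mean, pstdev

STD_FLOOR = 1e-6   # avoids division by zero when a feature never varies
Z_THRESHOLD = 3.0  # assumed cut-off for "unusual"; not taken from the thesis


def features(value: str) -> list[float]:
    """Hypothetical features of one extracted attribute value."""
    length = len(value)
    if length == 0:
        return [0.0, 0.0, 0.0]
    digit_ratio = sum(ch.isdigit() for ch in value) / length
    markup_count = value.count("<") + value.count(">")
    return [float(length), digit_ratio, float(markup_count)]


def fit(reference_values: list[str]) -> list[tuple[float, float]]:
    """Mean and (floored) standard deviation of each feature over
    known-good values. Expects a non-empty list."""
    columns = zip(*(features(v) for v in reference_values))
    return [(mean(col), max(pstdev(col), STD_FLOOR)) for col in columns]


def is_anomalous(model: list[tuple[float, float]], value: str) -> bool:
    """True if any feature lies more than Z_THRESHOLD deviations from the mean."""
    return any(
        abs(x - mu) / sigma > Z_THRESHOLD
        for x, (mu, sigma) in zip(features(value), model)
    )


def failure_score(model: list[tuple[float, float]],
                  extracted_values: list[str]) -> float:
    """Fraction of extracted values flagged as anomalous, in [0, 1].
    An empty result set is treated as a failed extraction."""
    if not extracted_values:
        return 1.0
    flagged = sum(is_anomalous(model, v) for v in extracted_values)
    return flagged / len(extracted_values)


# Illustrative data only: author names a working Wrapper might have produced.
reference = ["Zhang Wei", "Li Na", "Wang Fang", "Liu Yang", "Chen Jie"]
model = fit(reference)

print(failure_score(model, ["Zhao Lei", "Sun Li"]))  # 0.0
print(failure_score(model, ['<div class="nav">Home</div>',
                            "2013-01-08 10:30"]))    # 1.0
```

A score near 1 suggests the page layout has changed and the Wrapper needs maintenance. A real system would need features suited to each attribute type and a threshold tuned on real extraction logs.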
Keywords/Search Tags: Web information extraction, Chinese personal name recognition, Wrapper, Wrapper failure detection