Web Page Attribute Extraction Method Research

Posted on:2013-01-08

Degree:Master

Type:Thesis

Country:China

Candidate:Q S Deng

Full Text:PDF

GTID:2218330371453155

Subject:Computer application technology

Abstract/Summary:

The proliferation of Web information has made an increasing variety of semi-structured information available on the Web. However, most of this information is accessible only as semi-structured HTML pages, which applications cannot access and use directly. Techniques for automatically extracting semi-structured data from HTML pages have therefore become a research hotspot. Researchers have studied Web information extraction extensively and have proposed many extraction techniques based on different principles. Driven by practical requirements, this thesis studies two problems in depth, author identification in news pages and Wrapper failure detection, and attempts to solve them. The work makes three main contributions:

1. A mechanism for identifying the author of news pages. Drawing on methods for Chinese name recognition in plain text, the mechanism combines the characteristics of Chinese names, the contextual features of news authors, and Web page structure, and applies mutual information theory to identify Chinese authors in news pages.

2. A Wrapper failure detection mechanism. Wrapper induction is an information extraction approach commonly used in real Web applications. Starting from practical application requirements and drawing on existing research, the thesis proposes a Wrapper failure detection mechanism that meets those requirements. The mechanism computes a number of feature values over the set of attribute values extracted by a Wrapper and uses them to estimate the probability that the Wrapper has failed. This provides the basis needed for later automatic Wrapper maintenance.

3. Extraction components for news pages. To meet practical needs, components for extracting the author, source, and other attributes of news pages were developed on the basis of the Chinese author identification mechanism and related algorithms. These components supply important basic data for subsequent public opinion analysis. They have been used in real projects and have achieved good results.
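
The abstract does not give the details of the Wrapper failure detection mechanism, so the following is only a minimal sketch of the general idea stated in the second contribution: summarise an extracted result set with a few feature values and compare them with values learned from known-good extractions. The three features, the Gaussian assumption, the independence assumption, the `min_std` floor, and every function name here are illustrative assumptions, not the thesis's method. The returned value is a heuristic score between 0 and 1 rather than a calibrated probability.

```python
import math
from statistics import mean, pstdev


def features(values):
    """Summarise one extracted result set with a few simple numeric features."""
    n = len(values)
    if n == 0:
        return {"mean_length": 0.0, "digit_ratio": 0.0, "empty_ratio": 1.0}
    total_chars = sum(len(v) for v in values)
    digits = sum(ch.isdigit() for v in values for ch in v)
    return {
        "mean_length": total_chars / n,
        "digit_ratio": digits / total_chars if total_chars else 0.0,
        "empty_ratio": sum(1 for v in values if not v.strip()) / n,
    }


def fit_profile(training_sets, min_std=0.05):
    """Learn (mean, std) per feature from result sets known to be correct.

    min_std is an assumed floor that keeps a constant feature from
    producing a zero standard deviation.
    """
    rows = [features(s) for s in training_sets]
    profile = {}
    for name in rows[0]:
        column = [row[name] for row in rows]
        profile[name] = (mean(column), max(pstdev(column), min_std))
    return profile


def failure_score(profile, values):
    """Return a heuristic score in [0, 1]; higher means failure is more likely."""
    feats = features(values)
    p_ok = 1.0
    for name, (mu, sigma) in profile.items():
        z = abs(feats[name] - mu) / sigma
        # Two-sided normal tail probability of a deviation at least this large.
        p_ok *= math.erfc(z / math.sqrt(2))
    return 1.0 - p_ok


# Hypothetical known-good extractions of the "author" attribute.
good_author_sets = [
    ["张三", "李小明", "王芳"],
    ["赵雷", "陈晓东"],
    ["刘洋", "孙丽华", "周杰", "吴敏"],
]
profile = fit_profile(good_author_sets)

plausible = failure_score(profile, ["黄磊", "林志玲", "何炅"])
suspicious = failure_score(profile, ["2013-01-08", "2013-01-09"])
```

In this sketch a result set of short names scores low, while a set of date strings, as might be extracted after a page layout change shifts the target node, scores close to 1. A deployment would compare the score with a threshold chosen on held-out pages before triggering Wrapper maintenance.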

Keywords/Search Tags:

Web information extraction, Chinese personal name recognition, Wrapper, Wrapper failure detection

Related items

1	Research For Information Extraction Based On Wrapper Model Algorithm
2	Research And Implementation Of Page Object Extraction Model For Vertical Search Engine
3	Algorithm Research For Text Information Extraction Based On Wrapper Model
4	A Web News Extraction Method Based On Filtering Noise Wrapper
5	Application of wrapper methods to non-invasive brain-state detection: An opto-electric approach
6	Research On Wrapper Adaptation In Web Data Integration
7	A Domain Knowledge-based Personalized Comparison Shopping System: Design And Implementation
8	The Design And Implementation Of Mediator And Wrapper Mechanism In Massive Multi-Database Integration
9	Research Of A Suffix Tree Based Automatic Wrapper Generation Method
10	Research And Implementation On Chinese Web Pages-Oriented Information Extraction Technologies