Research And Application Of Web Pages Denoising And Information Extraction Algorithm

Posted on:2014-01-21

Degree:Master

Type:Thesis

Country:China

Candidate:Z K Shao

Full Text:PDF

GTID:2268330425476515

Subject:Computer application technology

Abstract/Summary:

With the development and popularization of the Internet, more and more people rely on the Web for information. However, to serve commercial interests and promotional needs, Web pages are filled with noise, which seriously interferes with people's access to information. This thesis therefore presents a DOM tree based method for Web information extraction.

An analysis of existing work on Web denoising and information extraction shows that some DOM tree based methods cannot handle Web pages that contain no hyperlinks, or pages whose main content is distributed across DIV tags. To solve these problems, the thesis does the following:

1. Uses the VIPS method to divide the page into information blocks and noise blocks.

2. Transforms each divided block into a DOM tree structure. Each block delineated by VIPS is a tree, so the Web page is divided at a finer granularity.

3. Recursively extracts information from the tags in the DOM tree, which handles main content located in TABLE and DIV tags. At the same time, content is extracted according to text similarity and the co-occurrence frequency between the title and the words in a node. When computing the co-occurrence frequency between page title words and a node, title words are given a larger weight and words from the article a smaller weight. This improves the accuracy of information extraction.

Finally, a simple system was implemented using JTidy and a crawler. URLs that meet the conditions are added to the extraction queue according to the correlation between the URL and the subject. Web pages are downloaded when the similarity between the page's theme and body content and the news category satisfies the condition. The headline, content, time, and other relevant information are extracted from the news pages and saved to a database. Tests of Web information extraction indicate that the algorithm is effective.
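The recursive extraction and title-weighted scoring described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: it skips the VIPS segmentation step entirely (which needs rendered layout information) and uses Python's standard html.parser instead of JTidy. The names TreeBuilder, own_text, score and extract_main_text, the two weights, the threshold, and the exact scoring formula are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# All constants below are illustrative assumptions, not values from the thesis.
CONTAINER_TAGS = {"body", "div", "table", "tbody", "tr", "td"}
SKIP_TAGS = {"script", "style"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}
TITLE_WEIGHT = 2.0   # larger weight for words that also occur in the page title
BODY_WEIGHT = 0.5    # smaller weight for the remaining words of a block


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []  # child Nodes and text strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a small DOM-like tree and tolerates unbalanced end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def own_text(node):
    """Text of a node, excluding nested containers and script/style."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif child.tag not in SKIP_TAGS and child.tag not in CONTAINER_TAGS:
            parts.append(own_text(child))
    return " ".join(p for p in parts if p)


def find_title(node):
    for child in node.children:
        if isinstance(child, str):
            continue
        if child.tag == "title":
            return own_text(child)
        found = find_title(child)
        if found:
            return found
    return ""


def score(node, title_words):
    """Weighted co-occurrence of a block's words with the title words."""
    toks = tokens(own_text(node))
    hits = sum(1 for t in toks if t in title_words)
    if hits == 0:  # no overlap with the title: treat the block as noise
        return 0.0
    return TITLE_WEIGHT * hits + BODY_WEIGHT * (len(toks) - hits)


def extract(node, title_words, threshold):
    """Recursively collect container blocks whose score passes the threshold."""
    blocks = []
    if node.tag in CONTAINER_TAGS and score(node, title_words) >= threshold:
        blocks.append(own_text(node))
    for child in node.children:
        if not isinstance(child, str):
            blocks.extend(extract(child, title_words, threshold))
    return blocks


def extract_main_text(html, threshold=3.0):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    title_words = set(tokens(find_title(builder.root)))
    return extract(builder.root, title_words, threshold)
```

Called on a page whose title is "City opens new library", extract_main_text keeps a TD or DIV block such as "The new library opens in the city centre on Monday." and drops a navigation block or an advertisement that shares no words with the title. Scoring each container only on its own text, rather than on everything beneath it, is a design choice of this sketch so that an outer DIV does not automatically outscore the inner block that actually holds the article.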

Keywords/Search Tags:

Page denoising, DOM, Information Extraction, VIPS algorithm, tags

Related items

1	The Research Of Web Pages Information Extraction Based On Page Structure Analysis Technique
2	Research Of Web Information Extraction Method Based On Multi-feature Mining
3	Based On The Key Pages Of Information To Improve The Hits Algorithm, And Location Information Extraction Method
4	Web Topic Information Extraction System Design And Implementation
5	Research On Deep Web Information Extraction Based On Visual Block And Semantic DOM
6	Research On Technique Of Self-adaptive Web Data Extraction
7	The Research Of Semi-structured Web Pages Information Extraction
8	Research And Implementation Of Webpage Cleaning Algorithm Based On Visual Information
9	Research On Page Segmentation Based On CEF
10	Research Of Automatic Metadata Extraction From Template Web Pages