Research And Application Of Web Pages Denoising And Information Extraction Algorithm

Posted on: 2014-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: Z K Shao
Full Text: PDF
GTID: 2268330425476515
Subject: Computer application technology
Abstract/Summary:
With the development and popularization of the Internet, more and more people rely on the network for information. However, to serve commercial interests and promotional needs, Web pages are filled with noise that seriously interferes with people's access to information. This thesis therefore presents a DOM-tree-based method for Web information extraction.

An analysis of existing work on Web denoising and information extraction finds that some DOM-tree-based methods cannot handle Web pages that contain no hyperlinks or whose main content is distributed across DIV tags. To solve these problems, the method proceeds as follows:

1. Use the VIPS method to divide the page into information blocks and noise blocks.
2. Transform the divided blocks into a DOM tree structure. Each block delineated by VIPS is a tree, so the page is divided at a finer granularity.
3. Recursively extract information from the tags in the DOM tree, which handles main content located in TABLE and DIV tags. At the same time, content is extracted according to text similarity and the co-occurrence frequency between the title and the words in a node. When computing the co-occurrence frequency of page title words in a node, title words are given a larger weight and words in the article body a smaller weight, which improves the accuracy of information extraction.

Finally, a simple system is implemented using JTidy and a crawler. URLs that meet the conditions are added to the extraction queue according to the correlation between the URL and the subject. Pages are downloaded if the similarity between the page's theme and body content and the news category satisfies the condition. The headline, content, time, and other relevant information are extracted from the news pages and saved to a database. Tests of Web information extraction indicate that the algorithm is effective.
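
The recursive extraction and title-weighted scoring described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract gives no formula, so the weights, the threshold, the scoring function, and all names below (`TITLE_WEIGHT`, `BODY_WEIGHT`, `score`, `extract`, `extract_content`) are assumptions made for illustration. The sketch also assumes the page has already been tidied into well-formed markup (the role JTidy plays in the thesis) and it skips the VIPS segmentation step entirely.

```python
import re
import xml.etree.ElementTree as ET

# Assumed weights: the abstract only says title words weigh more than body words.
TITLE_WEIGHT = 3.0
BODY_WEIGHT = 1.0

# Tags treated as structural containers to recurse into (an assumption).
CONTAINER_TAGS = {"html", "body", "div", "table", "tbody", "tr", "td"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(text, title_words):
    """Average token weight: tokens that also occur in the title count more."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    total = sum(TITLE_WEIGHT if t in title_words else BODY_WEIGHT for t in tokens)
    return total / len(tokens)


def extract(node, title_words, threshold, out):
    """Recurse through container tags; score each innermost block as one unit."""
    containers = [child for child in node if child.tag in CONTAINER_TAGS]
    if containers:
        # Simplification: text sitting directly in a mixed container is ignored.
        for child in containers:
            extract(child, title_words, threshold, out)
        return
    text = " ".join("".join(node.itertext()).split())
    if text and score(text, title_words) >= threshold:
        out.append(text)


def extract_content(page, title, threshold=1.1):
    """Return the text blocks whose title-weighted score reaches the threshold."""
    out = []
    extract(ET.fromstring(page), set(tokenize(title)), threshold, out)
    return out


# Hypothetical sample page: a link bar, two content blocks, and an advertisement.
PAGE = """
<html><body>
  <div id="nav"><a href="/">Home</a> <a href="/sports">Sports</a></div>
  <div id="main">
    <table><tr><td><p>The city council approved the new bridge budget on Monday.</p></td></tr></table>
    <div><p>Construction of the bridge is expected to start next spring.</p></div>
  </div>
  <div id="ad">Buy cheap watches online today</div>
</body></html>
"""
TITLE = "City council approves bridge budget"

if __name__ == "__main__":
    for block in extract_content(PAGE, TITLE):
        print(block)
```

With these assumed weights, a block that shares no word with the title scores exactly `BODY_WEIGHT`, so any threshold slightly above that value drops blocks with no title overlap while keeping content in both TABLE and DIV containers. A real system would tune the weights and threshold on labeled pages and combine this score with the text-similarity measure mentioned in the abstract.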
Keywords/Search Tags: Web page denoising, DOM, information extraction, VIPS algorithm, tags