
Research on Extracting and Purifying Web Information

Posted on: 2009-08-16
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Dong
Full Text: PDF
GTID: 2178360242480333
Subject: Software engineering
Abstract/Summary:
One important measure of a search engine is search precision, and clean content is the foundation on which a search engine is built. As the basis of Web information preprocessing, content extraction has therefore attracted growing attention. There are many kinds of information extraction methods, and their algorithms vary considerably in complexity.

Block-based content extraction parses the tag tree from the bottom up. It divides the Web page into blocks of different importance, analyzes each block's importance degree and property values, and extracts the block that actually carries the content. This method can handle most Web pages, but classifying the blocks is the difficult part of applying it. It can also be used in many other fields of Web extraction, such as extracting title information and reducing Web noise. (A sketch of the block-scoring idea follows this section.)

Another extraction method suits "content-dominated" Web pages. Such pages always contain large blocks of text, and these blocks carry the main information the page is meant to express. This paper analyzes the structural characteristics of this kind of page and recasts the problem as: given the HTML source of a content-dominated page, find the best range of the content main body. An FFT-based algorithm for extracting the main body is presented.

HTML-code-density-based content extraction focuses on simple HTML statistics: the content region contains few HTML tags, whereas advertisements, sidebars, frames, and navigation contain many. The method computes the HTML code density and chooses a fixed value as the dividing point; the parts whose density falls below that value are taken as content.

The extraction methods above can extract content precisely, but their output is always plain text: if the input is HTML, the output is text only. For applying the algorithms this is not enough, because users may care not only about the text but also about the links, images, or scripts.

This paper therefore proposes DOM-tree-based content extraction, which parses the HTML into a DOM tree. According to parameters selected by the user, it checks each DOM-tree node in order to record the node's information, modify its attributes, or delete it; different parameters give different extracted information and different extraction granularity. When the result is output, the DOM tree is serialized back to HTML. Because the structure of the DOM tree is not destroyed, the output differs only slightly from the input in structure and keeps the whole page style. In the DOM model, an HTML document is parsed into different kinds of nodes, each with its own attributes and methods, which can be used to traverse the whole document tree.

This paper implements a project, GeneralExtractor, based on DOM-tree content extraction. GeneralExtractor first parses the input HTML into a DOM tree and then defines a set of filters over the properties of the DOM-tree nodes. Like a sieve, this set of filters deletes the nodes the user does not want; what remains is what the user wants. Illustrative sketches of these extraction methods are given below.
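As a loose illustration of the block-based idea, the following sketch scores candidate blocks from the bottom of the tag tree by how much non-link text they carry and keeps the best one. The candidate tag list, the text-minus-link-text score, and the use of BeautifulSoup are illustrative assumptions; the abstract does not spell out the thesis's actual importance measure.

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    BLOCK_TAGS = ["div", "td", "table", "article", "section"]

    def block_score(tag):
        # Bottom-up importance heuristic (assumed): plain text is good,
        # anchor text suggests navigation, so link-heavy blocks are penalized.
        text_len = len(tag.get_text(" ", strip=True))
        link_len = sum(len(a.get_text(" ", strip=True)) for a in tag.find_all("a"))
        return text_len - 2 * link_len

    def extract_main_block(html):
        soup = BeautifulSoup(html, "html.parser")
        blocks = soup.find_all(BLOCK_TAGS)
        if not blocks:
            return ""
        best = max(blocks, key=block_score)
        return best.get_text("\n", strip=True)

The hard step the thesis points at, telling content blocks apart from look-alike noise blocks, is exactly what this single heuristic papers over.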
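The abstract names an FFT-based main-body algorithm but gives no details. Purely as an assumed illustration of how a frequency-domain step could locate a main-body range, the sketch below low-pass-filters a per-line text-density signal with NumPy's FFT and returns the widest above-average span; none of this is the author's actual algorithm.

    import re
    import numpy as np

    TAG_RE = re.compile(r"<[^>]*>")

    def line_density(line):
        # Characters left after stripping tags: a rough text-density signal.
        return len(TAG_RE.sub("", line).strip())

    def main_body_range(html, keep_ratio=0.1):
        lines = html.splitlines()
        signal = np.array([line_density(l) for l in lines], dtype=float)
        spectrum = np.fft.rfft(signal)
        cutoff = max(1, int(len(spectrum) * keep_ratio))
        spectrum[cutoff:] = 0                    # low-pass: drop high frequencies
        smooth = np.fft.irfft(spectrum, n=len(signal))
        above = smooth > smooth.mean()
        # Widest contiguous run of above-average density = candidate main body.
        best_start = best_end = 0
        start = None
        for i, flag in enumerate(list(above) + [False]):  # sentinel closes last run
            if flag and start is None:
                start = i
            elif not flag and start is not None:
                if i - start > best_end - best_start:
                    best_start, best_end = start, i
                start = None
        return best_start, best_end              # line range of the main body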
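The code-density method can be sketched more directly. Below, the density of a source line is the fraction of its characters spent on HTML tags, and lines below a fixed dividing value are kept as content; the threshold 0.3 is an assumed value, not one taken from the thesis.

    import re

    TAG_RE = re.compile(r"<[^>]*>")

    def code_density(line):
        # Fraction of characters in this source line spent on HTML tags.
        if not line.strip():
            return 1.0
        tag_chars = sum(len(m) for m in TAG_RE.findall(line))
        return tag_chars / len(line)

    def extract_by_density(html, threshold=0.3):
        kept = []
        for line in html.splitlines():
            if code_density(line) < threshold:   # below the dividing point
                text = TAG_RE.sub("", line).strip()
                if text:
                    kept.append(text)
        return "\n".join(kept)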
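The filtering idea behind GeneralExtractor might look like the following sketch: parse the page into a tree, delete every node matched by a user-selected filter, and serialize the tree back to HTML so the page structure and style survive. The two example filters are illustrative stand-ins, not the thesis's own filter set, and BeautifulSoup stands in here for a full DOM implementation.

    from bs4 import BeautifulSoup

    def drop_scripts(tag):
        return tag.name in ("script", "style")

    def drop_ad_like(tag):
        ident = " ".join([tag.get("id") or ""] + (tag.get("class") or []))
        return "ad" in ident.lower()

    def general_extract(html, filters=(drop_scripts, drop_ad_like)):
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup.find_all(True):          # visit every element node
            if tag.decomposed:                   # already removed with a parent
                continue
            if any(f(tag) for f in filters):
                tag.decompose()                  # delete the node and its subtree
        return str(soup)                         # output is still renderable HTML

Because the tree is only pruned, never rebuilt, the output keeps the input's structure and style, which is the property the thesis emphasizes.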
Because the result is still HTML that a Web browser can display, automatic extraction of useful and relevant content from Web pages has many applications, ranging from letting end users access the Web more easily on constrained devices such as PDAs and cellular phones to providing better access to the Web for the visually impaired.

Compared with the other methods this paper introduces, the DOM-tree-based method is different: it focuses on how to delete the nodes the user does not want, so that what remains is the desired content. In fact, the diversity of Web page structures and Web information means that these algorithms can only approximate content extraction rather than solve it exactly; changing the direction of attack may improve the results.

This paper also proposes a novel approach to reducing noise content in Web pages. It uses a tree structure, the DOM tree, to capture the common layout of the pages of a given Web site, and it introduces an entropy-based measure on DOM-tree nodes to remove the site's noisy blocks: the entropy of each node is calculated according to the entropy formula, H = -Σ p_i log p_i, and noisy nodes are judged by their entropy (a sketch follows below). Although this paper implements only a simple content extractor, GeneralExtractor, there is still a long road ahead on the DOM-tree side, and much improvement and extension remain to be made.

The classification of Web pages is another hot topic in search engines. It builds on text classification, and several kinds of methods exist for classifying Web pages, such as probabilistic algorithms, relational learning methods, and SVMs (support vector machines). This paper offers some observations on Web page classification: beyond keywords, HTML tag content that can distinguish pages, such as the title, the page description, and the hyperlinks, should be considered when analyzing the pages (sketched below). Taking the keywords and the useful tag content together would make Web page classification more efficient. DOM-tree-based content extraction could likewise be improved further, for instance by broadening its user options or refining its extraction granularity.

DOM-tree-based content extraction can be applied in many other fields, such as extracting title information or image information. DOM-tree technology is becoming more and more mature, and more and more researchers use it to analyze Web information.
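The abstract gives only the entropy formula, not the per-node statistic. One assumed formulation, sketched below, records each DOM path's text across many pages of one site and computes the Shannon entropy of the observed texts; a path whose text barely varies across pages (entropy near zero) behaves like site-wide boilerplate. The path encoding and the threshold are illustrative choices, not the thesis's.

    import math
    from collections import Counter
    from bs4 import BeautifulSoup

    def path_texts(html):
        # Map each element's tag path (html/body/div/...) to its text.
        soup = BeautifulSoup(html, "html.parser")
        out = {}
        for tag in soup.find_all(True):
            names = [p.name for p in reversed(list(tag.parents))
                     if p.name and p.name != "[document]"]
            path = "/".join(names + [tag.name])
            out.setdefault(path, []).append(tag.get_text(" ", strip=True))
        return {p: " ".join(t) for p, t in out.items()}

    def entropy(values):
        # Shannon entropy H = -sum(p_i * log2(p_i)) over observed values.
        counts = Counter(values)
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    def noisy_paths(pages_html, max_entropy=0.5):
        seen = {}
        for html in pages_html:
            for path, text in path_texts(html).items():
                seen.setdefault(path, []).append(text)
        # Paths whose text barely varies across pages look like boilerplate.
        return {p for p, texts in seen.items()
                if len(texts) > 1 and entropy(texts) <= max_entropy}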
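The suggestion to combine keywords with discriminative tag content could be realized as in the sketch below, which folds the title, meta description, and link text into the document representation with extra weight before training an SVM. The weight of 3 and the scikit-learn pipeline are assumptions made here for illustration.

    from bs4 import BeautifulSoup
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def page_features(html, tag_weight=3):
        soup = BeautifulSoup(html, "html.parser")
        body = soup.get_text(" ", strip=True)
        title = soup.title.get_text(strip=True) if soup.title else ""
        meta = soup.find("meta", attrs={"name": "description"})
        desc = meta.get("content", "") if meta else ""
        links = " ".join(a.get_text(" ", strip=True) for a in soup.find_all("a"))
        tag_text = " ".join([title, desc, links])
        # Repeat the tag text so tf-idf weights it above plain body text.
        return body + (" " + tag_text) * tag_weight

    def train_page_classifier(pages_html, labels):
        docs = [page_features(h) for h in pages_html]
        model = make_pipeline(TfidfVectorizer(), LinearSVC())
        model.fit(docs, labels)
        return model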
Keywords/Search Tags: Information