Research & Application Of Web Similarity Based On DOM Tree

Posted on: 2012-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: R X Zhang
GTID: 2178330335454632
Subject: Management Science and Engineering
Abstract/Summary:
With the rapid growth of web information sources, selecting the data we need from large data sets has become a challenging problem. Traditional web information extraction tools rely on text matching and cannot make precise comparisons or selections. A more precise and effective approach extracts information by mining the structural features of web pages: we measure the similarity between the target information and sample information, and then use that similarity to identify the correct items.

In general, approaches to computing web similarity based on the DOM include methods based on node statistical features, root-to-leaf chain matching, minimal edit distance, simple tree matching, and so on. However, each of these methods has shortcomings. Node statistics are not systematic, chain matching is fragmented, minimal edit distance does not account for hierarchy, and simple tree matching is strict about order. They are poorly suited to DOM information and run slowly.

To solve these problems, this thesis proposes a new DOM parsing method, algorithms for computing web similarity based on the DOM tree, and algorithms for extracting web information based on that similarity. The detailed research work is as follows.

(1) DOM parsing algorithm based on extracting partial data
Parsing the DOM tree is both the basis for computing web similarity and the prerequisite for extracting web information. This thesis proposes two DOM parsing algorithms. One parses the DOM in normal order based on extracting partial data, and the other parses the DOM in reverse order. Both can parse the DOM of most web pages.

(2) Algorithms for web structural similarity based on the DOM tree
Web structural similarity can measure the similarity between two web pages, and it can also quantify the similarity between pieces of information in different parts of a single page. In this way we extract the target information. Unlike traditional methods, this thesis proposes two algorithms for measuring web similarity. One is a recursive algorithm based on optimal free matching for child trees, and the other is based on a path-compressed tree.

(3) Web main text extraction based on web structural similarity
In general, the parts of the main text within one web page are similar in structure, and this similarity suggests a way to extract the main text. We identify one or two parts of the main text using text features, find the other parts through their structural similarity, and then extract all of them. This thesis uses the two similarity algorithms above to extract the main text of web pages.
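
As an illustration of one baseline named in the abstract, the following is a minimal sketch of simple tree matching, the method described there as strict in order. It is not the thesis's proposed algorithm. The `Node` class, the function names, and the normalization (twice the match count divided by the combined tree sizes) are assumptions chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the largest order-preserving match between trees a and b."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them can be matched
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best match using the first i children of a and first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 counts the matched roots


def similarity(a: Node, b: Node) -> float:
    """Assumed normalization: 1.0 for identical trees, 0.0 for different roots."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


# Two pages with the same blocks in a different order.
t1 = Node("div", [Node("ul", [Node("li"), Node("li")]), Node("p")])
t2 = Node("div", [Node("p"), Node("ul", [Node("li"), Node("li")])])

print(similarity(t1, t1))  # identical trees
print(similarity(t1, t2))  # same content, reordered children
```

Because the dynamic program only pairs children in their original order, swapping the `ul` and `p` blocks lowers the score even though both trees contain the same subtrees. This is the kind of order sensitivity that an order-free matching of child trees is meant to avoid.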
Keywords/Search Tags: Parsing DOM tree, Optimal free matching for child trees, Structural similarity, Web information extraction