Research & Application Of Web Similarity Based On DOM Tree

Posted on:2012-09-18

Degree:Master

Type:Thesis

Country:China

Candidate:R X Zhang

Full Text:PDF

GTID:2178330335454632

Subject:Management Science and Engineering

Abstract/Summary:

With the rapid growth of web information sources, selecting the data we need from large data sets has become a challenging problem. Traditional web information extraction tools rely on text matching and cannot compare or select content precisely. A more precise and effective approach extracts information by mining the structural features of web pages: we measure the similarity between the target information and sample information, and then identify the correct items by that similarity.

Existing theories for computing web similarity based on the DOM generally include node statistical features, root-to-leaf chain matching, minimal edit distance, simple tree matching, and so on. However, each of these methods has flaws. Node statistics are not systematic, chain matching is fragmented, minimal edit distance lacks an understanding of hierarchy, and simple tree matching is strict about child order. They are poorly suited to DOM information and run slowly.

To solve these problems, this thesis proposes a new DOM parsing method, algorithms for calculating web similarity based on the DOM tree, and algorithms for extracting web information based on that similarity. The detailed research work is as follows.

(1) DOM parsing algorithm based on extracting partial data

Parsing the DOM tree is both the basis for calculating web similarity and the prerequisite for extracting web information. This thesis proposes two DOM parsing algorithms. One parses the DOM in normal order based on extracting partial data, and the other parses the DOM in reverse order. Both can parse the DOM of most web pages.

(2) Algorithms for web structural similarity based on the DOM tree

Web structural similarity can measure the similarity between two web pages, and it can also quantify the similarity between pieces of information in different parts of a single page. In this way we can extract the target information. Unlike traditional methods, this thesis proposes two algorithms for measuring web similarity. One is a recursive algorithm based on optimal free matching of child trees, and the other is based on a path-compressed tree.

(3) Main text extraction based on web structural similarity

In general, the parts of the main text within one web page are similar in structure, and this similarity suggests a way to extract the main text. We identify one or two parts of the main text by certain text features, find the remaining parts by their structural similarity to those parts, and then extract all of them. This thesis uses the two similarity algorithms above to extract the main text of web pages.
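To make the order constraint mentioned above concrete, the following minimal Python sketch contrasts simple tree matching, one of the baseline methods the abstract names, with an order-free variant. It is an illustration only and not the thesis's algorithm: the abstract does not give the details of "optimal free matching", so the order-free function simply assumes that children may be paired in any order and searches for the best pairing. The names `Node`, `ordered_match`, `unordered_match`, and `similarity` are hypothetical, and normalising by the combined tree size is likewise an assumption.

```python
from dataclasses import dataclass, field
from functools import lru_cache


@dataclass
class Node:
    """A DOM element reduced to its tag name and child elements."""
    tag: str
    children: list["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)


def ordered_match(a: Node, b: Node) -> int:
    """Simple tree matching: size of the largest order-preserving mapping."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j]: best score using the first i children of a and first j of b
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = ordered_match(a.children[i - 1], b.children[j - 1])
            table[i][j] = max(table[i - 1][j], table[i][j - 1],
                              table[i - 1][j - 1] + pair)
    return 1 + table[m][n]


def unordered_match(a: Node, b: Node) -> int:
    """Order-free variant (an assumption): children may pair in any order."""
    if a.tag != b.tag:
        return 0
    ka, kb = a.children, b.children
    score = [[unordered_match(x, y) for y in kb] for x in ka]

    @lru_cache(maxsize=None)
    def best(i: int, used: int) -> int:
        # Best total for children i.. of a, given the bitmask of used b children.
        if i == len(ka):
            return 0
        result = best(i + 1, used)  # leave child i unmatched
        for j in range(len(kb)):
            if not used & (1 << j):
                result = max(result, score[i][j] + best(i + 1, used | (1 << j)))
        return result

    return 1 + best(0, 0)


def similarity(a: Node, b: Node, match=ordered_match) -> float:
    """Normalise a match count to [0, 1] by the combined tree size."""
    return 2 * match(a, b) / (size(a) + size(b))


# Two fragments with the same blocks in a different order.
page_a = Node("div", [Node("h1"), Node("p"), Node("ul", [Node("li"), Node("li")])])
page_b = Node("div", [Node("ul", [Node("li"), Node("li")]), Node("h1"), Node("p")])

print(similarity(page_a, page_b))                   # ordered: about 0.667
print(similarity(page_a, page_b, unordered_match))  # order-free: 1.0
```

The two fragments contain the same blocks, yet the ordered score drops because the list has moved in front of the heading. The subset search used here is exponential in the number of children and is meant for small examples only; a practical order-free matcher would more likely use a polynomial-time assignment method.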

Keywords/Search Tags:

Parsing DOM tree, Optimal free matching of child trees, Structural similarity, Web information extraction

Related items

1	Research And Application Of Web Data Extraction Mode Based On Tree Structure
2	Extracting Optimal Explanations For Ensemble Trees VIA Logical Reasoning
3	A Method For Extracting The Topic Information In Webpages Based On The DIV Tag-Trees
4	The Research On Frequent Subtrees Mining And Corresponding Techniques
5	Research On Mongolian Dependency Parsing Based On The Conversion Of Chinese-Mongolian Dependency Parsing Tree
6	Extracting Trees From Lidar Data In Urban Region
7	Research On Key Technologies For 3D Reconstruction Of Fruit Tree's Stems
8	Models for improved tractability and accuracy in dependency parsing
9	The Study On Data Augmentation In Chinese Parsing
10	Dependency Syntax Analysis Based On Question Answering System