Web Information Extracting Based On Tree Edit Distance

Posted on:2016-02-17

Degree:Master

Type:Thesis

Country:China

Candidate:X B Liu

Full Text:PDF

GTID:2348330536955065

Subject:Computer technology

Abstract/Summary:

With the rapid development of the Internet, online shopping has become widespread around the world. In China, online shopping malls such as Tmall and Jingdong have grown dramatically. Most pages of e-commerce websites contain product information displayed as lists, which makes it possible to extract large amounts of product information from web pages. This paper first reviews previously proposed algorithms and analyzes their advantages and disadvantages. It then proposes an algorithm for separating data records from such web pages, in order to facilitate data integration and provide value-added services. The algorithm consists of two steps: (1) finding the main data region; (2) dividing the data records. For step 1, we use heuristic methods combined with visual information to find the root of the subtrees that contain all the similar data records. For step 2, we propose a new tree edit distance clustering algorithm; we use this algorithm to reduce the number of candidate lists, and then give an equation to compute the similarity of each candidate. The candidate with the highest similarity is taken as the best dividing scheme. Finally, because some data records on a page may be missing attributes, this paper uses a center string alignment algorithm to restore the missing attributes and extracts the attribute values as the result. Experimental results on a large number of web pages from diverse domains show that the proposed method achieves high recall and precision.
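
To make the idea behind step 2 concrete, the following is a minimal sketch of clustering sibling subtrees by a tree edit distance. It is not the thesis's algorithm or its similarity equation, neither of which is given in the abstract. It assumes a simple top-down (Selkow-style) edit distance with unit costs, a normalized similarity of 1 - d / (size(a) + size(b)), and a greedy threshold clustering. The names node, size, top_down_distance, similarity and cluster, and the threshold of 0.8, are all hypothetical.

```python
from functools import lru_cache


def node(tag, *children):
    """Build an ordered tree as a hashable tuple: (tag, (child, ...))."""
    return (tag, tuple(children))


@lru_cache(maxsize=None)
def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def top_down_distance(a, b):
    """Selkow-style top-down edit distance (assumed unit costs).

    Roots are always matched (cost 1 if their tags differ); whole child
    subtrees are inserted or deleted at a cost equal to their size.
    """
    relabel = 0 if a[0] == b[0] else 1
    xs, ys = a[1], b[1]
    # dp[i][j]: cost of turning the first i children of a into the first j of b
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),  # delete subtree from a
                dp[i][j - 1] + size(ys[j - 1]),  # insert subtree from b
                dp[i - 1][j - 1] + top_down_distance(xs[i - 1], ys[j - 1]),
            )
    return relabel + dp[len(xs)][len(ys)]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical trees, lower when they differ."""
    return 1.0 - top_down_distance(a, b) / (size(a) + size(b))


def cluster(subtrees, threshold=0.8):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for tree in subtrees:
        for members in clusters:
            if similarity(members[0], tree) >= threshold:
                members.append(tree)
                break
        else:
            clusters.append([tree])
    return clusters


# Hypothetical product-list children: two complete records, one record
# missing an attribute, and one unrelated block.
full = node("li", node("img"), node("a"), node("span"))
partial = node("li", node("img"), node("a"))
ad = node("div", node("p"))
groups = cluster([full, partial, ad, full])
```

In this sketch the record with a missing attribute still lands in the same cluster as the complete records, while the unrelated block forms its own cluster. A general tree edit distance (for example Zhang-Shasha) could be substituted for top_down_distance without changing the clustering step.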

Keywords/Search Tags:

web information extraction, tree edit distance, main data region, data records, cluster algorithm

Related items

1	The Research Of Semi-structured Web Pages Information Extraction
2	Research On Automatic Web Information Extraction Technique
3	Study On ETL Technology Based On XML Data Resources
4	Storage Optimization And Tree Vertical Merging Algorithm Of Tai Tree Editing Distance Algorithm
5	Table Information Extraction Based On Web Structure
6	Data Cleaning Algorithm And Applications
7	Research On Information Extraction Method For Retrieval Result Pages Of Oa Journals
8	Research On Information Extraction Method For Retrieval Result Pages Of OA Journals
9	Research Of Web Information Extraction Technology Based On Tree Structure
10	Algorithms Based On Visual Similarity Of The Research In Information Extraction And Implementation