
Web Information Extracting Based On Tree Edit Distance

Posted on: 2016-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: X B Liu
GTID: 2348330536955065
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, online shopping has become widespread around the world. In China, online shopping malls such as Tmall and Jingdong have grown dramatically. Most pages on e-commerce websites present product information as lists, which makes it possible to extract large amounts of product information from web pages. This thesis first reviews previously proposed algorithms and analyzes their advantages and disadvantages. It then proposes an algorithm for separating data records from such pages, in order to facilitate data integration and to provide value-added services. The algorithm consists of two steps: (1) finding the main data region; (2) dividing it into data records. For step 1, heuristic methods are combined with visual information to find the root of the subtrees that contain all the similar data records. For step 2, a new clustering algorithm based on tree edit distance is proposed. It is used to reduce the number of candidate lists, and an equation is then given to compute the similarity of each remaining candidate. The candidate with the highest similarity is chosen as the best dividing scheme. Finally, because some data records on a page may be missing attributes, the thesis uses the center string alignment algorithm to fill in the missing attributes and extracts the attribute values as the result. Experiments on a large number of web pages from diverse domains show that the proposed method achieves high recall and precision.
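To make the idea of comparing candidate data records by tree edit distance concrete, here is a minimal Python sketch. It is not the clustering algorithm or the similarity equation from the thesis, which the abstract does not specify. It assumes a simplified top-down (Selkow-style) edit distance with unit costs, where a whole subtree is inserted or deleted at the cost of its node count. The names `Node`, `top_down_distance` and `similarity`, and the normalization used, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A DOM-like node: a tag name plus ordered children (hypothetical type)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def top_down_distance(a: Node, b: Node) -> int:
    """Simplified top-down tree edit distance (assumed unit costs).

    Relabeling a node costs 1; inserting or deleting a whole subtree
    costs its node count. Children are aligned in order with a
    string-edit-distance style dynamic program.
    """
    root_cost = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ca, cb = a.children[i - 1], b.children[j - 1]
            d[i][j] = min(
                d[i - 1][j] + size(ca),                       # delete subtree ca
                d[i][j - 1] + size(cb),                       # insert subtree cb
                d[i - 1][j - 1] + top_down_distance(ca, cb),  # match ca with cb
            )
    return root_cost + d[m][n]


def similarity(a: Node, b: Node) -> float:
    """Map the distance to a score in (0, 1]; this normalization is an assumption."""
    return 1.0 - top_down_distance(a, b) / (size(a) + size(b))


# Two product records that differ by one missing attribute node.
full_record = Node("li", [Node("img"), Node("a"), Node("span")])
short_record = Node("li", [Node("img"), Node("a")])
other_block = Node("div", [Node("p")])

print(top_down_distance(full_record, short_record))
print(similarity(full_record, short_record))
print(top_down_distance(Node("li", [Node("img")]), other_block))
```

Under these assumptions, records that share a template score close to 1 even when one of them lacks an attribute, while structurally unrelated blocks score lower. That is the property a similarity-based choice among candidate dividing schemes relies on.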
Keywords/Search Tags: web information extraction, tree edit distance, main data region, data records, cluster algorithm