The Research Of Semi-structured Web Pages Information Extraction

Posted on: 2012-08-01
Degree: Master
Type: Thesis
Country: China
Candidate: M L Zhu
Full Text: PDF
GTID: 2178330338493796
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, the Web has become an important platform for publishing and accessing information. The number of resources served by dynamically generated Web sites backed by underlying databases, such as product listing pages, is growing rapidly. These pages are generally obtained through a particular query interface and are difficult for other applications to use directly, so extracting information from them has become very important. This thesis studies how to extract information from such semi-structured pages.

The thesis introduces the concept of Web information extraction, its main problems, and related technologies, and analyzes existing Web information extraction algorithms together with their advantages and disadvantages. Because existing methods cannot locate the main data region very accurately, three heuristics (highest fanout, largest size increase, and largest tag count) are combined to find the main data region. When dividing data records, existing methods need to compute the similarity of subtrees, and these algorithms are not very efficient. To address this, a tree edit distance clustering algorithm is proposed; the clustering reduces the number of subtree comparisons and thereby improves efficiency. At the same time, using the tree edit distance to represent subtree similarity gives the algorithm higher accuracy. Clustering yields the candidate segmentation lists, and a formula is given for choosing the best segmentation scheme among them. A master alignment algorithm is then used to extract the attributes of the data records.

Experiments show that the method has a higher degree of automation and higher efficiency, and tests on real Web pages show that it achieves higher accuracy.
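
The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the general idea: measure subtree similarity with a tree edit distance and compare each subtree against one representative per cluster, rather than against every other subtree. Everything in it is an assumption made for illustration: the tuple-based tree encoding, the simplified top-down (Selkow-style) distance, the normalization by combined tree size, the leader-style clustering, and the threshold value of 0.3. None of it is taken from the thesis.

```python
# A tree is encoded as (tag, (child, child, ...)); this encoding is hypothetical.

def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])

def tree_distance(a, b):
    """Simplified top-down tree edit distance (Selkow-style).

    Relabeling a node costs 1; inserting or deleting a whole subtree
    costs its node count. Children are aligned in order with a standard
    sequence edit-distance table.
    """
    root_cost = 0 if a[0] == b[0] else 1
    ca, cb = a[1], b[1]
    m, n = len(ca), len(cb)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(ca[i - 1])      # delete child subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(cb[j - 1])      # insert child subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(ca[i - 1]),
                d[i][j - 1] + size(cb[j - 1]),
                d[i - 1][j - 1] + tree_distance(ca[i - 1], cb[j - 1]),
            )
    return root_cost + d[m][n]

def normalized_distance(a, b):
    """Distance scaled by combined size, so it falls between 0 and 1."""
    return tree_distance(a, b) / (size(a) + size(b))

def cluster_subtrees(subtrees, threshold=0.3):
    """Leader-style clustering: compare each subtree only against the
    first member (representative) of each existing cluster."""
    clusters = []  # list of (representative, [member indices])
    for index, tree in enumerate(subtrees):
        for representative, members in clusters:
            if normalized_distance(representative, tree) <= threshold:
                members.append(index)
                break
        else:
            clusters.append((tree, [index]))
    return [members for _, members in clusters]

# Hypothetical sibling subtrees from a product listing page.
item_a = ("li", (("a", ()), ("span", ())))
item_b = ("li", (("a", ()), ("span", ()), ("b", ())))
banner = ("div", (("img", ()),))

groups = cluster_subtrees([item_a, item_b, banner, item_a])
```

With this sketch, the two list items that differ by one extra tag fall into the same cluster, while the banner forms its own. Comparing against representatives only is one plausible way to reduce the number of subtree comparisons; the thesis may use a different scheme.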
Keywords/Search Tags: Web page information extraction, data records, tree edit distance, clustering algorithm