| With the rapid development of the Internet, it plays an increasingly important role in daily life and has become an essential part of it. This growth has brought an explosion of information, and the HTML structure that serves as the main carrier of Web information has become more and more complex. However, information on the Web is semi-structured or unstructured, mostly in the form of HTML, and cannot be used directly for analysis. By degree of automation, Web information extraction can be divided into manual, semi-automatic, and automatic extraction. Generally speaking, automatic extraction proceeds as follows: 1) convert the pages into DOM structures and cluster similar pages based on those structures; 2) take two similar pages, one as the sample and the other as the wrapper page, and generate the corresponding wrapper; 3) complete the extraction through the wrapper. In practice, however, the structure of web pages changes quickly, DOM-based clustering suffers from accuracy problems, and wrappers often fail when the levels of the DOM structure change. In the proposed method, we convert the script code embedded in a web page into a Controlling Code Model (CCM) tree, assign different weights to the script nodes according to their importance, and then compute the tree edit distance with a dynamic programming algorithm. 
This paper makes the following contributions. First, we put forward the definition and concept of the CCM tree for the first time; compared with traditional methods based on the DOM tree, the CCM tree reflects the features of similar web pages more accurately. Second, although computing the edit distance between unordered trees has been proved to be NP-complete, this paper computes the tree edit distance in polynomial time by exploiting the characteristics of the CCM tree. In the experiment, we select 10 websites whose Google PageRank is up to 6 as our data sources; they cover commercial websites, web portals, non-profit sites, and so on. The experimental results show that the proposed method outperforms traditional methods in terms of time cost, accuracy, and robustness. |
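
To make the idea of a weighted tree edit distance computed by dynamic programming more concrete, here is a minimal sketch. It is not the paper's algorithm: it assumes an ordered, top-down edit distance in which whole subtrees are inserted or deleted, and the names `Node`, `subtree_cost`, and `tree_distance`, the node labels, the weights, and the relabel cost are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A script node in a hypothetical CCM-like tree."""
    label: str
    weight: float = 1.0  # assumed importance of this node
    children: List["Node"] = field(default_factory=list)


def subtree_cost(node: Node) -> float:
    """Cost of inserting or deleting a whole subtree: the sum of its weights."""
    return node.weight + sum(subtree_cost(c) for c in node.children)


def tree_distance(a: Node, b: Node) -> float:
    """Top-down ordered tree edit distance with weighted nodes.

    Assumption: relabeling a node costs the larger of the two weights.
    """
    root_cost = 0.0 if a.label == b.label else max(a.weight, b.weight)
    m, n = len(a.children), len(b.children)

    # dp[i][j] = cost of aligning the first i children of a
    # with the first j children of b.
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + subtree_cost(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + subtree_cost(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + subtree_cost(a.children[i - 1]),  # delete subtree
                dp[i][j - 1] + subtree_cost(b.children[j - 1]),  # insert subtree
                dp[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),  # match
            )
    return root_cost + dp[m][n]


# Two hypothetical pages whose scripts differ only in the loop construct.
page_a = Node("script", 1.0, [Node("if", 2.0, [Node("call", 1.0)]), Node("for", 2.0)])
page_b = Node("script", 1.0, [Node("if", 2.0, [Node("call", 1.0)]), Node("while", 2.0)])

print(tree_distance(page_a, page_b))
```

A smaller distance indicates more similar script structure, so a threshold on this value could serve as a clustering criterion. Treating children as ordered keeps the computation polynomial, whereas the general unordered problem does not admit this simple recurrence.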