Research And Realization Of Web Information Extraction Based On DOM

Posted on: 2009-08-07
Degree: Master
Type: Thesis
Country: China
Candidate: M Li
Full Text: PDF
GTID: 2178360272970449
Subject: Control theory and control engineering
Abstract/Summary:
With the rapid development of the Internet, the Web has become an important channel for spreading and sharing information worldwide. With the explosive growth of data, however, it is increasingly difficult for users to find the information they are interested in, so extracting useful information from the Web has become a research focus. Various information extraction methods have been proposed both domestically and internationally in recent years. These methods work well on the whole, but drawbacks remain, such as the need for many sample pages and a heavy workload. To address these shortcomings, a semi-automatic method for Web information extraction is proposed. The main content is as follows.

First, URL comparison is combined with the Simple_Tree_Matching algorithm to solve the problem of acquiring similar pages. A Web crawler first collects hyperlinks. The hyperlinks are then filtered by URL comparison, and only those that satisfy the matching condition are kept. Finally, the Simple_Tree_Matching algorithm filters the remaining hyperlinks, which yields the final set of similar pages. Page similarity is therefore measured on both the URL and the concrete page structure, which compensates for the weakness of pure URL comparison.

Second, a DOM-based method is proposed that extracts information by comparing the characteristics of data items. The sample page is parsed first, and all characteristics of the data items of interest are saved. When a test page is input, the characteristics of the data items labeled by the user are compared with those of the data items in the test page, and the most similar items in the test page are extracted as the result. Compared with the traditional DOM-based method, this approach adapts better to changes in Web page structure.

Third, a detection strategy is used to extract data from pages that contain multiple records. Similarity matrices are calculated between the labeled records and the records in the test page. From the changes in these matrices, the method discovers the boundaries between records and then extracts all the records, which reduces the difficulty of extraction.

Finally, based on the above analysis, a DOM-based Web information extraction system is designed and implemented. The system provides a fully visual, interactive user interface that is easy to operate, and it completes the extraction task by combining different functional modules.

Experiments on the IMDB, RISE and EXALG datasets show that the proposed method can extract data from Web pages effectively when it is trained on a single page. It still performs well even when some pages are missing items.
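The structural filtering step mentioned in the abstract can be illustrated with a minimal sketch of the widely known Simple Tree Matching recurrence. This is not the thesis's own code: the Node class, the function names, and the normalization by average tree size are assumptions made for illustration, and the abstract does not state which matching condition or threshold the author used.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Count the nodes in the subtree rooted at `node`."""
    return 1 + sum(tree_size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving matching."""
    # Trees whose roots carry different tags are not matched at all.
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # best[i][j] holds the best matching between the first i children of `a`
    # and the first j children of `b` (dynamic programming over child lists).
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    # Add 1 for the matched pair of roots.
    return best[m][n] + 1


def structural_similarity(a: Node, b: Node) -> float:
    """Assumed normalization: matched pairs divided by the average tree size."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2)
```

Under these assumptions, a candidate page would be kept as a similar page when structural_similarity against the sample page exceeds some chosen threshold. The threshold value is not given in the abstract and would have to be tuned.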
Keywords/Search Tags: Web Information Extraction, DOM, Characteristic Comparison, Detection Strategy