Research And Realization Of Web Information Extraction Based On DOM

Posted on:2009-08-07

Degree:Master

Type:Thesis

Country:China

Candidate:M Li

Full Text:PDF

GTID:2178360272970449

Subject:Control theory and control engineering

Abstract/Summary:

With the rapid development of the Internet, the Web has become an important channel for spreading and sharing information worldwide. With the explosive growth of data, however, it is increasingly difficult for users to find the information they are interested in. How to extract useful information from the Web has therefore become a research focus. Various information extraction methods have been proposed at home and abroad in recent years, and on the whole they achieve good results. However, defects remain, such as the need for an excessive number of sample pages and a heavy workload.

To address these shortcomings, a semi-automatic method for Web information extraction is proposed. The main content is as follows.

First, a method combining URL comparison with the Simple_Tree_Matching algorithm is used to solve the problem of acquiring similar pages. In the first step, a Web crawler obtains hyperlinks. The hyperlinks are then filtered by URL comparison, and only those that satisfy the matching condition are kept. In the last step, the Simple_Tree_Matching algorithm filters the remaining hyperlinks, yielding the final set of similar pages. Page similarity is thus measured by both the URL and the concrete page structure, which makes up for the weakness of pure URL comparison.

Second, a DOM-based method is proposed that extracts the target information by comparing the characteristics of data items. The sample page is first parsed, and all characteristics of the data items of interest are saved. When a test page is input, the characteristics of the data items labeled by the user are compared with those of the data items in the test page, and the most similar items in the test page are extracted as the result. Compared with the traditional DOM-based method, this improves adaptability to changes in Web page structure.

Third, a detection strategy is used to extract pages with multiple records. Similarity matrices are calculated between the labeled records and the records in the test page. From the changes in these matrices, the boundaries between records can be discovered and all records extracted, which reduces the difficulty of extraction.

Finally, based on the above analysis, a DOM-based Web information extraction system is designed and implemented. The system provides a fully visual, interactive user interface that is easy to operate, and it completes the extraction task by combining different function modules.

Experiments on the IMDB, RISE and EXALG datasets show that, when trained on a single page, the proposed method extracts data from Web pages effectively. It still performs well even when some pages are missing items.
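
The abstract names the Simple_Tree_Matching algorithm as the structural filter for similar pages but gives no details. As an illustration only, the following minimal Python sketch shows the commonly described dynamic-programming form of Simple Tree Matching, which counts the largest number of matching node pairs between two ordered trees. The `Node` class, the `tree_size` helper and the normalization by average tree size are assumptions made for this example; the thesis may use a different tree representation, normalization or threshold.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Number of nodes in the tree rooted at `node`."""
    return 1 + sum(tree_size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the maximum matching between two ordered trees.

    Roots with different tags do not match at all. Otherwise the children
    are aligned in order by dynamic programming, and the root pair adds 1.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # best[i][j]: best matching between the first i children of a
    # and the first j children of b.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            best[i][j] = max(best[i][j - 1],
                             best[i - 1][j],
                             best[i - 1][j - 1] + w)
    return best[m][n] + 1


def normalized_similarity(a: Node, b: Node) -> float:
    """Matching size divided by the average tree size (an assumed choice)."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2)


if __name__ == "__main__":
    page_a = Node("html", [Node("body", [
        Node("div", [Node("p"), Node("p")]),
        Node("div", [Node("p")]),
    ])])
    page_b = Node("html", [Node("body", [
        Node("div", [Node("p")]),
        Node("div", [Node("p")]),
        Node("span"),
    ])])
    print(simple_tree_matching(page_a, page_b))
    print(normalized_similarity(page_a, page_b))
```

In a pipeline like the one the abstract describes, a score of this kind would be computed only for pages whose URLs already passed the URL comparison step, and pages scoring above some chosen threshold would be kept as similar pages. The threshold value is not stated in the abstract.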

Keywords/Search Tags:

Web Information Extraction, DOM, Characteristic Comparison, Detection Strategy

Related items

1	Application And Comparison Of Information Extraction And Change Detection Methods In Saline And Alkaline Land
2	Research And Implementation Of Intelligent Comparison Shopping On Internet
3	Design And Implement Of Software Defects Detection System Based On Source Code Homology Detection Technology
4	Research And Application On Automatic Comparison Of Text
5	Research Of Content-Based Video Retrieval Techniques
6	Research On Extracting Key Technology Based On The Wireless Channel Characteristic
7	Design And Implementation Of Sensitive Information Monitoring System For Website
8	A Domain Knowledge-based Personalized Comparison Shopping System: Design And Implementation
9	The Comparison Strategies And Reconfiguration Of Ring Oscillator Physical Unclonable Functions
10	The Research And Comparison On Characteristic Extraction Method For 2-D Skull Based On CT Image