Research On Automated Web Navigation And Data Integration Rules For Web Information Extraction

Posted on: 2015-09-02
Degree: Master
Type: Thesis
Country: China
Candidate: H T Wang
GTID: 2308330482978885
Subject: Computer technology
Abstract/Summary:
With the development of the Internet, the Web has become the world’s largest information source, and web pages hold a huge amount of useful data. Extracting that data accurately and efficiently is an important problem for web data applications, and web information extraction is the research field that addresses it. A great deal of related work has been done in the past decade, and many automated and semi-automated web information extraction techniques and methods have been proposed. Most of them, however, focus on automated analysis and data extraction for pages that contain similar data records, and they ignore or simplify the web navigation phase and the data integration phase that follows extraction. In addition, most existing work on web navigation takes the form of relatively independent browsing tools. On the one hand, these tools are not combined with data extraction; on the other hand, they only replay the user’s navigation process once and cannot provide the required workflow control. Hence they cannot handle real web information extraction tasks.

For this reason, this paper studies automated web navigation techniques for web information extraction. Its major contributions are as follows.

First, to address the shortcomings of previous work, this paper studies a model and rule system for web information extraction that supports automated web navigation, data extraction and data integration. The proposed model and rule system can describe these three typical phases of web information extraction.

Second, this paper studies the navigation model and method in web information extraction and designs a web navigation rule language to describe the user’s interactions in the browser. The language supports a variety of navigation actions performed in both regular and AJAX pages, and it satisfies the requirement of parameterization (such as dynamically replacing the values in a form).
Furthermore, the language can describe the link relations among web pages.

Third, this paper studies the model and method of data integration and designs a data integration rule language. The integration rules map source data to the target data structure. More importantly, a complex data record may be spread over several related web pages, so the system needs to navigate these related pages automatically and maintain the correct data relationships between them in order to complete the data extraction and integration process.

Fourth, to provide automatic workflow control over web navigation, data extraction and integration (for example, submitting different search words on the same search page to obtain different record pages), this paper also studies and designs a workflow control language for web information extraction. The language simplifies existing control languages and is easy to generate, while still providing a degree of complex logic control.

Fifth, based on the above models and rule system, this paper designs and implements an execution engine for web navigation, data integration and process control. A prototype system is designed and implemented to verify the effectiveness of the proposed models, techniques and methods. With the prototype system, users can generate the rules for web navigation, data extraction and integration, as well as the workflow control logic. At runtime, the system pre-compiles these rules, generates Java code, and then automatically executes the data extraction task. Moreover, this paper designs a reliable and efficient algorithm for generating the XPath expressions that identify navigation elements.

To evaluate the prototype system, we first tested its navigation module in both the recording and execution phases on a wide range of real web sites. Second, we compared our navigation approach with several widely used web automation tools and obtained very good results.
Finally, practical extraction examples were carried out to test and verify the whole process of web information extraction. Experimental results show that the proposed rule languages and the implemented system can effectively complete web page navigation and data extraction.
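
The abstract mentions an algorithm that generates XPath expressions for navigation elements but does not describe it. Purely as an illustration of the general idea, the Python sketch below uses one common heuristic: anchor the path at the nearest ancestor that has an `id` attribute, and use positional steps below that anchor. The function name `build_xpath`, the sample page `PAGE` and the heuristic itself are assumptions made for this example and are not taken from the thesis.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, used only to exercise the sketch.
PAGE = """
<html><body>
  <div id="results">
    <ul>
      <li><a href="/item/1">first</a></li>
      <li><a href="/item/2">second</a></li>
    </ul>
  </div>
</body></html>
"""


def build_xpath(root, target):
    """Build an XPath for `target` inside the tree rooted at `root`.

    Assumed heuristic (not the thesis's algorithm): walk upwards from the
    target, emit a positional step for each element, and stop at the first
    element that carries an `id`, since an id is usually a stable anchor.
    """
    # ElementTree elements have no parent pointer, so build a lookup table.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    steps = []
    node = target
    while node is not None:
        node_id = node.get("id")
        if node_id:
            # Anchor found: match it anywhere in the document and stop.
            steps.append(f"//{node.tag}[@id='{node_id}']")
            break
        parent = parent_of.get(node)
        if parent is None:
            # Reached the document root without finding an id.
            steps.append(f"/{node.tag}")
        else:
            # XPath positions are 1-based and count same-tag siblings only.
            same_tag = [child for child in parent if child.tag == node.tag]
            steps.append(f"/{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "".join(reversed(steps))


if __name__ == "__main__":
    tree_root = ET.fromstring(PAGE.strip())
    second_link = tree_root.findall(".//a")[1]
    # prints //div[@id='results']/ul[1]/li[2]/a[1]
    print(build_xpath(tree_root, second_link))
```

Anchoring on an `id` keeps the expression short and less sensitive to layout changes above the anchor, while the positional steps below it remain brittle if sibling order changes. A recorder for real pages would likely need further fallbacks (for example other attributes or link text), which this sketch does not attempt.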
Keywords/Search Tags: accurate web information extraction, deep web, web navigation, data integration, rule language