Research On The Web Structure Data Extraction Based On The Browser And Its Implementation

Posted on: 2011-01-09
Degree: Master
Type: Thesis
Country: China
Candidate: L L Fu
Full Text: PDF
GTID: 2178360305954660
Subject: Computer software and theory
Abstract/Summary:
With the development of Internet technology, the information on the Internet is growing rapidly. The Internet has become a dynamic, globally distributed information server that holds all kinds of information and resources and provides a variety of services to users and enterprises. Large amounts of data are queried from databases and displayed on web pages through a template; such data are generally referred to as structured data or records. Many researchers have studied how to extract structured data from the web, for example with extraction based on natural language processing or on the DOM tree of the web page. These approaches work on a single page, however, and have several shortcomings: 1) information on the same topic may be spread across multiple pages, so extraction must cover multiple pages and the results must be integrated afterwards to produce a complete record; 2) web crawlers have limited ability to reach the deep web; 3) the methods have limited ability to handle JavaScript and AJAX. This paper presents a browser-based extraction method that helps to solve these problems. The method combines more kinds of information and offers different positioning strategies for the user to select according to the context of the extraction.

The paper proposes the following ideas to solve the extraction problems:
1. Provide an interactive, visual tool for generating extraction rules. The user needs only a few interactions to generate extraction rules, which can then be applied to information on the same theme across the entire site. The tool provides a variety of alternative methods so that the user can select the appropriate one according to the context.
2. Provide an information location method that combines the DOM tree path, visual information and immutable text information.
The paper provides a description of the DOM tree path called EPath (Extraction Path), which contains the position of the node, the attributes of the node and its visual information. We also give the parsing algorithm. Unlike an XPath parser, which is a fail-fast method, it scores each located node according to its degree of match with the EPath information mentioned above. The algorithm handles structured data that share the same template but include optional data items.
3. Provide a technology based on browser navigation that supports form submission, repeated structure identification and next-page device identification, which addresses the restrictions on deep web extraction and on JavaScript and AJAX handling.
4. Define complex extraction instructions, which provide a DSL (Domain Specific Language) for the extraction field and make complex tasks easy to solve.
5. Build an extraction system that can be used as an information location tool in vertical search engines and information integration systems.

The extraction process can be described in the following phases:
1. Users use the extraction rule generation tool to demonstrate how to extract a record; the extraction rules are generated automatically and finally saved in XML format.
2. The extractor executes the extraction using the rules generated in the first phase and saves the extracted results to a file or database.

The information location algorithm is a key information extraction technology, so the paper describes it first. EPath gives a way to describe the location of the extracted information, which combines location, attribute and visual information. The extraction rule generation tool uses Firefox extension technology, and the EPath is generated automatically when the tool generates the extraction rules.
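
The scoring idea behind EPath matching can be illustrated with a minimal Python sketch. The abstract does not give the EPath format or the scoring formula, so everything here is an assumption made for illustration: the `Node` and `EPathStep` fields, the use of width and height as the visual information, the weights 0.3/0.5/0.2 and the 0.5 threshold. The point is only the contrast with a fail-fast path match: every candidate is scored, and the best one wins even if its position has shifted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A simplified DOM node (illustrative fields, not the thesis's model)."""
    tag: str
    index: int                                  # position among siblings
    attrs: Dict[str, str] = field(default_factory=dict)
    width: int = 0                              # stand-in for visual information
    height: int = 0


@dataclass
class EPathStep:
    """Hypothetical description of the target node saved in an extraction rule."""
    tag: str
    index: int
    attrs: Dict[str, str]
    width: int
    height: int


def score(node: Node, step: EPathStep) -> float:
    """Return a match score in [0, 1]; the weights are assumptions."""
    if node.tag != step.tag:
        return 0.0
    position = 1.0 if node.index == step.index else 0.0
    if step.attrs:
        matched = sum(1 for k, v in step.attrs.items() if node.attrs.get(k) == v)
        attributes = matched / len(step.attrs)
    else:
        attributes = 1.0
    visual = 1.0 if (node.width, node.height) == (step.width, step.height) else 0.0
    return 0.3 * position + 0.5 * attributes + 0.2 * visual


def locate(candidates: List[Node], step: EPathStep,
           threshold: float = 0.5) -> Optional[Node]:
    """Pick the best-scoring candidate instead of failing on the first mismatch."""
    best = max(candidates, key=lambda n: score(n, step), default=None)
    if best is None or score(best, step) < threshold:
        return None
    return best


# The rule was recorded on a page where the price was the third <span>.
step = EPathStep("span", 2, {"class": "price"}, 80, 20)

# On this page an optional item is missing, so the price is the second <span>.
page = [
    Node("span", 0, {"class": "title"}, 200, 20),
    Node("span", 1, {"class": "price"}, 80, 20),
]
found = locate(page, step)   # the price node, despite the shifted position
```

A strict positional path would look for the third `<span>` and find nothing on this page; the scored match still selects the price node because its attributes and visual size agree with the rule.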
At runtime, the extractor interprets the EPath and locates the target node. EPath handles structured data that share the same template but include optional data items, providing a robust method for locating web information.

The interactive extraction rule generation tool is an important part of the extraction system. We customize the Firefox browser through Firefox extension technology to make it suitable for interactive extraction, so that the interactive operation feels just like surfing the Internet, which makes the tool easy to use.

Identifying and submitting forms, identifying repeated structures and pagination devices, and the embedded browser technology form the foundation of extraction based on browser navigation. We use the information generated by the interactive extraction tool to identify and submit forms. The paper uses a structural similarity algorithm to identify repeated structures; the algorithm uses string edit distance to define the similarity. Pagination device identification uses a heuristic rule-based approach: we define four rules to identify different kinds of pagination devices. The extraction runtime uses an embedded Firefox browser to navigate and extract information, which keeps the use of extraction rules consistent with their generation and also gives it the ability to deal with JavaScript and AJAX. We define the extraction instructions and the logic instructions using EMF technology and XML format, which makes the instructions extensible and flexible. The extraction instructions define an extraction domain language, which gives strong support for extraction.

After introducing the algorithms and principles, we describe the structure of the extraction system and its key modules in chapters 5 and 6, and we also give code and comments for important parts of the system to make the description easier to follow. The extraction system contains two parts: the visual extraction rule generation tool and the extraction runtime.
The extraction rule generation tool uses Firefox extension technology, providing an interactive way to generate extraction rules. The tool can be divided into a basic services layer and an interactive UI layer: the basic services layer defines several XPCOM components, which provide basic services to the UI layer. The interactive UI layer provides a toolbar and a popup menu for extraction rule generation. The toolbar has several buttons such as load schema, watch model and save model; the popup menu is the primary means of interactive extraction, and all the extraction operations are defined in it. The extraction runtime uses the embedded browser, but we define an abstraction layer for browser and DOM operations and use the adapter design pattern to keep browser-specific APIs from polluting the extraction code. We can replace the browser with HtmlUnit for extraction based on a web page database, which avoids the rendering phase and improves performance.

The contributions and innovative work of the paper lie mainly in the following areas:
1. Provide an interactive, visual extraction rule generation tool based on Firefox extension technology. It reduces complexity and lowers the threshold for using the tool.
2. Provide an extraction algorithm based on EPath and immutable text. EPath provides a robust web node positioning method that combines the node's attributes, position information and visual information.
3. Provide a browser-based web page navigation extraction method, along with algorithms for identifying and submitting forms and for identifying repeated structures and pagination devices.
4. Define complex extraction instructions, which describe the operations of web information extraction, together with some logic instructions.
The instructions define a domain language for web information extraction.
5. Build a practical extraction system that solves problems in the extraction field such as handling JavaScript and AJAX, same-theme information spread across multiple pages, and deep web extraction.

In summary, the paper presents a browser-based web information extraction method and attempts to build a practical web information extraction tool, whose recall and precision are good enough for practical use. The system presented in this paper can be used as a web information location tool for vertical search engines and information integration.

Although some advances have been achieved in the domain of web information extraction, much work remains to be done because of the limited research time and the author's ability. More specifically, the following points need improvement and are worth further research:
1. The EPath location method depends heavily on the structural information of the web page. The page could first be partitioned into blocks, then the record block located, and finally a relative EPath used to locate the node within the block.
2. A more convenient extraction rule generation tool, with more ways of interacting.
3. Pagination recognition is based on heuristic rules, but the rules cannot cover all cases; selecting appropriate features and using a machine learning approach may help.
4. The similarity algorithm used to identify repeated structures could incorporate visual information such as block size and alignment, which may yield a more efficient and accurate method.
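
As an illustration of the edit-distance similarity used to identify repeated structures, a minimal Python sketch follows. The abstract states only that similarity is defined through string edit distance. The rest is assumed for the example: each sibling subtree is encoded as a flat list of tag names, similarity is normalized by the longer sequence, only consecutive siblings are compared, and the threshold is 0.7. The names `edit_distance`, `similarity` and `repeated_runs` are hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # delete x
                               current[j - 1] + 1,       # insert y
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Normalize the distance to [0, 1], where 1 means identical sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def repeated_runs(siblings: List[List[str]],
                  threshold: float = 0.7) -> List[List[int]]:
    """Group consecutive siblings whose tag sequences are similar enough."""
    runs: List[List[int]] = []
    current: List[int] = []
    for i, seq in enumerate(siblings):
        if current and similarity(siblings[current[-1]], seq) >= threshold:
            current.append(i)
        else:
            if len(current) > 1:
                runs.append(current)
            current = [i]
    if len(current) > 1:
        runs.append(current)
    return runs


# Tag sequences of five sibling subtrees: a header, three records
# (the middle one lacks an optional item) and a footer.
siblings = [
    ["div", "h1"],
    ["li", "a", "span", "span"],
    ["li", "a", "span"],
    ["li", "a", "span", "span"],
    ["div", "a"],
]
runs = repeated_runs(siblings)   # indices of siblings that form a repeated run
```

In this example the three `li` subtrees are grouped as one run even though one of them is missing an optional item, because a single missing tag costs only one edit. Visual features such as block size and alignment, as suggested in item 4, could be added as further terms in `similarity`.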
Keywords/Search Tags:Information Extraction, Web mining, Web Structure Data, Deep Web, Browser Navigator, Firefox Extension