Research And Application Of Web Data Extraction Mode Based On Tree Structure

Posted on: 2012-07-22 | Degree: Master | Type: Thesis
Country: China | Candidate: Y S Gao | Full Text: PDF
GTID: 2178330335454427 | Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Web has become an important channel for global information dissemination and sharing, and how to extract valuable information from the huge volume of Web documents has become a focus of research. A Web page is a kind of semi-structured data with a complex document structure and varied data forms, so extracting useful data from it requires dedicated extraction techniques. First, this paper reviews the research status of data extraction technology and discusses the concepts and techniques of the data extraction process, mainly SGML, XML, and several current data extraction methods, and analyzes the advantages and disadvantages of these methods. Then, according to the structural characteristics of Web pages, the paper analyzes how text at different positions in a page contributes to data extraction and distinguishes effective data from noise data. Regular expressions are used to eliminate noise from the content to be extracted while retaining the title, body text, keywords, topic description, related link text, and other content that contributes to data extraction. On this basis, the paper designs a Web data extraction model based on a tree structure: the preprocessed Web page is converted to a DOM tree, and extraction rules are generated through user interaction. The paper then improves the top-down tree matching algorithm to segment and match the DOM tree, extract the data useful to the user, generate XML files, and map them to storage.
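The pipeline described above (regex noise removal, DOM tree construction, top-down tree matching) can be illustrated with a minimal sketch. It is not the improved algorithm from the thesis, which the abstract does not specify; it uses the classic Simple Tree Matching recurrence as a stand-in. The names `NOISE`, `VOID`, `Node`, `TreeBuilder`, `parse`, and `simple_tree_matching` are hypothetical, the noise pattern only covers script, style, and comment blocks, and the tree builder assumes properly nested markup.

```python
import re
from html.parser import HTMLParser

# Assumed noise pattern: script blocks, style blocks, and HTML comments.
NOISE = re.compile(r"<(script|style)\b.*?</\1\s*>|<!--.*?-->", re.S | re.I)
# Elements that never have a closing tag, so they must not stay on the stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """A DOM-like element node that keeps only the tag name and children."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple element tree; assumes properly nested input."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open element.
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()


def parse(html):
    """Strip noise with the regex, then build the element tree."""
    builder = TreeBuilder()
    builder.feed(NOISE.sub("", html))
    builder.close()
    return builder.root


def simple_tree_matching(a, b):
    """Top-down matching: size of the largest order-preserving match
    between two trees, counting the roots. Returns 0 if the roots differ."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1


if __name__ == "__main__":
    template = parse("<ul><li><a></a></li><li><a></a></li></ul>")
    page = parse("<ul><li><a></a></li><script>x()</script><li><b></b></li></ul>")
    print(simple_tree_matching(template, page))      # 5 of 6 template nodes match
    print(simple_tree_matching(template, template))  # 6, a tree matches itself fully
```

Because matching is top-down, a mismatch at a parent node stops the comparison for that whole subtree, which keeps the procedure cheap on large pages at the cost of missing matches that sit below differing ancestors.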
In Web data preprocessing, HTML documents are normalized using the principle of a symmetrical binary tree. Finally, the paper applies the data extraction model to an actual project, "shipping configuring", and uses that system as the background for experiments on the model. In the experiments, the model achieves 93.6% precision and an average recall of 96.5%, compared with 95.5% precision and 92% recall for the classical XWrap system, which the paper presents as reflecting the advantage of the extraction model. Applied to the project, the model supports the collection and storage of port and ship information, and provides accurate and valid data support for ship selection, automatic stowage, and automatic generation of the stowage plan in shipping configuring.
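The abstract does not state how precision and recall were computed. A minimal sketch, assuming the standard set-based definitions (precision = correct extractions / all extractions, recall = correct extractions / all relevant items), could look like the following. The function name `precision_recall` and the sample records are hypothetical and are not data from the thesis.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for a set of extracted items
    against a ground-truth set of relevant items."""
    extracted, relevant = set(extracted), set(relevant)
    correct = len(extracted & relevant)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical record identifiers, for illustration only.
    extracted = {"ship_1", "ship_2", "ship_3", "port_9"}
    relevant = {"ship_1", "ship_2", "ship_3", "ship_4", "ship_5"}
    print(precision_recall(extracted, relevant))  # (0.75, 0.6)
```

Under these definitions a system can lead on one measure and trail on the other, which is why the two figures are reported together.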
Keywords/Search Tags: Web Data Extraction, DOM Tree, Tree Matching, Selection Rules