Research On Interactive Web Data Extraction Based On Tree Matching

Posted on:2006-07-12
Degree:Master
Type:Thesis
Country:China
Candidate:Z W Qu
Full Text:PDF
GTID:2178360182976235
Subject:Management Science and Engineering
Abstract/Summary:
Traditionally, users retrieve Web data by browsing and keyword searching, which are intuitive ways of accessing data on the Web. However, these strategies have several limitations. Browsing is not suitable for locating particular items of data. Keyword searching is sometimes more efficient than browsing, but it often returns vast amounts of data, far beyond what the user can handle. New Web applications in e-business, such as stock market monitoring and online price comparison, require much more than browsing and keyword searching. Unlike traditional information retrieval techniques, Web data extraction methods aim to find the Web documents a user is interested in within a document collection and to extract structured data from the documents found.

A large amount of information on the Web is stored in hidden databases. Such information is dynamically generated in response to users' queries. The HTML code of the resulting Web data rows is highly similar in structure, so the corresponding DOM sub-trees are naturally similar to each other.

An approach based on a TOP-DOWN sub-tree matching algorithm for interactive, query-related Web data extraction is presented here. The DOM tree is the basis of the approach for analyzing and extracting data from HTML, and the one-to-one relationship between a Web entity and a DOM sub-tree is its central idea. Data extraction rules are generated interactively, and rule generation is improved by using a multiple sequence alignment algorithm. A sub-tree dividing method combined with the TOP-DOWN tree matching algorithm is used to discover and extract data entities from DOM trees.

The prototype was implemented using a component-based system development approach. The experimental results show high accuracy in terms of recall and precision.
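The abstract does not spell out the matching procedure, so the following is only a minimal sketch of one well-known top-down tree matching scheme (often called Simple Tree Matching), offered as an assumption about the general technique rather than as the thesis's actual algorithm. The `Node` class, the function names, and the normalized `similarity` score are all hypothetical, introduced purely for illustration. Two sub-trees match only if their roots carry the same tag, and their ordered children are then aligned by dynamic programming.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the sub-tree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def top_down_match(a: Node, b: Node) -> int:
    """Count matched node pairs in a maximum top-down matching of a and b.

    Roots must have the same tag, otherwise nothing below them can match.
    Children are aligned in order with a longest-common-subsequence style
    table whose cell weights are the recursive match counts.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = table[i - 1][j - 1] + top_down_match(
                a.children[i - 1], b.children[j - 1]
            )
            table[i][j] = max(table[i - 1][j], table[i][j - 1], paired)
    return table[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Assumed normalization: matched nodes relative to both tree sizes."""
    return 2 * top_down_match(a, b) / (size(a) + size(b))


# Two result rows that differ by one empty cell.
row1 = Node("tr", [Node("td", [Node("a")]), Node("td"), Node("td", [Node("b")])])
row2 = Node("tr", [Node("td", [Node("a")]), Node("td", [Node("b")])])
print(top_down_match(row1, row2))  # 5 matched nodes
```

In a sketch like this, candidate DOM sub-trees whose similarity to a user-selected example row exceeds some threshold could be treated as further data rows; the threshold and the way examples are selected are left open here because the abstract does not describe them.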
Keywords/Search Tags:Web Data Extraction, TOP-DOWN Tree Matching, DOM, Component System