Font Size: a A A

Research On Automatic And Efficient Technologies For Web Information Extraction

Posted on:2014-04-06Degree:DoctorType:Dissertation
Country:ChinaCandidate:X Y SongFull Text:PDF
GTID:1268330422952091Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
The World Wide Web has become an important resource of information due to itsexplosive growth and spread in the past two decades. The tremendous amount of web datahas opened a new era for data analysis and mining systems. More and more web applica-tions need to extract, mine, and integrate data from enormous data sources. However, dueto the semi-structure characteristic of web pages, web data exhibited on web pages is notdirectly consumable by machines. Web information extraction aims at extracting struc-tured data from web pages, which is a very challenging problem due to the large-scaleand highly-heterogeneous characteristic of web information.Aiming at handling the large-scale and highly-heterogeneous characteristics of webinformation, this dissertation studies automatic and efcient technologies for web infor-mation extraction, conducted on two levels of data records and data units respectively.The research content includes:1. Targeting the high-heterogeneity of web information, a novel automatic datarecord extraction method called MiBAT (Mining data records Based on Anchor Trees) isproposed. Existing similarity-based automatic approaches cannot extract web data record-s accurately when a large amount of unstructured content exists (e.g., user-generated con-tent). This paper presents the concept of pivots, which correspond to some key data units.For example, almost every data record created and posted by users (e.g., online forumposts, user reviews, etc.) contains the publication date as a key data unit, which is apivot derived by domain constraints. The proposed MiBAT method detects pivots basedon domain constraints, identifies anchor trees that are DOM (Document Object Model)sub-trees containing the pivots, and finally extracts data records around the anchor treesautomatically. Experimental results show that, compared to existing approaches, MiBATis able to overcome the irregularity of data records caused by unstructured content, result-ing in high accuracy.2. Targeting the large-scale of web information on the level of data records, a fastand efcient anchor tree finding algorithm is proposed. In web mining community, thetraditional mining approach is to enumerate sub-trees in a top-down manner; followingthis approach, the time complexity of MiBAT is O(n2), where n is the number of nodeson the DOM tree of the web page. In this paper, a novel anchor tree finding algorithm is presented based on aggregating tag paths in a bottom-up fashion, which enables MiBAT torun in O(n log n) time. Experimental results demonstrate that the new method significantlyimproves the efciency of MiBAT while remaining high accuracy.3. Targeting the cross-domain high-heterogeneity of web information, the conceptof generic pivots is proposed. The concept of pivots origins from domain constraints,corresponding to some domain-dependent key data units. In real applications, diferentdomain constraints are required to be identified for diferent domains, which limits theapplicability of MiBAT to some extent. To resolve the domain dependency, this paperexpands the concept of pivots and proposes generic pivots. Experimental results suggestthat, when using generic pivots, MiBAT is applicable to diferent domains achieving highaccuracy, without any pre-defined domain constraints.4. Targeting the large-scale of web information on the level of data units, a fastand efcient method for DOM tree matching is proposed for data unit alignment andextraction. The most widely used tree matching algorithm runs in O(n2) time, whichis not appropriate for web-scale processing. This paper proposes a novel tree matchingmethod based on the longest common subsequence (LCS) of the tag path sequences ofDOM trees. By exploring the inherent sparsity of the LCS problem, the proposed treematching method runs in O(r log n) time, where r is the number of pairs of nodes thathave identical tag paths from the two trees; when the matching is sparse, r≈O(n),and the algorithm runs in O(n log n) time approximately. Extensive experimental resultsdemonstrate that, compared to the existing method, the proposed approach significantlyimproves the running efciency and also achieves similar tree matching results as well asdata unit alignment results.In summary, this dissertation presents technologies for automatic and efcient webinformation extraction on both levels of data records and data units, which can well handlethe large-scale and highly-heterogeneous characteristics of web information with boththeoretical and application value.
Keywords/Search Tags:Web Information Extraction, Wrapper, Template, DOM Tree Matching, TagPaths
PDF Full Text Request
Related items