Research On Automatic And Efficient Technologies For Web Information Extraction

Posted on:2014-04-06

Degree:Doctor

Type:Dissertation

Country:China

Candidate:X Y Song

Full Text:PDF

GTID:1268330422952091

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

The World Wide Web has become an important resource of information due to itsexplosive growth and spread in the past two decades. The tremendous amount of web datahas opened a new era for data analysis and mining systems. More and more web applica-tions need to extract, mine, and integrate data from enormous data sources. However, dueto the semi-structure characteristic of web pages, web data exhibited on web pages is notdirectly consumable by machines. Web information extraction aims at extracting struc-tured data from web pages, which is a very challenging problem due to the large-scaleand highly-heterogeneous characteristic of web information.Aiming at handling the large-scale and highly-heterogeneous characteristics of webinformation, this dissertation studies automatic and efcient technologies for web infor-mation extraction, conducted on two levels of data records and data units respectively.The research content includes:1. Targeting the high-heterogeneity of web information, a novel automatic datarecord extraction method called MiBAT (Mining data records Based on Anchor Trees) isproposed. Existing similarity-based automatic approaches cannot extract web data record-s accurately when a large amount of unstructured content exists (e.g., user-generated con-tent). This paper presents the concept of pivots, which correspond to some key data units.For example, almost every data record created and posted by users (e.g., online forumposts, user reviews, etc.) contains the publication date as a key data unit, which is apivot derived by domain constraints. The proposed MiBAT method detects pivots basedon domain constraints, identifies anchor trees that are DOM (Document Object Model)sub-trees containing the pivots, and finally extracts data records around the anchor treesautomatically. Experimental results show that, compared to existing approaches, MiBATis able to overcome the irregularity of data records caused by unstructured content, result-ing in high accuracy.2. Targeting the large-scale of web information on the level of data records, a fastand efcient anchor tree finding algorithm is proposed. In web mining community, thetraditional mining approach is to enumerate sub-trees in a top-down manner; followingthis approach, the time complexity of MiBAT is O(n2), where n is the number of nodeson the DOM tree of the web page. In this paper, a novel anchor tree finding algorithm is presented based on aggregating tag paths in a bottom-up fashion, which enables MiBAT torun in O(n log n) time. Experimental results demonstrate that the new method significantlyimproves the efciency of MiBAT while remaining high accuracy.3. Targeting the cross-domain high-heterogeneity of web information, the conceptof generic pivots is proposed. The concept of pivots origins from domain constraints,corresponding to some domain-dependent key data units. In real applications, diferentdomain constraints are required to be identified for diferent domains, which limits theapplicability of MiBAT to some extent. To resolve the domain dependency, this paperexpands the concept of pivots and proposes generic pivots. Experimental results suggestthat, when using generic pivots, MiBAT is applicable to diferent domains achieving highaccuracy, without any pre-defined domain constraints.4. Targeting the large-scale of web information on the level of data units, a fastand efcient method for DOM tree matching is proposed for data unit alignment andextraction. The most widely used tree matching algorithm runs in O(n2) time, whichis not appropriate for web-scale processing. This paper proposes a novel tree matchingmethod based on the longest common subsequence (LCS) of the tag path sequences ofDOM trees. By exploring the inherent sparsity of the LCS problem, the proposed treematching method runs in O(r log n) time, where r is the number of pairs of nodes thathave identical tag paths from the two trees; when the matching is sparse, r≈O(n),and the algorithm runs in O(n log n) time approximately. Extensive experimental resultsdemonstrate that, compared to the existing method, the proposed approach significantlyimproves the running efciency and also achieves similar tree matching results as well asdata unit alignment results.In summary, this dissertation presents technologies for automatic and efcient webinformation extraction on both levels of data records and data units, which can well handlethe large-scale and highly-heterogeneous characteristics of web information with boththeoretical and application value.

Keywords/Search Tags:

Web Information Extraction, Wrapper, Template, DOM Tree Matching, TagPaths

PDF Full Text Request

Related items

1	Research For Information Extraction Based On Wrapper Model Algorithm
2	Research And Application Of Automatic Data Extraction From Template-generated Web Pages
3	Research And Implementation Of Page Object Extraction Model For Vectical Search Engine
4	Research On Web Information Extraction Based On Script Code And Local Data Matching
5	Research Of A Suffix Tree Based Automatic Wrapper Generation Method
6	Tag Tree Template In The Pages Of Critical Information Extraction And Topic Identification
7	Information Extraction Algorithm Based On The Template Matching In Traffic Standards
8	The Literature Information Retrieval And Matching From The Web
9	Research Of Data Extraction Technology Based On Tag Tree From List Pages
10	Research And Implementation Of Automatic Information Extraction From Web Pages