Research Of Web Information Extraction Technology Based On Tree Structure

Posted on:2011-03-27

Degree:Master

Type:Thesis

Country:China

Candidate:C Y Lian

Full Text:PDF

GTID:2178330338976263

Subject:Computer application technology

Abstract/Summary:

With the development of the Internet, the Web has become a large and complex knowledge base, and research on how to extract information from it has become increasingly important. One important class of Web pages is the data-oriented page, which is generated dynamically and can be updated easily. Extracting information from this kind of page is the focus of Web information extraction research.

Building on previous research results and a study of information extraction theory, this paper presents a new Web information extraction method based on tree structure for extracting information from data-oriented pages. The work carried out around this method is as follows.

Firstly, a new method for converting HTML into XML is proposed. As the pre-processing module, the transformation from an HTML file to an XML file plays a key role in Web information extraction. The conversion method presented here is based on a binary tree and can handle three typical errors in HTML.

Secondly, a novel procedure for locating data records is put forward. It locates data records in three steps. First, the main content areas are identified according to the out-degree of every node. Next, all data regions in each main content area are picked out; for this step a weight-based tree matching algorithm called STMCTN is proposed, and related algorithms, such as the one for calculating the similarity of trees, are improved accordingly. Finally, the data records are identified within each data region. Experimental results demonstrate the effectiveness and accuracy of this procedure.

Thirdly, this paper presents an effective approach to aligning data attributes. Once all the data records have been found, the records of the same class must be compared and their data attributes aligned. The approach proposed for this purpose is based on clustering and tree alignment, and it avoids alignment results that contain too much redundancy. The performance of the proposed methods is analysed through experiments.
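
To make the idea of tree similarity concrete, here is a minimal sketch in Python. The abstract does not specify how STMCTN works, so the code below shows the classic Simple Tree Matching algorithm instead, a common unweighted baseline for comparing tag trees; it is not the algorithm proposed in the thesis. The `Node` class, the function names, and the normalisation by average tree size are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tag-tree node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Classic Simple Tree Matching: size of the largest order-preserving
    matching between two trees whose roots carry the same tag."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # best[i][j] = best matching between the first i children of a
    # and the first j children of b (dynamic programming table).
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            best[i][j] = max(best[i][j - 1],
                             best[i - 1][j],
                             best[i - 1][j - 1] + pair)
    return best[m][n] + 1  # +1 counts the matched roots


def similarity(a: Node, b: Node) -> float:
    """Matching size normalised by the average tree size (assumed measure)."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


# Two table rows that differ by one extra child element.
row1 = Node("tr", [Node("td"), Node("td")])
row2 = Node("tr", [Node("td"), Node("td"), Node("a")])
print(similarity(row1, row2))
```

Sibling subtrees whose similarity exceeds a chosen threshold could then be grouped as candidate data records; the threshold value, like the rest of this sketch, would be an assumption rather than something the abstract states.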

Keywords/Search Tags:

Web Information Extraction, Data-oriented Pages, Binary Tree, Data Record Position, STMCTN Algorithm, Hierarchical Clustering, Alignment of Data Attributes

Related items

1	Research On Techniques Of Automatic Data Record Analysis And Recognition For Accurate Web Information Extraction
2	Research On Efficient Web Data Extraction Technology Based On Visual Information
3	Research Of Data Extraction Technology Based On Tag Tree From List Pages
4	The Research Of Semi-structured Web Pages Information Extraction
5	Research On Deep Web Data Acquisition Based On Visual Information And DOM Tree
6	Research And Application In Automatic Data Extraction From WEB Pages
7	A Dynamic Learning Framework To Automatically Extract Structured Data From Web Pages Without Human Efforts
8	The Study And Application Of Web Text Data Mining Technology Based On The Approximate Pages Clustering Algorithm
9	A Research Of Multi-source Character Attributes Data Fusion
10	Research And Application Of Automatic Data Extraction From Template-generated Web Pages