
Research Of Web Information Extraction Technology Based On Tree Structure

Posted on: 2011-03-27
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Lian
Full Text: PDF
GTID: 2178330338976263
Subject: Computer application technology
Abstract/Summary:
With the development of the Internet, the Web has become a large and complex knowledge base, and research on how to extract information from it has become increasingly important. One important class of Web pages is the data-oriented page, which is generated dynamically and can be updated easily. Extracting information from pages of this kind is the focus of Web information extraction research. Building on previous research results and on information extraction theory, this thesis presents a new Web information extraction method based on tree structure for this class of pages. The work is organized around three contributions.

First, a new method for converting HTML into XML is proposed. As the pre-processing module of Web information extraction, the transformation from an HTML file to an XML file plays a key role. The conversion method is based on a binary tree and can handle three typical errors in HTML.

Second, a novel method for locating data records is put forward. It works in three steps. First, the main content areas are identified according to the out-degree of every node. Next, all data regions in each main content area are picked out; for this step a weight-based tree matching algorithm called STMCTN is proposed, and related algorithms, such as the one for calculating tree similarity, are improved accordingly. Finally, the data records are extracted from each data region. Experimental results demonstrate the effectiveness and accuracy of this method.

Third, an effective approach to aligning data attributes is presented. Once all the data records are found, the records of the same class must be compared and their data attributes aligned. The proposed approach is based on clustering and tree alignment, and it avoids producing alignment results with excessive redundancy. The performance of the proposed methods is analysed through experiments.
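The abstract does not give the details of STMCTN, so the following is only a minimal sketch of the classic, unweighted Simple Tree Matching algorithm that weight-based variants of this kind commonly build on. It is not the thesis's algorithm. The `Node` class, the function names, and the normalized similarity formula are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical DOM-like tree node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the largest order-preserving matching between trees a and b.

    Classic unweighted Simple Tree Matching, NOT the weighted STMCTN
    variant named in the abstract.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i children of a
    # and the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Assumed normalization: matched nodes over the average tree size."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


# Two candidate data records that differ by one nested link.
rec1 = Node("tr", [Node("td"), Node("td", [Node("a")])])
rec2 = Node("tr", [Node("td"), Node("td")])
score = similarity(rec1, rec2)
```

Adjacent subtrees whose similarity exceeds a chosen threshold would be candidates for the same data region. A weighted variant could, for example, replace the constant `+ 1` with a per-node weight, but how STMCTN assigns weights is not stated in the abstract.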
Keywords/Search Tags: Web Information Extraction, Data-oriented Pages, Binary Tree, Data Record Position, STMCTN Algorithm, Hierarchical Clustering, Alignment of Data Attributes