WEB Information Extraction Based On Semantic DOM

Posted on: 2013-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Mo
Full Text: PDF
GTID: 2248330371989013
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the Web has become the world’s largest distributed, shared information resource. With a resource of this size, obtaining useful information has become a pressing problem, and search engine technology has developed rapidly in response. However, because Web pages are structurally complex, heterogeneous, dynamic, and open, the retrieval performance of current search engines remains unsatisfactory. To improve it, data mining techniques and structured processing of Web pages have been introduced. One of the important research problems in structured Web page processing is Web page information extraction.
In this thesis, to address the complexity and heterogeneity of Web page data, a technique for automatic Web information extraction based on the semantic DOM is established. Within this technique, template extraction rules, content extraction based on DOM tree information, and content extraction based on DOM semantics are studied.
First, the development history of page information extraction technology and the state of research in China and abroad are introduced. Typical Web information extraction technologies are then compared comprehensively, and their advantages and disadvantages are pointed out. Finally, semantic tags, the DOM model, and XHTML theory and programming practice are introduced in detail.
The information extraction technology in this thesis is based on the DOM (Document Object Model) and tag semantics. The DOM is a W3C standard: a tree data structure that describes Web documents and provides a standard interface for accessing page nodes.
Tag semantics is likewise advocated by the W3C: tags are chosen to indicate the meaning of the data they contain, so that the data in HTML pages can be identified and parsed by more software. Next, the architecture, design methods, and processes of information extraction based on the semantic DOM (Document Object Model) are elaborated. First, the standardization of HTML is discussed. A DOM-based parser then converts the HTML or XHTML text into a DOM tree, which improves extraction efficiency. Template detection follows, and finally some branches of the DOM tree are cut off, with noise reduced by pruning according to semantic tags and text weights, to form a clean DOM tree. The useful information extracted from this tree can then be formatted and displayed to users.
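The pipeline described above (parse the page into a DOM tree, cut noisy branches by tag semantics and text weight, then read the remaining text) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the noise tag set, the container tag set, the link-text ratio used as the "text weight", and the 0.5 threshold are all assumptions made for illustration, and the template detection step is omitted.

```python
from html.parser import HTMLParser

# Assumed tag sets; the abstract does not say which tags count as noise.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside", "form"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
# Containers on which the (assumed) text-weight test is applied.
WEIGHTED_TAGS = {"div", "ul", "ol", "table", "section"}


class Node:
    """A simplified DOM node; text is stored in '#text' child nodes."""

    def __init__(self, tag, parent=None, text=""):
        self.tag = tag
        self.parent = parent
        self.text = text
        self.children = []

    def total_text(self):
        if self.tag == "#text":
            return self.text
        parts = (child.total_text() for child in self.children)
        return " ".join(p for p in parts if p)

    def link_text(self):
        # Text that sits inside <a> elements below this node.
        if self.tag == "a":
            return self.total_text()
        parts = (child.link_text() for child in self.children)
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML text."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open ancestor with this tag name, if any.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.current.children.append(Node("#text", self.current, stripped))


def prune(node, max_link_ratio=0.5):
    """Remove noise branches in place (semantic tags, then text weight)."""
    kept = []
    for child in node.children:
        if child.tag in NOISE_TAGS:
            continue  # semantic pruning: the tag itself marks noise
        if child.tag != "#text":
            prune(child, max_link_ratio)
            text = child.total_text()
            if not text:
                continue  # nothing left to extract from this branch
            if child.tag in WEIGHTED_TAGS:
                # Weight pruning: containers that are mostly link text
                # are treated as navigation.
                if len(child.link_text()) / len(text) > max_link_ratio:
                    continue
        kept.append(child)
    node.children = kept


def extract(html):
    """Parse, prune, and return the remaining text of the page."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return builder.root.total_text()


SAMPLE_HTML = """
<html><body>
<div id="menu"><ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul></div>
<div><h1>Title</h1><p>Body text with one <a href="#">link</a> inside it.</p></div>
<script>var x = 1;</script>
<footer>Copyright</footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract(SAMPLE_HTML))
```

On the sample page, the menu list is dropped because it consists entirely of link text, the script and footer are dropped by tag name, and only the title and body paragraph remain. A real system would need a more tolerant HTML normalization step and a tuned weighting scheme; the link-text ratio here is only one plausible stand-in for the text weighting mentioned in the abstract.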
Keywords/Search Tags: Web data mining, DOM trees, semantics, noise data