Web Information Extraction Based on Semantic DOM

Posted on:2013-06-26

Degree:Master

Type:Thesis

Country:China

Candidate:Z Y Mo

GTID:2248330371989013

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the Internet, the Web has become the world's largest distributed, shared information resource. At this scale, obtaining useful information has become a pressing problem, and search engine technology has developed extensively in response. However, because Web page structure is complex, heterogeneous, dynamic, and open, the retrieval performance of current search engines remains unsatisfactory. To improve it, data mining techniques and structured processing of Web pages have been introduced. One of the important research problems in structured processing of Web pages is Web page information extraction.

To address the complexity and heterogeneity of Web page data, this thesis establishes an automatic Web information extraction technique based on a semantic DOM. It studies extraction template rules, content extraction based on DOM tree information, and content extraction based on DOM semantics.

The thesis first introduces the development history of Web page information extraction and the state of research in China and abroad. It then compares typical Web information extraction techniques and points out their advantages and disadvantages. This part closes with a detailed introduction to semantic tags, the DOM model, and XHTML theory and programming practice.

The information extraction technique in this thesis is based on the DOM (Document Object Model) and tag semantics. The DOM is a W3C standard that describes Web documents as a tree data structure and provides a standard interface for accessing the nodes of a page.
Tag semantics is also a practice advocated by the W3C: tags are chosen to convey the meaning of the data they contain, so that the data in HTML pages can be identified and parsed by more software.

Next, the thesis elaborates on the architecture, design methods, and processes of information extraction based on the semantic DOM. First, the HTML is standardized. A DOM-based parser then converts the HTML or XHTML text into a DOM tree, which improves extraction efficiency. After template detection, some branches of the DOM tree are cut off, and noise is reduced through pruning based on semantic tags and text weights, yielding a clean DOM tree. Finally, the useful information extracted from this DOM tree is formatted and displayed to users.
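
The parse-then-prune steps described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract gives no algorithmic detail, so the noise tag set, the container tag set, the link-text ratio used as a stand-in for "text weight", and all names (`Node`, `TreeBuilder`, `prune`, `extract`) are assumptions made for illustration. Template detection and HTML standardization are omitted.

```python
from html.parser import HTMLParser

# Assumed tag sets; the thesis abstract does not list which tags count as noise.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside", "form", "iframe"}
CONTAINER_TAGS = {"div", "ul", "ol", "table", "section", "td"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Minimal DOM node. Text is stored in child nodes tagged '#text'."""

    def __init__(self, tag, parent=None, text=""):
        self.tag = tag
        self.parent = parent
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple tree of Node objects from HTML text."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            text = " ".join(data.split())  # collapse runs of whitespace
            self.current.children.append(Node("#text", self.current, text))


def text_length(node):
    """Number of text characters in the subtree rooted at node."""
    return len(node.text) + sum(text_length(c) for c in node.children)


def link_length(node):
    """Number of text characters that sit inside <a> elements."""
    if node.tag == "a":
        return text_length(node)
    return sum(link_length(c) for c in node.children)


def prune(node, max_link_ratio=0.5):
    """Cut noise-tag branches and containers that are empty or mostly links."""
    kept = []
    for child in node.children:
        if child.tag in NOISE_TAGS:
            continue
        if child.tag in CONTAINER_TAGS:
            total = text_length(child)
            if total == 0 or link_length(child) / total > max_link_ratio:
                continue
        prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def extract_text(node):
    """Collect the remaining text fragments in document order."""
    if node.tag == "#text":
        return [node.text]
    parts = []
    for child in node.children:
        parts.extend(extract_text(child))
    return parts


def extract(html, max_link_ratio=0.5):
    """Parse HTML into a tree, prune noise branches, and return the text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root, max_link_ratio)
    return extract_text(builder.root)
```

Calling `extract(page_source)` on a page with a link-only menu `div`, a `script` block, and a `footer` would drop those branches and return only the text fragments of the remaining content. The 0.5 link-ratio threshold is an arbitrary starting value and would need tuning on real pages.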

Keywords/Search Tags:

Web data mining, DOM trees, semantic, noise data

Related items

1	Research And Applications Of Data Mining
2	Application Of Data Warehouse And Data Mining In Labour Resource Management
3	Super Data The Integrated Mining Method And Technology Research
4	Research Of Data Mining Algorithm Based On Association Rules
5	Research On Data Mining Of Cotton-spinning Quality
6	Data Mining And Its Applications In The Field Of Medicine
7	Based On Data Mining Technology, Crm Research
8	A Design And Implement Of Internet Intelligence Mining System About Semantic-based Information Extraction
9	Research Of The Decision Trees And Its Application
10	The Design And Implementation Of A Semantic Web System Based On Data Mining