
Research On WEB Topic Information Extraction Based On DOM Tree Node Importance

Posted on: 2017-09-05
Degree: Master
Type: Thesis
Country: China
Candidate: J N Ma
GTID: 2358330503983645
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Internet technology, the volume of data carried by the Web grows day by day. Problems such as information redundancy, heterogeneous formats, and processing difficulty are becoming increasingly prominent, and Web information extraction research has emerged in response. A Web page contains a large amount of irrelevant information, which hinders users from quickly locating and obtaining the topic information. Extracting the topic information from the page is therefore important: it saves users time and effort, and the results can also be used in data mining and other applications.

Web information extraction mainly targets unstructured or semi-structured Web pages, and mainstream approaches are mostly based on HTML structure. Existing studies that focus on the structure of HTML tags tend to ignore the semantic information of the tags, or the effect of that semantic information on the text the tags contain. Taking into account tag structure, tag semantics, and the influence of tag semantics on the text, this paper proposes a method of Web topic information extraction based on the importance of DOM tree nodes. The research work includes the following:

(1) The importance of DOM tree nodes is introduced. Because the structure and the semantic information of tags are related, the two are considered together and the tags are divided into categories. The corresponding DOM tree node classes are block nodes, line nodes, visual nodes, link nodes, text nodes, and other nodes. An impact factor is set for each type of tag according to its effect on the topic information, and node importance is defined as a unified measure of the impact of DOM tree nodes on the topic information.

(2) The extended DOM tree model is proposed. To avoid processing the DOM tree at too fine a granularity, the extended DOM tree model simplifies the tree, retaining only block nodes that can carry topic information. As non-block nodes are merged into block nodes, the node importance is updated. Because the semantic information of a tag influences the text it contains, the importance is calculated differently depending on the types of the nodes being merged. Once merging is complete, an extended DOM tree model annotated with node importance is obtained.

(3) A method of Web page topic information extraction based on the extended DOM tree model is presented. It consists of four steps: cleaning the page, building the extended DOM tree, de-noising the extended DOM tree, and extracting the topic information. In the de-noising step, noise is removed from the extended DOM tree by setting a threshold on node importance. Finally, a prototype is implemented; experiments are used to select an appropriate node importance threshold and to verify the effectiveness of the proposed method, showing that it achieves good extraction results.
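The four steps named in the abstract (cleaning, building the extended DOM tree, de-noising by threshold, and extraction) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the abstract gives no tag lists, impact-factor values, or importance formula, so the tag sets `BLOCK_TAGS` and `SKIP_TAGS`, the `IMPACT` factors, the length-times-factor importance score, and the names `Block`, `ExtendedDomBuilder`, and `extract_topic` are all assumptions chosen only to make the idea concrete.

```python
from html.parser import HTMLParser

# Hypothetical values: the abstract names the node categories but gives
# no concrete tag lists or impact factors.
BLOCK_TAGS = {"body", "div", "p", "td", "li", "h1", "h2", "h3"}
IMPACT = {"a": 0.2, "b": 1.5, "strong": 1.5, "em": 1.2}  # link / visual nodes
SKIP_TAGS = {"script", "style"}  # dropped during page cleaning


class Block:
    """A block node of the simplified (extended) tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []          # text merged in from non-block descendants
        self.importance = 0.0   # assumed score: sum of factor * text length


class ExtendedDomBuilder(HTMLParser):
    """Keeps only block nodes; merges other nodes' text into the nearest block."""

    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self.current = self.root
        self.inline_stack = []  # currently open link/visual tags
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            node = Block(tag, self.current)
            self.current.children.append(node)
            self.current = node
        elif tag in IMPACT:
            self.inline_stack.append(tag)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            if self.current.tag == tag and self.current.parent is not None:
                self.current = self.current.parent
        elif self.inline_stack and self.inline_stack[-1] == tag:
            self.inline_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        # The enclosing tags' semantics scale the contribution of the text.
        factor = 1.0
        for tag in self.inline_stack:
            factor *= IMPACT[tag]
        self.current.text.append(text)
        self.current.importance += factor * len(text)


def extract_topic(html, threshold):
    """Return the text of block nodes whose importance reaches the threshold."""
    builder = ExtendedDomBuilder()
    builder.feed(html)
    builder.close()
    kept = []

    def walk(node):
        if node.text and node.importance >= threshold:
            kept.append(" ".join(node.text))
        for child in node.children:
            walk(child)

    walk(builder.root)
    return kept
```

In this sketch link text is down-weighted, so a navigation block made up mostly of links scores low and falls below the threshold, while a paragraph of plain or emphasized text scores high. The threshold itself would have to be chosen experimentally, as the abstract describes.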
Keywords/Search Tags: WEB Information Extraction, Extended DOM Tree, Importance of Node