Research On Web Information Extraction Technology Based On Information Entropy

Posted on: 2014-02-05  Degree: Master  Type: Thesis
Country: China  Candidate: Q Zhang  Full Text: PDF
GTID: 2248330398957586  Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, a huge number of web documents have been produced, and the number is still growing rapidly. However, because Web information is heterogeneous and its structure changes dynamically, users often find it hard to quickly locate useful information among such vast Web resources. How to filter information in a timely and accurate manner, extract useful information or knowledge for users from such a huge source of network information, and organize it into a unified knowledge database for querying and retrieval has become an important topic in the study of artificial intelligence and the Internet. The task of Web information extraction is to extract the information of interest to users from Web documents. Web information extraction takes semi-structured Web documents as input, extracts the information users need from vast amounts of disordered information, and stores it in a database in a structured format so that users can search and analyze it. Because noisy information has been removed, using the extracted information as the input to systems such as web document classification and clustering, information retrieval, question answering, and Web data mining can effectively improve the performance of those systems. A Web document provides users with a large amount of information mixed with a lot of noise, such as hidden information automatically generated by machines and manually added redundant information; only part of it, the core information, is of interest to users. This noise makes it difficult to extract information accurately from Web documents.
In this thesis, a Web document is divided into three parts: the core information, the redundant information, and the hidden information. Web information extraction is thereby converted into the problem of removing noisy information, namely the redundant and hidden information, from the web page. We propose an information-entropy-based Web information extraction method that exploits how the information in a web page is distributed across the web page set and combines it with the structure of the DOM tree and statistical theory to automatically identify noisy information and retain the key information. The method first parses the web page into a DOM tree and removes the hidden information. After the text of the leaf nodes is segmented, the distribution of keywords is recorded, and the Mean Entropy Criteria and Joint Entropy Criteria proposed in this thesis are used to calculate the mean entropy and joint entropy of each leaf node, from which the node's ADMJ (the Absolute Difference between Mean Entropy Criteria and Joint Entropy Criteria) value is computed. Leaf nodes are then aggregated into blocks according to the structure of the DOM tree, and the ADMJ value of the <body> tag is calculated recursively and used as the threshold that distinguishes noise from non-noise. To verify the effectiveness of the proposed method, we conducted experiments on several well-known websites. The experimental results show that the proposed method achieves better performance than other state-of-the-art methods.
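The pipeline described above can be illustrated with a minimal Python sketch. The abstract does not give the formulas for the Mean Entropy Criteria or the Joint Entropy Criteria, so everything below is an assumption made for illustration: hidden information is taken to mean `<script>` and `<style>` content, segmentation is a plain whitespace split, the mean entropy is the average Shannon entropy of each keyword's occurrence counts across the page set, and the joint entropy is the Shannon entropy of the pooled counts of all keywords in the node. The names `leaf_texts`, `segment`, `keyword_entropy`, and `admj` are hypothetical, and the block aggregation and `<body>` threshold steps are omitted.

```python
import math
from html.parser import HTMLParser

# Assumption: "hidden information" is approximated by script and style content.
HIDDEN_TAGS = {"script", "style"}


class LeafTextParser(HTMLParser):
    """Collects visible text runs, skipping text inside HIDDEN_TAGS."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.hidden_depth == 0:
            self.texts.append(text)


def leaf_texts(html):
    """Return the visible text runs of an HTML page, in document order."""
    parser = LeafTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def segment(text):
    """Assumed segmentation: lowercase whitespace split, not a real word segmenter."""
    return text.lower().split()


def shannon_entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def keyword_entropy(keyword, pages):
    """Entropy of one keyword's occurrence counts across pages (lists of tokens)."""
    return shannon_entropy([page.count(keyword) for page in pages])


def admj(keywords, pages):
    """|mean entropy - joint entropy| for one leaf node, under the assumed definitions."""
    if not keywords:
        return 0.0
    mean_h = sum(keyword_entropy(k, pages) for k in keywords) / len(keywords)
    # Assumed joint entropy: entropy of the pooled per-page counts of all keywords.
    pooled = [sum(page.count(k) for k in keywords) for page in pages]
    return abs(mean_h - shannon_entropy(pooled))
```

Under these assumed definitions, a keyword that appears evenly on every page of the set (navigation text, for example) has the maximum entropy, while a keyword confined to a single page has an entropy of zero. How the thesis actually turns these quantities into a noise decision is not stated in the abstract.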
Keywords/Search Tags: Web Information Extraction, Information Entropy, DOM tree