| As computers become more widespread, more and more people browse information on the Internet. People now use the Internet in daily life, work, and business, and the Web has become an important way of obtaining information. Web pages contain text, images, videos, music, and so on. Different people are interested in different content, and the content a reader does not care about is scattered around the content they do care about, which distracts attention and makes web pages inconvenient to read. This paper presents a DOM-based Web information extraction method that filters out the content users are not interested in, leaving only the content they are interested in. Rather than searching directly for the content of interest, the method deletes the content that is not of interest. First, using the Eclipse development environment, we parse web pages into a DOM tree with the open-source HTML parser NekoHTML. A depth-first search then recursively traverses every node of the DOM tree to determine whether the node contains content of interest. Nodes that contain content of interest are preserved; nodes that do not are deleted. The extraction algorithm is implemented in Java, and the graphical user interface is developed with JSP and Servlets. Users select the kinds of content they want through the graphical interface, and the extraction algorithm, based on that selection, deletes the content they are not interested in and returns only the content they want. 
The paper first introduces the purpose of studying Web information extraction tools, then analyzes the advantages and disadvantages of 11 types of Web information extraction technology, describes web page types and web page composition, introduces the DOM tree and the open-source HTML parser NekoHTML, and finally designs the Web information extraction algorithm and completes the implementation of the Web information extraction tool. |
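
The delete-by-traversal idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation. It assumes the parsed tree is exposed through the standard `org.w3c.dom` interfaces, it uses the JDK's built-in XML parser on well-formed XHTML as a stand-in for NekoHTML, and it treats "not interested" as a set of tag names chosen by the user. The class `DomPruner` and its methods `prune`, `parse`, and `extractText` are hypothetical names introduced only for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class DomPruner {

    /**
     * Depth-first traversal: remove every element whose tag name is in
     * unwantedTags (an assumed stand-in for "content the user is not
     * interested in"), and recurse into every node that is kept.
     */
    static void prune(Node node, Set<String> unwantedTags) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Save the next sibling first, because removing the child detaches it.
            Node next = child.getNextSibling();
            boolean unwanted = child.getNodeType() == Node.ELEMENT_NODE
                    && unwantedTags.contains(child.getNodeName().toLowerCase(Locale.ROOT));
            if (unwanted) {
                node.removeChild(child);      // drop the whole subtree
            } else {
                prune(child, unwantedTags);   // keep the node, visit its children
            }
            child = next;
        }
    }

    /** Parses well-formed XHTML with the JDK parser (stand-in for NekoHTML). */
    static Document parse(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Parses the page, prunes unwanted elements, and returns the remaining text. */
    static String extractText(String xhtml, Set<String> unwantedTags) throws Exception {
        Document doc = parse(xhtml);
        prune(doc.getDocumentElement(), unwantedTags);
        return doc.getDocumentElement().getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>News text</p>"
                + "<img src=\"ad.png\"/><script>track()</script></body></html>";
        // The user chose to hide images and scripts.
        System.out.println(extractText(page, Set.of("img", "script")));
    }
}
```

Real web pages are rarely well-formed XML, which is why the paper relies on a tolerant HTML parser. With such a parser, only the `parse` step would change, and the traversal would stay the same.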