With the development of the Internet, more and more people rely on the information in web pages, and information extraction from web pages has become a research hotspot in the field of data mining. However, web pages often contain a great deal of clutter (such as pop-up ads, unnecessary images, and extraneous links) that is unrelated to the subject of the page and interferes with the extraction of useful information, so web page cleanup has become very important. Based on an in-depth analysis of the data structure of web pages and of page cleanup techniques, this paper proposes a new web page cleanup technique based on the DOM tree and develops a web page cleanup tool on Eclipse. The tool can effectively clean up most of the information unrelated to the subject of a page, so it has good practical value and promising application prospects.