| The development of information technology has affected the whole world. Because information is present everywhere in human life and influences it, obtaining information has become a focus of the scientific community. In the early days, demand for information was low, so people could easily find what they wanted. As the World Wide Web grew, this became very difficult, until search engines made it easy again. However, the huge number of web pages contains a great deal of web noise, which reduces the precision of search engines and increases server load. First, the thesis introduces the key technologies of the web purification system: the definition and characteristics of DOM (Document Object Model) technology; web page structure, that is, how a page is represented as a tag tree and as a DOM tree; and web page segmentation, in which DOM technology and certain important HTML tags are used to divide a page into segments. The thesis also specifies several page segmentation rules, which help in understanding the implementation of the system. Then, the thesis analyzes the structure of the HuiCong Search Engine, which consists of the web server and SO (Shared Object), the cache, the newest database, the database, and the web purification system, together with the relationships among these components. It also analyzes the search process: the user submits a query string to the CGI program on the web server; the CGI program processes the string and passes it to the search engine system; all related pages are then passed to the PageClean system; and the result is displayed in the browser. PageClean is the key part of this thesis, and its algorithm and implementation rules are discussed. Finally, the thesis discusses the test method for PageClean and concludes that the PageClean system reaches the expected target. |
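
As a rough illustration of the DOM-based segmentation idea summarized above, the following Python sketch parses a page into a tag tree, treats the innermost block-level elements as segments, and discards segments that consist mostly of link text. It is only a minimal sketch under stated assumptions: the abstract does not give the actual PageClean rules, so the set of block tags, the link-density threshold, and every name in the code (`TreeBuilder`, `leaf_blocks`, `clean_page`, and so on) are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Assumed segment boundaries and thresholds; not taken from the thesis.
BLOCK_TAGS = {"div", "table", "tr", "td", "p", "ul", "ol"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # elements without a closing tag
SKIP_TAGS = {"script", "style"}                           # never counted as page text


class Node:
    """One element of the tag tree; children are Node objects or text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree from HTML using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def measure(node, in_link=False):
    """Return (total_chars, link_chars) for the subtree rooted at node."""
    if node.tag in SKIP_TAGS:
        return 0, 0
    in_link = in_link or node.tag == "a"
    total = link = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
            link += len(child) if in_link else 0
        else:
            sub_total, sub_link = measure(child, in_link)
            total += sub_total
            link += sub_link
    return total, link


def has_block_descendant(node):
    return any(
        isinstance(c, Node) and (c.tag in BLOCK_TAGS or has_block_descendant(c))
        for c in node.children
    )


def leaf_blocks(node):
    """Yield block-level elements that contain no further block-level elements."""
    for child in node.children:
        if isinstance(child, str):
            continue
        if child.tag in BLOCK_TAGS and not has_block_descendant(child):
            yield child
        else:
            yield from leaf_blocks(child)


def plain_text(node):
    """Concatenate the text of a subtree in document order."""
    if node.tag in SKIP_TAGS:
        return ""
    parts = [c if isinstance(c, str) else plain_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def clean_page(html, max_link_ratio=0.5):
    """Return the text of segments whose link density is at most max_link_ratio."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    kept = []
    for block in leaf_blocks(builder.root):
        total, link = measure(block)
        if total > 0 and link / total <= max_link_ratio:
            kept.append(plain_text(block))
    return kept


SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<div id="main"><p>Search engines index a very large number of pages.</p>
<p>Noise such as menus lowers precision. <a href="/more">More</a></p></div>
<div id="footer"><a href="/terms">Terms</a> | <a href="/privacy">Privacy</a></div>
</body></html>
"""

if __name__ == "__main__":
    for segment in clean_page(SAMPLE):
        print(segment)
```

In this sketch the navigation bar and footer are dropped because nearly all of their text sits inside links, while the two content paragraphs are kept. Link density is only one possible noise signal; a production system would likely combine several rules and would need more robust handling of malformed HTML than this simple tree builder provides.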