| With the rapid development of the Internet, web information is growing without control, and efficient access to it has become an urgent requirement. However, the mainstream ways of accessing information share common drawbacks: they are inflexible, lack timeliness, and deliver scattered, unintegrated results. To improve the efficiency of web information acquisition, a customized web information crawling and pushing solution is proposed. For any webpage, users first specify the content they are interested in, and the server then pushes updates to that content to them. To locate useful web information accurately, the positioning of web content is studied, and an interactive operation mode is proposed. To crawl dynamic pages efficiently, key techniques for server-side webpage rendering are studied, and an efficient solution is proposed. The main contributions are as follows: 1. An interactive operation mode is proposed to help users convert their interests in a webpage into computer-readable rules. To locate users’ interests on the server side, a webpage positioning method based on XPath is designed, and automatic XPath generation is implemented. 2. A scalable solution for dynamic page crawling based on cloud computing is proposed. To crawl dynamic pages efficiently, a task queue and distributed processing are deployed to improve the concurrency of WebKit, a cache mechanism is deployed to improve the efficiency of webpage rendering, and a policy that dynamically adjusts the number of servers according to the task queue is deployed to improve the utilization of hardware resources. 3. A prototype system is implemented based on the aforementioned solution, and experiments are conducted. The results show that the solution is effective and practical. |
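
To make the idea of automatic XPath generation in the first contribution more concrete, the following is a minimal sketch and not the system's actual implementation. It assumes the page is already available as a well-formed tree and that the user has selected one element. The function name `generate_xpath`, the sample markup in `PAGE`, and the choice of a purely positional path are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def generate_xpath(root, target):
    """Build a positional path from root to target (hypothetical helper).

    Each step has the form tag[n], where n is the 1-based position of the
    element among its siblings that share the same tag.
    """
    if target is root:
        return "."
    # ElementTree elements keep no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]  # raises KeyError if target is not under root
        same_tag = [sib for sib in parent if sib.tag == node.tag]
        index = next(i for i, sib in enumerate(same_tag, start=1) if sib is node)
        steps.append(f"{node.tag}[{index}]")
        node = parent
    return "./" + "/".join(reversed(steps))


# Assumed sample page; a real page would first need parsing into a tree.
PAGE = (
    "<html><body>"
    "<div id='nav'><a>Home</a></div>"
    "<div id='news'><p>Old headline</p><p>Latest headline</p></div>"
    "</body></html>"
)

root = ET.fromstring(PAGE)
target = root.findall(".//p")[1]      # stands in for the element the user clicked
rule = generate_xpath(root, target)   # the stored, computer-readable rule
print(rule)
print(root.find(rule).text)           # re-locating the content from the rule
```

A purely positional path like this is simple to generate but breaks when the page layout changes. A more robust rule might anchor on stable attributes such as `id`, which is a design choice the abstract does not specify.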