Research On Customized Web Information Crawling And Pushing Techniques

Posted on:2017-11-21

Degree:Master

Type:Thesis

Country:China

Candidate:X S Wu

Full Text:PDF

GTID:2348330491963013

Subject:Computer technology

Abstract/Summary:

With the rapid development of the Internet, the volume of web information is growing without limit, and efficient access to it has become an urgent requirement. However, the mainstream ways of accessing information share common drawbacks: they are inflexible, not timely, and scattered without integration. To improve the efficiency of web information acquisition, a customized web information crawling and pushing solution is proposed. For any webpage, users first specify the content on the page that interests them; the server then pushes updates to that content to them.

To locate useful web information accurately, the positioning of web content is studied, and an interactive operation mode is proposed. To crawl dynamic pages efficiently, key techniques for server-side webpage rendering are studied, and an efficient solution is proposed. The main contributions are as follows:

1. An interactive operation mode is proposed to help users convert their webpage interests into computer-readable rules. To locate users' interests on the server side, a webpage positioning method based on XPath is designed, and automatic XPath generation is implemented.
2. A scalable solution for dynamic page crawling based on cloud computing is proposed. To crawl dynamic pages efficiently, a task queue and distributed processing are deployed to improve the concurrency of WebKit, a cache mechanism is deployed to improve the efficiency of webpage rendering, and a policy that dynamically adjusts the number of servers based on the task queue is deployed to improve the utilization of hardware resources.
3. A prototype system is implemented based on the aforementioned solution, and experiments are conducted. The results show that the solution is effective and practical.
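The XPath-based positioning mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation, which the abstract does not describe in detail. It assumes the page is well-formed XML (real HTML would first need a tolerant parser), and the helper names `generate_xpath` and `resolve_xpath`, as well as the sample pages, are hypothetical. The idea shown is only this: derive a positional XPath for an element the user selected, store it as the rule, and re-apply it on the server to a later version of the page.

```python
import xml.etree.ElementTree as ET


def generate_xpath(root, target):
    """Build a positional XPath such as /html/body[1]/div[2]/p[2] for target.

    Hypothetical helper: walks from the selected element up to the root and
    records, at each level, the element's 1-based index among same-tag siblings.
    """
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parents[node]  # raises KeyError if target is not under root
        same_tag = [c for c in parent if c.tag == node.tag]
        index = next(i for i, c in enumerate(same_tag) if c is node) + 1
        steps.append(f"{node.tag}[{index}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


def resolve_xpath(root, xpath):
    """Apply a rule produced by generate_xpath to a (possibly newer) page."""
    parts = xpath.lstrip("/").split("/")
    if parts[0] != root.tag:
        return None
    if len(parts) == 1:
        return root
    # ElementTree only accepts relative paths on elements, so drop the root step.
    return root.find("./" + "/".join(parts[1:]))


# Assumed sample page: the user marks the third <p> as their interest.
PAGE = (
    "<html><body>"
    "<div><p>menu</p></div>"
    "<div><p>headline</p><p>price: 42</p></div>"
    "</body></html>"
)
root = ET.fromstring(PAGE)
target = root.findall(".//p")[2]
xpath = generate_xpath(root, target)

# Later, the server re-fetches the page and applies the stored rule.
UPDATED_PAGE = PAGE.replace("price: 42", "price: 40")
updated_root = ET.fromstring(UPDATED_PAGE)
old_text = resolve_xpath(root, xpath).text
new_text = resolve_xpath(updated_root, xpath).text
if new_text != old_text:
    print(f"{xpath}: {old_text!r} -> {new_text!r}")
```

A purely positional path like this breaks when the page layout changes, so a practical rule generator would likely prefer more stable anchors such as `id` attributes where they exist; that refinement is outside this sketch.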

Keywords/Search Tags:

XPath, WebKit, Dynamic Page Crawling, Task Queue, Cache

Related items

1	Vertical Search Engine For Crawling The Web Page Design And Implementation
2	Research On Network Reptiles In Distributed Parallel Environment
3	A Study Of Hybrid Cache Management Mechanism Based On Page Classifier And Page Placer
4	Research On Algorithm Of Crawling Ajax Dynamic Web Pages Based On User Interface State Changes
5	Key Technology Research On Web Forums Crawling And Hot Topic Detection
6	Research And Implementation Of Distributed Internet Information Crawling System For Cyber Security
7	The Improvement Android Webkit-Based Browser Component
8	The Design And Implementation Of Embedded Browser Cache
9	Research On Energy Optimization For Multiprocessor SoC With Task Scheduling And Cache Partitioning
10	Research On Generation Algorithm Of XPath Locator Based On Web Page Element Subject Recognition