Social Network Data Acquisition Technology And Implementation

Posted on:2012-11-20

Degree:Master

Type:Thesis

Country:China

Candidate:Y N Hu

Full Text:PDF

GTID:2218330362951673

Subject:Computer technology

Abstract/Summary:

With the rapid development of the Internet and information technology, the information available on the Web has become unprecedentedly rich. The Web can be divided into the Surface Web and the Deep Web according to the depth of its information. Deep Web data resources include dynamic pages, which are generated through database query interfaces, and proprietary network information, which can be viewed only after logging in. Search engines address the need for information retrieval to some extent, but traditional search engines cannot index these Deep Web pages. Today, rapidly emerging social networking services (SNS) are attracting large numbers of active Internet users, and their information resources are abundant and of high value. This thesis analyzes a framework for acquiring data from social networks and designs crawlers for Twitter, Facebook and Renren. It also completes the design and implementation of crawler management and data visualization. The specific studies are as follows:

1. The design of the Deep Web crawler framework and its modules is studied. The Deep Web includes searchable databases and proprietary networks. For searchable databases, the crawler first finds the data source interface, then queries the interface, and finally combines the results. For proprietary networks, the crawler must first obtain permission to access the site, then crawl and analyze the pages, and finally combine the results as before.

2. Crawlers for Twitter, Facebook and Renren are designed and implemented. The Twitter crawler first obtains an Access Token through OAuth authorization and then acquires Twitter data incrementally through the Twitter API. The Facebook crawler first uses HtmlUnit to log in and obtain an Access Token, then calls the Facebook Graph API to acquire the user's news feed incrementally, and finally parses the returned JSON data and converts it to a uniform format. The Renren crawler first uses HtmlUnit's WebClient to log in and save the cookie, then uses the WebClient to crawl users' pages incrementally and parses the statuses and notes. Functional tests and large-scale performance tests show that these crawlers meet the needs of practical work, with good stability and adaptability.

3. The crawler management system is studied. A console is designed and a daemon is deployed on each crawl machine; they communicate with each other to achieve task allocation and load balancing. The daemon monitors the node's status and parses the data of the common crawlers. Experiments and analysis show that the crawler management system executes instructions completely and accurately and performs well in communication and data handling in large-scale cases.

4. The use of Flash's ActionScript 2.0 language for data visualization is studied. A dynamic pie chart is implemented to display the acquired data.
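The incremental acquisition and format unification described in item 2 can be illustrated with a minimal sketch. It is not the thesis implementation, which is built on HtmlUnit and the sites' own APIs; Python is used here only for brevity. The names `normalize`, `crawl_incrementally`, `fake_fetch`, and the record fields `id`, `text` and `message` are all hypothetical, and the fetch function is an in-memory stub standing in for an authenticated API or page request.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Post = Dict[str, Any]


def normalize(source: str, raw: Post) -> Post:
    """Map a source-specific record onto one shared layout (assumed fields)."""
    return {
        "source": source,
        "id": int(raw["id"]),
        # Sites may name the body field differently; both names are assumptions.
        "text": str(raw.get("text") or raw.get("message") or ""),
    }


def crawl_incrementally(
    source: str,
    fetch: Callable[[Optional[int]], List[Post]],
    since_id: Optional[int] = None,
) -> Tuple[List[Post], Optional[int]]:
    """Fetch only records newer than since_id and return the new high-water mark."""
    posts = [normalize(source, raw) for raw in fetch(since_id)]
    if posts:
        since_id = max(p["id"] for p in posts)
    return posts, since_id


# In-memory stand-in for a remote feed (hypothetical data).
feed: List[Post] = [{"id": 1, "text": "first"}, {"id": 2, "message": "second"}]


def fake_fetch(since_id: Optional[int]) -> List[Post]:
    """Return every record whose id is greater than since_id."""
    return [p for p in feed if since_id is None or p["id"] > since_id]


first_batch, mark = crawl_incrementally("demo", fake_fetch)
feed.append({"id": 3, "text": "third"})  # new content appears between runs
second_batch, mark = crawl_incrementally("demo", fake_fetch, mark)
print(len(first_batch), len(second_batch), mark)  # prints "2 1 3"
```

The second run returns only the record added after the first run, because the crawler carries the highest identifier it has seen forward as its starting point. A real crawler would also need to persist that mark between runs and handle paging and rate limits, which this sketch leaves out.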

Keywords/Search Tags:

Deep Web, social network, crawler, HtmlUnit, Access Token

Related items

1	Design And Implementation Of An Ajax Supported Deep Web Crawler System
2	Research And Optimization Of Dynamic Web Crawler Based On Webmagic
3	A Token-passing Algorithm Improved Wireless Token Transfer Protocol
4	Design And Implementation Of Social Network Information Crawler
5	Wireless Subnet Token Protocol-WSTP
6	Personalized News Recommendations In Social Networks
7	Dynamic Clusterhead Token Transfer In Hybrid PMP/Mesh Networks
8	MODELING AND VERIFICATION OF LOCAL COMPUTER NETWORK DATA LINK LAYER PROTOCOLS (CARRIER-SENSE, MULTIPLE-ACCESS, TOKEN-PASSING BUS RING)
9	Design And Implementation Of A Web Crawler Based On Deep Web Deep Data Acquisition
10	The Design And Implementation Of The Deep Web Acquisition System