Social Network Data Acquisition Technology And Implementation

Posted on: 2012-11-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y N Hu
GTID: 2218330362951673
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet and information technology, the information available on the Web has become unprecedentedly rich. By the depth of its information, the Web can be divided into the Surface Web and the Deep Web. Deep Web resources include dynamic pages, which must be generated through a database query interface, and proprietary network information, which can be viewed only after logging in. Search engines meet the need for information retrieval to some extent, but traditional search engines cannot index these Deep Web pages. Today, rapidly emerging social networking services (SNS) attract a large number of active Internet users, and their information resources are abundant and of high value. This thesis analyzes a framework for acquiring social network data and designs crawlers for Twitter, Facebook and Renren. It also completes the design and implementation of crawler management and data visualization. The specific studies are as follows:

1. The design of the Deep Web crawler framework and its modules is studied. The Deep Web includes searchable databases and proprietary networks. For searchable databases, the crawler first finds the data source interface, then queries that interface, and finally combines the results. For proprietary networks, the crawler must first obtain authorization from the site, then crawl the pages and analyze them, and finally combine the results as before.

2. Crawlers for Twitter, Facebook and Renren are designed and implemented. The Twitter crawler first obtains an Access Token through OAuth authorization and then acquires Twitter data incrementally through the Twitter API. The Facebook crawler first uses HtmlUnit to log in and obtain an Access Token, then calls the Facebook Graph API to acquire the user's news feed incrementally, and finally parses the returned JSON data and normalizes it into a uniform format. The Renren crawler first uses HtmlUnit's WebClient to log in and save the cookie, then uses the WebClient to crawl users' pages incrementally and parses their statuses and notes. Functional tests and large-scale performance tests show that these crawlers meet the needs of practical work, with excellent stability and adaptability.

3. The crawler management system is studied. A console is designed, and a daemon is deployed on each crawl machine; the console and the daemons communicate with each other to achieve task allocation and load balancing. Each daemon monitors its node's status and parses the data of the common crawler. Experiments and analysis show that the crawler management system executes instructions completely and accurately, and that its communication and data handling perform well at large scale.

4. Data visualization with Flash's ActionScript 2.0 language is studied. A dynamic pie chart is implemented to display the acquired data.
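
The incremental acquisition that all three crawlers rely on can be illustrated with a minimal sketch. It is not the thesis implementation, which is built on HtmlUnit and the sites' own APIs; it only assumes that each post carries an increasing integer ID and that the crawler remembers the highest ID it has stored. The names `IncrementalCrawler`, `fetch_newer`, `last_seen_id` and `make_fake_source` are hypothetical, and the in-memory list stands in for a remote service.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Assumed post shape: a dict with at least an integer "id" key.
Post = Dict[str, Any]


@dataclass
class IncrementalCrawler:
    """Stores only posts whose ID is higher than any ID stored so far."""

    # Hypothetical fetch function: given the last stored ID, returns
    # candidate posts in any order (it may return already-seen posts).
    fetch_newer: Callable[[int], List[Post]]
    last_seen_id: int = 0
    store: List[Post] = field(default_factory=list)

    def crawl_once(self) -> int:
        """Run one crawl pass and return the number of new posts stored."""
        candidates = self.fetch_newer(self.last_seen_id)
        # Filter defensively in case the source ignores the cursor.
        new_posts = [p for p in candidates if p["id"] > self.last_seen_id]
        new_posts.sort(key=lambda p: p["id"])
        self.store.extend(new_posts)
        if new_posts:
            # Advance the cursor so the next pass skips these posts.
            self.last_seen_id = new_posts[-1]["id"]
        return len(new_posts)


def make_fake_source(posts: List[Post]) -> Callable[[int], List[Post]]:
    """Stand-in for a remote API, backed by a shared in-memory list."""
    def fetch_newer(since_id: int) -> List[Post]:
        return [p for p in posts if p["id"] > since_id]
    return fetch_newer


if __name__ == "__main__":
    remote = [{"id": 2, "text": "second"}, {"id": 1, "text": "first"}]
    crawler = IncrementalCrawler(make_fake_source(remote))
    print("first pass stored:", crawler.crawl_once())
    remote.append({"id": 5, "text": "later post"})
    print("second pass stored:", crawler.crawl_once())
    print("stored IDs:", [p["id"] for p in crawler.store])
```

Keeping the cursor on the crawler side means a repeated pass transfers and stores nothing that was already collected, which is the property an incremental crawl needs whether the data comes from an API call or from a parsed page.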
Keywords/Search Tags:Deep Web, social network, crawler, HtmlUnit, Access Token