Design And Implementation Of HDFS-Based Microblog Data Management System

Posted on: 2015-05-21
Degree: Master
Type: Thesis
Country: China
Candidate: J Xia
Full Text: PDF
GTID: 2308330473950764
Subject: Software engineering
Abstract/Summary:
With the development of Web 2.0 technologies, microblogging has entered public view as a new medium. With its concise content, fast propagation, and wide reach, it has attracted growing attention and popularity and has become one of the typical Internet applications. As microblogging has developed rapidly, users' demand for access to information has risen linearly. Faced with a flood of posts every day, finding the needed information quickly and accurately is a serious problem.

Based on an analysis of the characteristics of microblogs, and combining the Hadoop distributed systems framework with the Lucene full-text search engine, this thesis designs and implements a microblog data management system. First, the system requirements are analyzed. Then the overall design is carried out using a modular approach. The design and implementation of the modules for microblog data capture, data preprocessing, distributed data storage, inverted indexing, data ranking, and data retrieval are described in detail. The system's functions are then evaluated through testing. Finally, the work is summarized and the shortcomings of the system are pointed out.

The main functions of the system are microblog data capture, preprocessing, storage, indexing, ranking, and retrieval. To achieve these functions, the system adopts three key technologies. The first is a microblog crawler, which collects data through the API provided by the Sina Weibo open platform. The second is distributed storage: microblog data is stored in the HDFS distributed file system. HDFS provides high-throughput data access and is well suited to applications on large data sets; it relaxes some POSIX constraints to allow streaming access to file system data. HDFS is part of Apache Hadoop, a top-level Apache project. The third is a ranking algorithm: drawing on the idea of PageRank, a ranking algorithm for microblog data is proposed.

The system crawls data through the Sina Weibo open platform API, preprocesses it, and stores it in HDFS. Combining the Lucene full-text search engine with the MapReduce programming model, it uses the indexing engine provided by Lucene to build an inverted index over the microblog data. Based on the characteristics of microblog data and using the query engine provided by Lucene, a microblog ranking algorithm is designed and implemented, which improves the efficiency of microblog data retrieval. Faced with massive amounts of microblog information, users can find the information they need more quickly and accurately.
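The inverted-index step mentioned in the abstract can be illustrated with a small sketch. This is not the thesis's implementation, which uses Lucene and Hadoop MapReduce; it is a plain-Python analogue that only mimics the map and reduce phases in memory. The function names (`map_phase`, `reduce_phase`, `build_index`, `search`), the whitespace-style tokenization, and the sample posts are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit (term, doc_id) pairs for one post, as a mapper would."""
    # Assumption: lowercase word tokens; real microblog text would need
    # a proper analyzer (for example, Chinese word segmentation).
    for term in set(re.findall(r"\w+", text.lower())):
        yield term, doc_id


def reduce_phase(pairs):
    """Group post ids by term, as a reducer would, giving posting lists."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def build_index(posts):
    """Run the map phase over every post, then reduce into one index."""
    pairs = []
    for doc_id, text in posts.items():
        pairs.extend(map_phase(doc_id, text))
    return reduce_phase(pairs)


def search(index, query):
    """Return ids of posts that contain every query term (AND semantics)."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample posts, keyed by post id.
posts = {
    1: "Hadoop stores large files in HDFS",
    2: "Lucene builds an inverted index",
    3: "HDFS and Lucene can work together",
}
index = build_index(posts)
```

Calling `search(index, "hdfs lucene")` looks up each term's posting list and intersects them. In a real deployment the posting lists would be built by distributed mappers and reducers, and results would be ordered by a ranking function rather than by post id.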
Keywords/Search Tags:Microblog, Lucene, Hadoop Distributed File System, MapReduce, Rank