Design And Implementation Of HDFS-Based Microblog Data Management System

Posted on:2015-05-21

Degree:Master

Type:Thesis

Country:China

Candidate:J Xia

Full Text:PDF

GTID:2308330473950764

Subject:Software engineering

Abstract/Summary:

With the development of Web 2.0 technologies, microblogging has entered the public eye as a new medium. Its concise content, fast propagation, and wide reach have won the attention and affection of more and more people, and it has become one of the typical Internet applications. As microblogging has grown rapidly, users' demand for information has risen sharply. Faced with a flood of microblog posts every day, finding the needed information quickly and accurately is a serious problem.

Based on an analysis of the characteristics of microblogs, this thesis combines the Hadoop distributed systems framework with the Lucene full-text search engine to design and implement a microblog data management system. First, the system requirements are analyzed. Then a modular approach is used for the overall design of the system. The thesis details the design and implementation of the modules for microblog data capture, data preprocessing, distributed data storage, inverted indexing, data ranking, and data retrieval. The system's functions are then evaluated through testing. Finally, the work is summarized and the shortcomings of the system are pointed out.

The main functions of the system are microblog data capture, preprocessing, storage, indexing, ranking, and retrieval. Three key technologies are used to achieve these functions. The first is the microblog crawler, which uses the API provided by the Sina Weibo open platform to crawl data. The second is distributed storage: microblog data is stored in the HDFS distributed file system. HDFS provides high-throughput data access and is well suited to applications with large data sets; it relaxes some POSIX constraints to allow streaming access to file system data. HDFS is part of Apache Hadoop, a top-level Apache project. The third is the ranking algorithm: drawing on the ideas behind PageRank, a ranking algorithm for microblog data is proposed.

The system crawls microblog data through the Sina Weibo open platform API, preprocesses it, and stores it in HDFS. Using the MapReduce programming model together with the indexing engine provided by Lucene, it builds an inverted index over the microblog data. Based on the characteristics of microblog data and the query engine provided by Lucene, a microblog ranking algorithm is designed and implemented, which greatly improves the efficiency of microblog data retrieval. Faced with massive amounts of microblog information, users can find the information they need more quickly and accurately.
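
The inverted-index step mentioned above can be illustrated with a minimal sketch. It is not the thesis implementation, which runs on Hadoop MapReduce with Lucene's indexing engine. It only mimics the map, shuffle, and reduce phases in plain Python on a few in-memory posts. The function names, the sample posts, and the whitespace-style tokenizer are all assumptions made for illustration. Real microblog text, especially Chinese, would need a proper analyzer.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-word characters (stand-in for a real analyzer)."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def map_phase(post_id, text):
    """Map: emit one (term, post_id) pair per distinct term in a post."""
    for term in sorted(set(tokenize(text))):
        yield term, post_id


def shuffle(pairs):
    """Shuffle: group emitted post IDs by term, as the framework does between phases."""
    grouped = defaultdict(list)
    for term, post_id in pairs:
        grouped[term].append(post_id)
    return grouped


def reduce_phase(term, post_ids):
    """Reduce: produce the sorted, de-duplicated posting list for one term."""
    return term, sorted(set(post_ids))


def build_inverted_index(posts):
    """Run map, shuffle, and reduce over a dict of {post_id: text}."""
    pairs = (pair for post_id, text in posts.items()
             for pair in map_phase(post_id, text))
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


def search(index, query):
    """Return the IDs of posts that contain every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample posts, used only to exercise the sketch.
posts = {
    1: "Hadoop stores large files in HDFS",
    2: "Lucene builds an inverted index",
    3: "HDFS and Lucene power the search system",
}
index = build_inverted_index(posts)
```

With this index, `search(index, "HDFS Lucene")` intersects the posting lists of both terms. A ranking step such as the PageRank-inspired one described in the abstract would then order the matching posts, but its details are not given here and are left out of the sketch.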

Keywords/Search Tags:

Microblog, Lucene, Hadoop Distributed File System, MapReduce, Rank

Related items

1	The Research And Analysis Of Hadoop Small File Processing Method
2	The Design And Implementation Of A CBIR System Based On Hadoop And Lucene
3	Design And Realization Of Parallel File IO Based On Hadoop Distributed File System
4	Design And Realization Of Parallel File Io Based On Hadoop Distributed File System
5	Design Of Mapreduce Task Scheduling Algorithms In Heterogeneous Hadoop Cluster
6	Research And Optimization Of Reliability Of Hadoop Distributed File System
7	On Jackrabbit Packing Hadoop And Its Application In Content Management System
8	The Research And Implementation Of Distributed Sentiment Analysis For Chinese Microblog Based On Hadoop
9	Research And Optimization Of Hadoop Small File Processing Technology
10	The Research Of Microblog User Influence Algorithm Based On Hadoop