
Design And Implementation Of Forum Data Analysis Platform Based On SPARK

Posted on: 2018-09-13
Degree: Master
Type: Thesis
Country: China
Candidate: H H Wang
GTID: 2348330518994421
Subject: Computer technology
Abstract/Summary:
As the internet has grown more popular and forums have gained users, people entering a forum are often overwhelmed by the variety of boards and the sheer number of thread titles. Because users' time and energy are limited, many forums have introduced features such as second-level boards and popular-post lists to save users time and provide simple recommendations. These features still have weaknesses: some posts do not appear where they belong, the ranking strategy is too simple, worthwhile posts are overlooked because they sit in unpopular boards, and so on. Moreover, a single forum often cannot satisfy a user, who then switches back and forth among several forums, which costs further time.

To help forum users obtain valuable information from one or several forums, this thesis designs and implements a system that gathers information from multiple forums and analyses it along multiple dimensions. To cover all the required functions while keeping the development logic clear, the system is divided into two parts: a data analysis platform, which handles the data analysis tasks, and a data presentation web application, which interacts with system users. This thesis focuses on the former.

The thesis is organized as follows. First, it gives an overview of the overall functionality of the platform. Second, it introduces the multi-module design and the relevant technical background. Third, it describes the implementation details of each module: 1) collecting data by implementing web spiders; 2) using HDFS and Spark as the big-data storage and computing framework for forum data analysis, including topic extraction from posts with a TDT (topic detection and tracking) strategy and optimization of Spark computation tasks with a two-stage aggregation strategy; 3) processing external query requests in parallel with the Actor message model. Finally, it describes the system deployment and test results.
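
The abstract does not spell out how the two-stage aggregation works. A common reading of the term in Spark tuning is key salting: a random prefix is attached to each key so that a heavily loaded key (for example a very popular board) is split across several partial groups, the partial groups are aggregated first, and the prefix is then removed for a final aggregation. The sketch below illustrates that idea in plain Python under this assumption; it is not the thesis's code. `reduce_by_key` is a hypothetical stand-in for Spark's `reduceByKey`, and the names `two_stage_count`, `num_salts`, and the sample board names are invented for illustration.

```python
import random


def reduce_by_key(pairs, func):
    """Hypothetical stand-in for Spark's reduceByKey on a list of (key, value) pairs."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return list(acc.items())


def two_stage_count(pairs, num_salts=4, seed=0):
    """Sum values per key in two stages using random salts (assumed strategy)."""
    rng = random.Random(seed)
    # Stage 1: prefix each key with a random salt so that one hot key is
    # spread over up to num_salts partial groups, then aggregate per salted key.
    salted = [((rng.randrange(num_salts), key), value) for key, value in pairs]
    partial = reduce_by_key(salted, lambda a, b: a + b)
    # Stage 2: strip the salt and combine the (few) partial results per key.
    unsalted = [(key, value) for (_salt, key), value in partial]
    return dict(reduce_by_key(unsalted, lambda a, b: a + b))


# Skewed sample data: one board receives far more posts than the others.
posts = [("tech", 1)] * 1000 + [("travel", 1)] * 3 + [("music", 1)] * 2
counts = two_stage_count(posts)
print(counts)
```

In a real Spark job the first stage spreads the work for the hot key over several tasks instead of one, at the cost of a second, much smaller shuffle. The approach only applies to aggregations that can be combined in any order, such as sums and counts.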
Keywords/Search Tags: multi-forum data integration, web spider, Spark distributed computing framework, topic extraction of posts, Actor message model