Development of an RSS Content Ingestion and Layout System Based on Hadoop

Posted on: 2016-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: L Li
GTID: 2348330503994297
Subject: Software engineering
Abstract/Summary:
In the past few years, Internet and cloud computing technologies have greatly promoted the development of society. Their impact on the traditional publishing industry has also created an opportunity. Producing traditional newspapers, magazines and books is mostly a manual process: cost is high, speed is slow, style is rigid, and choices are few. Against this background, this thesis develops an RSS content capture and typesetting system based on Hadoop. The system automatically fetches RSS feed content, parses the page content, generates publications for subscribed editions, and sends them to network-connected printers.

First, we analyzed the functional and non-functional requirements of the RSS content capture and layout system and designed its architecture. The system uses a distributed architecture and is divided into five subsystems:
(1) The Portal subsystem is an open platform where content publishers publish content and register editions, and where consumers subscribe to editions.
(2) The Job Server subsystem performs task scheduling.
(3) The WebKit Cluster subsystem is responsible for web crawling.
(4) The Algorithm Cluster subsystem extracts the content of web pages.
(5) The Layout Engine Cluster subsystem lays out content automatically to generate optimal publications.

Then, the two core subsystems, Job Server and Layout Engine Cluster, were designed and implemented in detail, and their core processes and multi-layer structure are described. The following key technologies were researched and implemented: dynamic task scheduling based on an efficient concurrency framework; flexible and efficient data storage using ActiveMQ, Redis, MongoDB and HDFS; and generation of high-quality publications with an automatic layout algorithm.

Finally, the system passed configuration testing, fault testing, functional testing and performance testing. The test results show that the system can support 10,000 online users and achieves the desired objectives. The system has gone live and is deployed in the Hewlett Packard Personal Print Services Division. It reduces the publisher's operating costs and delivers publications to customers faster than before: production time has dropped from 2 hours to 2 minutes, and after nearly 10 months in production the average cost of a publication has dropped from $12.50 to $4.31.
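As a rough illustration of the flow the abstract describes (the Job Server dispatching jobs that pass through crawling, content extraction and layout), the following minimal Python sketch may help. It is not the thesis's implementation: the `Job` class, the function names, the example feed URLs and the use of a thread pool are all assumptions made for illustration, and each stage is a placeholder that does no real crawling, extraction or typesetting.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Job:
    edition_id: str  # hypothetical identifier of a subscribed edition
    feed_url: str    # hypothetical RSS feed to ingest for that edition


def crawl(job: Job) -> str:
    """Placeholder for the WebKit Cluster: pretend to fetch a page."""
    return f"<html><body>Article from {job.feed_url}</body></html>"


def extract(html: str) -> str:
    """Placeholder for the Algorithm Cluster: strip the dummy markup."""
    return html.replace("<html><body>", "").replace("</body></html>", "")


def layout(edition_id: str, text: str) -> dict:
    """Placeholder for the Layout Engine Cluster: wrap text as one page."""
    return {"edition": edition_id, "pages": [text]}


def run_job(job: Job) -> dict:
    # One job flows through crawl -> extract -> layout.
    return layout(job.edition_id, extract(crawl(job)))


def schedule(jobs: list[Job], workers: int = 4) -> list[dict]:
    # Placeholder for the Job Server: run jobs concurrently on a thread pool.
    # pool.map() yields results in the same order as the input jobs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_job, jobs))


jobs = [
    Job("daily-tech", "https://example.com/tech.rss"),
    Job("weekly-sport", "https://example.com/sport.rss"),
]
publications = schedule(jobs)
```

In the system described above, the stages run on separate clusters and exchange data through ActiveMQ, Redis, MongoDB and HDFS; the in-process thread pool here only stands in for that distribution.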
Keywords/Search Tags: Hadoop, content extraction, automatic layout, unified task scheduling