Method And Implementation For Hive-Based Offline Data Processing

Posted on: 2017-07-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Zhu
Full Text: PDF
GTID: 2348330491464429
Subject: Software engineering
Abstract/Summary:
The rapid growth of offline data and business volume imposes heavy overheads and long web page query waiting times on traditional database technology and on simple Hadoop-based distributed computing methods, seriously affecting user experience. In this thesis, an offline data processing method based on Hadoop and Hive is proposed. Java timing tasks are used to start jobs. Taking into account the real-time requirements of different jobs, their running times are distributed across time periods to balance system performance. Each offline data processing procedure is regarded as a job, and every job is divided into several tasks. Jobs are triggered by Java timing tasks according to related information such as identifiers, start times, and cycle intervals. The timing tasks start jobs according to the query result obtained each minute, and jobs of different types then begin to execute. A multi-dimension computing method is developed for complex statistical reporting jobs. Task templates are extracted from similar job executions to improve reusability. The proposed methods are applied to an API Open Platform. Results show that the method reduces the space consumed by redundant offline data and improves the protection of consumers' rights by predicting user fraud. In addition, the methods greatly reduce the time cost of report queries by splitting the report data into multi-dimension statistics. User experience is improved by reducing the waiting time of web page queries.
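The minute-by-minute trigger described in the abstract can be illustrated with a minimal Java sketch. It is not the thesis's implementation: the `Job` class, its fields, the `dueJobs` method, and the sample job names are all hypothetical, and the in-memory list stands in for whatever store the real system queries. It assumes a job is due when the whole minutes elapsed since its start time are a multiple of its cycle interval.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

public class JobTriggerSketch {

    /** Hypothetical job record: identifier, first start time, cycle interval in minutes. */
    static final class Job {
        final String id;
        final LocalDateTime startTime;
        final long cycleMinutes;

        Job(String id, LocalDateTime startTime, long cycleMinutes) {
            this.id = id;
            this.startTime = startTime;
            this.cycleMinutes = cycleMinutes;
        }
    }

    /** Returns the ids of the jobs whose schedule falls on the minute containing 'now'. */
    static List<String> dueJobs(List<Job> jobs, LocalDateTime now) {
        LocalDateTime minute = now.truncatedTo(ChronoUnit.MINUTES);
        List<String> due = new ArrayList<>();
        for (Job job : jobs) {
            LocalDateTime start = job.startTime.truncatedTo(ChronoUnit.MINUTES);
            if (minute.isBefore(start)) {
                continue; // the job has not reached its first start time yet
            }
            long elapsed = Duration.between(start, minute).toMinutes();
            if (job.cycleMinutes <= 0) {
                // Assumption: a non-positive interval means "run once, at the start time".
                if (elapsed == 0) {
                    due.add(job.id);
                }
            } else if (elapsed % job.cycleMinutes == 0) {
                due.add(job.id);
            }
        }
        return due;
    }

    public static void main(String[] args) {
        List<Job> jobs = new ArrayList<>();
        jobs.add(new Job("daily-report", LocalDateTime.of(2017, 1, 1, 2, 0), 1440));
        jobs.add(new Job("hourly-cleanup", LocalDateTime.of(2017, 1, 1, 0, 30), 60));

        // One polling tick; only the daily job lines up with 02:00 on the next day.
        System.out.println(dueJobs(jobs, LocalDateTime.of(2017, 1, 2, 2, 0))); // prints [daily-report]
    }
}
```

In a running service, a check like this could be driven once per minute, for example with `ScheduledExecutorService.scheduleAtFixedRate`, and each returned id would then be handed to whatever component launches the job's tasks.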
Keywords/Search Tags: Off-line Data Processing, Task Templates, Hive, Distributed Framework, Timing Task