Performance Optimization Of A Massive Data Query And Analysis System On Hadoop

Posted on: 2016-07-23; Degree: Master; Type: Thesis
Country: China; Candidate: Z Y Wang; Full Text: PDF
GTID: 2298330467992004; Subject: Communication and Information System
Abstract/Summary:
Today is an era of massive data, generated in almost every sector, including the electricity, telecommunications and financial industries. It is estimated that each person worldwide generates at least 200 GB of data per year, and that the global amount of data will reach 40 ZB in 2020. This leaves a problem to be solved: how to analyze and mine the data with agility, discover the underlying knowledge and uncover its business opportunities.

As a big data analysis tool, Hadoop plays an increasingly important role in daily life. Hive, the data warehousing software of the Hadoop ecosystem, is becoming increasingly prominent in data analysis and mining. As a general-purpose storage warehouse, however, Hive is inefficient for massive data queries: the query time of a non-customized Hive deployment grows exponentially as the amount of data increases. An optimized Hive can therefore improve efficiency, shorten query time and greatly reduce the data storage space.

The innovation of this dissertation lies in the following two aspects.

First, this dissertation proposes an optimized data storage strategy based on Hive log analysis. Query logs can be used to analyze a user's daily habits, and optimizing the Hive system according to those habits makes the optimization work more purposeful and targeted. Storage is optimized in terms of data partitioning, data storage format, removal of redundant data tables and fields, and correction of field types.

Second, the dissertation presents an improved RCFile for Hive 0.9. It is common for a Hive query to select only a few fields from a big table with hundreds of fields. Building on the advantages of RCFile's column-wise data storage and combining it with the column-wise compression of the Hive 0.12 ORCFile, this dissertation presents a new storage format based on RCFile, called the Morcfile storage format, which improves the efficiency of Hive queries that select only a few fields.

Finally, using the methods above, we modified the Hive warehouse of a domestic financial institution. In testing, the optimized system achieved significant improvements in query efficiency and disk space utilization.
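The benefit of column-wise storage with per-column compression for queries that touch only a few fields can be illustrated with a small toy sketch. This is not the Morcfile format and does not reproduce RCFile or ORCFile internals. It assumes a synthetic in-memory table, uses `zlib` as a stand-in for the compression codec, and all function names are hypothetical.

```python
import zlib


def make_rows(n_rows, n_cols):
    # Synthetic table (assumption): values repeat within each column.
    return [[f"c{c}_v{r % 10}" for c in range(n_cols)] for r in range(n_rows)]


def row_layout(rows):
    # Row-oriented layout: a single compressed blob holding every field.
    text = "\n".join(",".join(row) for row in rows)
    return zlib.compress(text.encode("utf-8"))


def column_layout(rows):
    # Column-oriented layout: each column is compressed independently.
    n_cols = len(rows[0])
    return [
        zlib.compress("\n".join(row[c] for row in rows).encode("utf-8"))
        for c in range(n_cols)
    ]


def bytes_read_row(blob):
    # Any query must read the whole blob, whichever fields it selects.
    return len(blob)


def bytes_read_columns(blobs, wanted):
    # Only the blobs of the selected columns are read.
    return sum(len(blobs[c]) for c in wanted)


def read_columns(blobs, wanted):
    # Decompress just the selected columns back into lists of values.
    return {
        c: zlib.decompress(blobs[c]).decode("utf-8").split("\n")
        for c in wanted
    }


if __name__ == "__main__":
    rows = make_rows(1000, 100)
    wanted = [3, 7]  # a query selecting 2 of 100 fields
    print("row layout bytes read:   ", bytes_read_row(row_layout(rows)))
    print("column layout bytes read:", bytes_read_columns(column_layout(rows), wanted))
```

In this sketch the column layout reads only the compressed bytes of the selected fields, whereas the row layout must read everything. That is the general effect the abstract attributes to column-wise storage and compression. The actual gains reported in the dissertation depend on its own format and test data.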
Keywords/Search Tags: Hive, massive data, stored in columns, storage optimization, Morcfile