Fast Retrieval And Query In Massive Data Environment

Posted on:2018-09-12

Degree:Master

Type:Thesis

Country:China

Candidate:S R Lin

Full Text:PDF

GTID:2348330518995317

Subject:Information and Communication Engineering

Abstract/Summary:

With the rapid development of the Internet, applications and technologies in the big data field emerge one after another. Companies face many challenges in this area: they need to fully understand and use new technologies to improve their competitiveness, so that the value of their massive data can be maximized. Among the many data analysis needs of enterprises, fast retrieval and query (also called interactive analysis) is an increasingly important type of analysis. It provides enterprises with efficient analysis, shortening the decision-making cycle and saving time. As technologies for this type of analysis continue to evolve, relatively mature solutions are starting to emerge in the industry, alongside newer technologies that approach big data analysis from different perspectives and therefore perform better for specific types of analysis.

This thesis studies and implements solutions in two areas: fast query for structured data and fast query for semi-structured and unstructured data. The former covers SQL-on-Hadoop and the MOLAP engine Kylin, while the latter covers a query system based on an HBase two-level index and a query system based on HDFS random access. For each of the four technical directions, this thesis describes the implementation principles, analyzes performance, applies some optimizations, and shows their effect.

The results show that Cloudera's Impala performs best among the mainstream SQL-on-Hadoop solutions, followed by SparkSQL. Using the Snappy compression format together with the Parquet columnar storage format greatly improves query efficiency. Configuring buffers also improves the performance of the two query systems.

Because the MOLAP engine Kylin uses offline computation, trading data latency for query speed, it performs better on decision-support queries than systems such as SQL-on-Hadoop that compute online. The query systems based on the HBase two-level index and on HDFS random access provide fast retrieval of entire rows; because they provide an index, they outperform SQL-on-Hadoop and Kylin when querying multiple rows (including all columns) that match given conditions. The main difference between the two is that the system based on the HBase two-level index has lower data latency, but the queried data must be stored in HBase, which introduces unnecessary data redundancy. The system based on HDFS random access incurs greater data latency, but because the data does not need to be stored in HBase, it saves a large amount of storage space.
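
The index-plus-random-access idea behind the HDFS-based query system can be pictured with a small sketch. This is an illustration under stated assumptions, not the thesis's implementation: an in-memory `io.BytesIO` buffer stands in for a file on HDFS, the rows are simple comma-separated text, and the function names `build_offset_index` and `fetch_rows` are hypothetical. The sketch stores only byte offsets in the index and seeks to them to return whole rows, which is why no second copy of the data is needed.

```python
import io
from collections import defaultdict


def build_offset_index(data_file, column):
    """Map each value of one column to the byte offsets of the rows holding it.

    Hypothetical helper: the file is assumed to be comma-separated text
    with a header line, which is an assumption made for this sketch.
    """
    index = defaultdict(list)
    data_file.seek(0)
    header = data_file.readline().decode("utf-8").rstrip("\n").split(",")
    col = header.index(column)
    while True:
        offset = data_file.tell()          # byte position where this row starts
        line = data_file.readline()
        if not line:
            break
        fields = line.decode("utf-8").rstrip("\n").split(",")
        index[fields[col]].append(offset)
    return header, dict(index)


def fetch_rows(data_file, header, index, value):
    """Return every full row whose indexed column equals `value`.

    Only the indexed offsets are read; the rest of the file is never scanned.
    """
    rows = []
    for offset in index.get(value, []):
        data_file.seek(offset)             # random access to the row start
        fields = data_file.readline().decode("utf-8").rstrip("\n").split(",")
        rows.append(dict(zip(header, fields)))
    return rows


# In-memory stand-in for a data file (assumption: not a real HDFS client).
data = io.BytesIO(b"id,city,amount\n1,Beijing,10\n2,Shanghai,25\n3,Beijing,7\n")
header, index = build_offset_index(data, "city")
beijing_rows = fetch_rows(data, header, index, "Beijing")
```

Building the index requires one pass over the data, which loosely mirrors the greater data latency noted above, while each lookup touches only the matching rows.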

Keywords/Search Tags:

interactive-analysis, SQL-on-Hadoop, HBase, performance tuning, OLAP

Related items

1	Research And Implementation Of Performance Tuning Method Of A Distributed Storage System Named Hbase
2	Research On Automatic SIMD Vectorization Recognization And Code Tuning Technology
3	Design And Implementation Of Hadoop-Based Network Traffic Analysis System
4	Research On Performance Prediction And Tuning Of Hadoop
5	Research Of Hadoop Configuration Tuning And Job Scheduling Based On Performance Evaluation
6	The Design And Implementation Of Network Traffic Analysis System Based On Hadoop And HBase
7	Key Issues On Hadoop Online Analytical Processing System
8	Scheduler and I/O Based Performance Tuning Approach for Hadoop
9	Design And Implementation Of Network Forensics Analysis System Based On Hadoop
10	Hbase Based Credible Dataware Construction Of Business Quarterly And OLAP Query Analysis