Fast Retrieval And Query In Massive Data Environment

Posted on: 2018-09-12
Degree: Master
Type: Thesis
Country: China
Candidate: S R Lin
Full Text: PDF
GTID: 2348330518995317
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of the Internet, applications and technologies in the big data field are emerging one after another. Companies face many challenges in this area: they need to fully understand and use new technologies to improve their competitiveness and maximize the value of their massive data. Among the many data analysis needs of enterprises, fast retrieval and query (so-called interactive analysis) is an increasingly important type of analysis. It gives enterprises efficient analysis, shortening the decision-making cycle and saving time. As technologies for this type of analysis continue to evolve, relatively mature solutions are starting to appear in the industry, alongside emerging technologies that approach big data analysis from different perspectives and perform better for specific types of analysis. This thesis studies and implements solutions in two areas: fast query for structured data and fast query for semi-structured/unstructured data. The former covers SQL-on-Hadoop and the MOLAP engine Kylin, while the latter covers a query system based on an HBase secondary index and a query system based on HDFS random access. For each of the four technical directions, this thesis describes the implementation principles, analyzes performance, applies some optimizations, and shows their effect. The results show that Cloudera's Impala performs best among the mainstream SQL-on-Hadoop solutions, followed by SparkSQL. Using a suitable compression format (Snappy) and columnar storage format (Parquet) can greatly improve query efficiency. In addition, setting buffers can also improve the performance of the two query systems.
For the MOLAP engine Kylin, because it uses offline computation, trading data latency for query speed, it performs better on decision-support queries than query systems such as SQL-on-Hadoop that use online computation. In addition, the query systems based on the HBase secondary index and on HDFS random access provide fast queries over entire rows of data; because they provide an index, they are superior to SQL-on-Hadoop and Kylin when querying multiple rows (including all columns) that match given conditions. The main difference between the two is that the query system based on the HBase secondary index has lower data latency, but the queried data must be stored in HBase, which results in unnecessary data redundancy. The query system based on HDFS random access involves greater data latency, but because the data does not need to be stored in HBase, it saves a lot of storage space.
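As a rough illustration of why an index helps when fetching whole rows by a column condition, the following minimal Python sketch keeps a primary table keyed by row key and a separate index mapping a column value to row keys. It is an in-memory toy under stated assumptions, not the thesis's implementation and not the HBase API: the class `IndexedTable`, its methods, and the `city` column are hypothetical names chosen for this example, and every row is assumed to contain the indexed column.

```python
from collections import defaultdict


class IndexedTable:
    """Toy in-memory row store with one secondary index (hypothetical, not HBase)."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                  # row key -> full row (dict of columns)
        self.index = defaultdict(set)   # indexed column value -> set of row keys

    def put(self, row_key, row):
        # If the row is being overwritten, drop its stale index entry first.
        old = self.rows.get(row_key)
        if old is not None:
            self.index[old[self.indexed_column]].discard(row_key)
        self.rows[row_key] = row
        self.index[row[self.indexed_column]].add(row_key)

    def query_by_scan(self, value):
        # Without an index: inspect every row in the table.
        return [r for r in self.rows.values() if r[self.indexed_column] == value]

    def query_by_index(self, value):
        # With the index: look up matching row keys, then fetch only those rows.
        return [self.rows[k] for k in sorted(self.index.get(value, ()))]


table = IndexedTable("city")
table.put("r1", {"user": "a", "city": "Beijing"})
table.put("r2", {"user": "b", "city": "Shanghai"})
table.put("r3", {"user": "c", "city": "Beijing"})
print(table.query_by_index("Beijing"))
```

The index costs extra storage and extra work on every write, which mirrors the redundancy trade-off described above; in exchange, a conditional query touches only the matching rows instead of scanning the whole table.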
Keywords/Search Tags: interactive-analysis, SQL-on-Hadoop, HBase, performance tuning, OLAP