Study On Methods To Real-time Query For Streaming Data

Posted on: 2016-09-10  Degree: Master  Type: Thesis
Country: China  Candidate: W Xu  Full Text: PDF
GTID: 2308330461987518  Subject: Electronic and communication engineering
Abstract/Summary:
Streaming data is data that is produced and processed in real time, grows dynamically, and requires a timely response. Because of its large volume, real-time nature, and other characteristics, streaming data management systems generally store only recent data. Currently, streaming data is mainly stored in databases built on a distributed file system: the Hadoop Distributed File System (HDFS) stores the data at the bottom layer of the system, and an MPP database queries the data at the upper layer. In practice, databases based on HDFS have the following deficiencies. First, HDFS is designed to store bulk data, so storing streaming data in it directly produces a large number of small files; accessing the data then takes so long that the requirement of real-time query is difficult to meet. Second, the databases currently in use need some time to start up because of the delay in launching MapReduce. Third, when a query in an application must combine data warehouse tools with a traditional database, the existing methods copy all of the data in the traditional database, which consumes large amounts of space and time and is inefficient.

To solve these problems, this thesis builds on the distributed file system HDFS and the data query system Impala, and uses a multi-level caching strategy to handle the storage and querying of single-source and multi-source data. It also studies and tests cross-platform query methods between a traditional database and warehouse tools based on a distributed file system.

The main work of this thesis covers three aspects. First, a query method for single-source data: the format of the streaming data is converted, a cache mechanism is used to write the data into the distributed file system, and Impala is then used to perform real-time queries.
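The small-file problem and the cache-based write path described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the class name `BatchingBuffer`, the thresholds `max_records` and `max_age_s`, and the `flush` callback (which here only appends to a list instead of writing a file to HDFS) are all assumptions made for illustration.

```python
import time
from typing import Callable, List


class BatchingBuffer:
    """Hypothetical write cache: collect small records in memory and hand
    them to a writer as one larger batch, so that each write can become
    one reasonably sized file instead of many tiny ones."""

    def __init__(self, flush: Callable[[List[str]], None],
                 max_records: int = 1000, max_age_s: float = 5.0) -> None:
        self._flush = flush              # assumed writer, e.g. one file per batch
        self._max_records = max_records  # size threshold (assumed value)
        self._max_age_s = max_age_s      # age threshold in seconds (assumed value)
        self._records: List[str] = []
        self._first_ts: float = 0.0

    def append(self, record: str) -> None:
        if not self._records:
            # Remember when the oldest buffered record arrived.
            self._first_ts = time.monotonic()
        self._records.append(record)
        too_many = len(self._records) >= self._max_records
        # The age limit is only checked when a new record arrives.
        too_old = time.monotonic() - self._first_ts >= self._max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        """Hand the buffered records to the writer and start a new batch."""
        if self._records:
            batch, self._records = self._records, []
            self._flush(batch)


# Usage: a list stands in for the file system; each element is one "file".
written: List[List[str]] = []
buf = BatchingBuffer(written.append, max_records=3, max_age_s=60.0)
for i in range(7):
    buf.append(f"record-{i}")
buf.flush()  # write out the final partial batch
```

With these assumed thresholds, seven incoming records result in three batch writes rather than seven single-record writes. A real deployment would have to choose the thresholds against the query latency it must meet, since records remain invisible to file-based queries until their batch is flushed.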
Second, real-time querying of multi-source data and multi-level cache optimization: when there is more than one data source, multi-source single-queue and multi-source multi-queue schemes are used to distinguish the different data sources. A multi-level cache optimization strategy is also proposed, which supports querying data in the cache and can thus greatly improve query performance. Third, a query method across a traditional database and data warehouse tools, which addresses the need for queries that span the two. The proposed method transfers the query results from the traditional database to Impala, stores them as a temporary table, and then completes the query on Impala.

The main innovative contributions of this thesis are as follows. First, a cache-based query method for single-source data is proposed, which uses a caching mechanism to store the data and applies Impala to query it in a timely manner. Second, query methods for multi-source data are adopted; these extend the single-source method and improve the application's scalability to multiple data sources. A multi-level cache optimization strategy is also proposed, which supports querying data in the cache and can greatly improve query performance. Third, a method for querying data across a traditional database and data warehouse tools is proposed: it transfers the query results from the traditional database to Impala, stores them as a temporary table, and then completes the query on Impala. This greatly reduces the amount of data transferred between the two platforms, saves much of the time and storage space needed for the transfer, and improves query efficiency.

Finally, the presented methods are evaluated experimentally on the cluster framework at the CPU Center of Tsinghua University. The experimental results demonstrate the effectiveness of these methods.
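The cross-platform idea in the third contribution, moving only a query result into a temporary table rather than copying the whole source table, can be sketched as follows. This is an illustration under stated assumptions, not the thesis's code: two in-memory SQLite databases from Python's standard library stand in for the traditional database and for Impala, and the tables `devices` and `readings`, the temporary table `tmp_devices`, and the function `cross_platform_query` are hypothetical.

```python
import sqlite3

# Stand-in for the traditional database (hypothetical table and rows).
traditional = sqlite3.connect(":memory:")
traditional.execute("CREATE TABLE devices (device_id INTEGER, region TEXT)")
traditional.executemany(
    "INSERT INTO devices VALUES (?, ?)",
    [(1, "north"), (2, "south"), (3, "north")],
)

# Stand-in for the warehouse side (SQLite here, NOT Impala).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE readings (device_id INTEGER, value REAL)")
warehouse.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 10.0), (1, 14.0), (2, 7.0), (3, 5.0)],
)


def cross_platform_query(region: str):
    # Step 1: run the selective part of the query on the traditional side.
    rows = traditional.execute(
        "SELECT device_id FROM devices WHERE region = ?", (region,)
    ).fetchall()

    # Step 2: transfer only that result and store it as a temporary table.
    warehouse.execute("DROP TABLE IF EXISTS tmp_devices")
    warehouse.execute("CREATE TEMP TABLE tmp_devices (device_id INTEGER)")
    warehouse.executemany("INSERT INTO tmp_devices VALUES (?)", rows)

    # Step 3: finish the query on the warehouse side with a join.
    return warehouse.execute(
        "SELECT r.device_id, AVG(r.value) "
        "FROM readings r JOIN tmp_devices t ON r.device_id = t.device_id "
        "GROUP BY r.device_id ORDER BY r.device_id"
    ).fetchall()


result = cross_platform_query("north")
```

In this toy data set only the two matching `device_id` values cross between the databases, instead of the full `devices` table. The saving grows with the selectivity of the filter applied on the traditional side.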
Keywords/Search Tags: Streaming Data, Real-time Query, Cache Strategy, Heterogeneous Databases, Related Queries