
Research On Efficient Distributed Storage And Query Algorithm For Real-time Data Stream

Posted on: 2021-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: F J Lin
Full Text: PDF
GTID: 2428330611467552
Subject: Computer technology
Abstract/Summary:
Streaming data is a data paradigm found widely in current application systems. It is not only large in volume but must also be visible in real time, and it has the greatest processing value when it is first generated. Over the past decade, real-time retrieval of streaming data has become an important research topic in industrial Internet fields such as cloud computing and IoT. However, existing research on distributed stream processing still faces the following challenges: 1) how to reduce the index update and data storage overhead incurred when data is inserted into the system, and how to make full use of indexes when building the data storage model; 2) how to reduce the time overhead of merging newly arrived data with historical data; 3) how to reduce the latency of complex aggregate queries by making full use of indexes built on non-primary-key attributes and of the parallel processing capabilities of distributed systems. To address these problems, this paper proposes an efficient distributed solution that enables the system to support the insertion of millions of data elements per second and aggregate range queries at millisecond granularity. The main work of the paper is as follows.
First, to address the performance bottleneck caused by index updates and node splits during storage, an index layout scheme based on the Template B+ Tree is proposed. At the same time, a data domain partitioning method based on the primary key and time attributes is designed. Two-dimensional intervals are constructed from the spatio-temporal characteristics of streaming data, and new and historical data are distributed to different components for parallel processing, which avoids the overhead of merging new and old data in the index. An index template is constructed for each data interval, which avoids the time overhead of unnecessary index node splits. The template index information is also used to design the storage structure and a group compression algorithm that keep index writing and storage efficient, thereby ensuring highly concurrent data writes.
Then, to give the system low-latency aggregate query capability, this paper proposes an effective multi-level indexing scheme on non-primary-key attributes. Using data region partitioning and the model's topology design, a query request is parsed into independently executed subqueries, and the aggregate query is parsed into corresponding predicate functions through the multi-level index, making full use of the concurrent processing resources of the distributed cluster. To provide data locality and fault tolerance, a local cache algorithm is designed for storing high-frequency data, and query meta-information is used to provide fault-tolerant recovery when a query fails.
Finally, this paper implements a prototype distributed stream data processing system based on Storm and evaluates its performance on real data sets. First, the impact of the Template B+ Tree based index compression storage and data layout capacity on system performance is evaluated. Next, the improvement in query efficiency from the multi-level index is verified. Last, the data write rate and range query performance of the overall system are compared with the advanced data storage solutions HBase and DITIR. Comparative experiments on index overhead and aggregate queries show that, compared with existing advanced systems, the proposed model performs better in both data insertion and querying.
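The abstract does not give the details of the partitioning scheme, so the following is only a minimal sketch of the general idea of dividing the data domain by primary key and time, and of splitting a range query into subqueries that can run independently. The names `Region`, `GridPartitioner`, `region_of` and `subqueries`, the fixed time-window width, and the static key split points are all assumptions made for illustration and are not taken from the thesis.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A rectangle in the (key, time) plane; lower bounds inclusive, upper exclusive."""
    key_lo: int
    key_hi: int
    t_lo: int
    t_hi: int


class GridPartitioner:
    """Hypothetical fixed-grid partitioner over primary key and timestamp."""

    def __init__(self, key_bounds, time_width):
        # key_bounds: sorted split points, e.g. [0, 100, 200] -> [0,100), [100,200)
        self.key_bounds = key_bounds
        # time_width: length of each time window (assumed constant here)
        self.time_width = time_width

    def region_of(self, key, ts):
        """Return the region that an incoming tuple (key, ts) is routed to."""
        i = bisect_right(self.key_bounds, key) - 1
        if i < 0 or i >= len(self.key_bounds) - 1:
            raise ValueError("key outside partitioned domain")
        t_lo = (ts // self.time_width) * self.time_width
        return Region(self.key_bounds[i], self.key_bounds[i + 1],
                      t_lo, t_lo + self.time_width)

    def subqueries(self, key_lo, key_hi, t_lo, t_hi):
        """Split the query [key_lo, key_hi) x [t_lo, t_hi) into per-region pieces."""
        pieces = []
        for i in range(len(self.key_bounds) - 1):
            k0, k1 = self.key_bounds[i], self.key_bounds[i + 1]
            if k1 <= key_lo or k0 >= key_hi:
                continue  # this key interval does not overlap the query
            t = (t_lo // self.time_width) * self.time_width
            while t < t_hi:
                # Clip the grid cell to the query rectangle.
                pieces.append(Region(max(k0, key_lo), min(k1, key_hi),
                                     max(t, t_lo), min(t + self.time_width, t_hi)))
                t += self.time_width
        return pieces


partitioner = GridPartitioner(key_bounds=[0, 100, 200], time_width=60)
cell = partitioner.region_of(key=150, ts=125)
parts = partitioner.subqueries(key_lo=50, key_hi=150, t_lo=30, t_hi=90)
```

In this sketch each returned piece touches exactly one grid cell, so the pieces could be dispatched to different workers and their partial aggregates merged afterwards. A real system would also need to handle skewed keys and the routing of recent versus historical windows, which the sketch leaves out.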
Keywords/Search Tags:stream data, data partition, store, secondary index, distributed system