Design And Implementation Of Columnar Storage System For User Behavior Logs

Posted on: 2023-11-04
Degree: Master
Type: Thesis
Country: China
Candidate: F J Wang
GTID: 2568307298455134
Subject: Computer technology
Abstract/Summary:
User behavior logs have typical big data characteristics: they are multi-attribute, real-time, and large in volume. They are mainly used in scenarios such as data analysis and decision support. Queries in these scenarios exhibit locality: only a few to a few dozen attributes are queried out of many, and batch sequential reads require high throughput. For this reason, user behavior log data is mostly stored on HDFS, with the columnar file formats ORC or Parquet as the underlying data organization structure, so that query engines such as Hive and Spark can access the data. This thesis aims to improve on the design and implementation of ORC and Parquet in terms of index structure, data distribution, and data compression. It proposes a multi-level indexed columnar storage file model and a columnar storage compression method based on distribution characteristics, and builds a prototype columnar storage system for user behavior logs. The specific work includes:

(1) Requirement analysis and outline design of the columnar storage system for user behavior logs: Based on the system business architecture, the thesis analyzes functional requirements such as data reading, writing, and collection, and non-functional requirements such as read and write efficiency, compression rate, and fault tolerance. It identifies the key features of the "user object-event" data model, studies the columnar storage mode in combination with the partial-access application scenario, and analyzes how data is read and written in that mode. Based on the system business architecture, it designs the system hierarchy and outlines the data acquisition module, the read-write module, and the data partition scheme.

(2) Multi-level index columnar storage file model design: A multi-level index columnar storage file model, Blade File, is proposed to speed up the location of target data. The design takes into account the additional I/O load caused by reading irrelevant data during index lookup, and stores statistics of the relevant attributes in the index to optimize the associated grouping and aggregation queries.

(3) Columnar storage compression method based on distribution characteristics: In columnar storage, all values in a column share the same data type, and after a column is divided horizontally, its local regions have similar distribution characteristics. The thesis therefore proposes a columnar storage compression method based on distribution characteristics, which dynamically selects a compression algorithm according to the data type, data entropy, and distribution characteristics of each local column region, in order to improve the space utilization of the system.

(4) Implementation and testing of the columnar storage prototype system: With the designed columnar file model and compression method as the core, Hive is selected as the data access tool and its underlying storage structure is replaced with Blade File. Combined with Hadoop, Kafka, and other components, this forms a columnar storage system that integrates collection and storage. The implementation focuses on the acquisition module and the storage module, and functional and non-functional tests are performed on the prototype system with Blade File as the data storage structure. The test results show that Blade File has higher query efficiency than ORC and Parquet. At the same time, the storage system with Blade File as the underlying storage structure has high read and write efficiency and can meet actual business needs.
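The idea in item (3), choosing a compression scheme per local column region from its data type, entropy, and distribution, can be illustrated with a minimal Python sketch. This is not the thesis's actual method: the function names, the encoding labels ("rle", "delta", "dictionary", "general"), and every threshold below are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def shannon_entropy(values: Sequence) -> float:
    """Shannon entropy (bits per value) of the empirical value distribution."""
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def average_run_length(values: Sequence) -> float:
    """Mean length of runs of consecutive equal values."""
    if not values:
        return 0.0
    runs = 1
    for prev, cur in zip(values, values[1:]):
        if cur != prev:
            runs += 1
    return len(values) / runs


def choose_encoding(values: Sequence) -> str:
    """Pick an encoding label for one local column region.

    All thresholds are illustrative assumptions, not values from the thesis.
    """
    if not values:
        return "plain"
    # Distribution: long runs of repeated values favor run-length encoding.
    if average_run_length(values) >= 4.0:
        return "rle"
    # Data type: integer regions with small neighbor differences favor delta encoding.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        deltas = [b - a for a, b in zip(values, values[1:])]
        if deltas and max(abs(d) for d in deltas) <= 127:
            return "delta"
    # Entropy: few distinct values (low entropy) favor dictionary encoding.
    if shannon_entropy(values) <= 4.0:
        return "dictionary"
    # Otherwise fall back to a general-purpose compressor.
    return "general"
```

Under these assumed thresholds, a region such as `["click"] * 8 + ["view"] * 8` is labeled "rle", while `list(range(1000, 1100))` is labeled "delta". A real system would tune the thresholds against measured compression ratios and decoding cost.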
Keywords/Search Tags: Local access, Columnar storage, Multi-level index, Columnar compression method