
Research On Storage And Transmission Optimization Of Containerized Data Science Workflow System

Posted on: 2022-11-22
Degree: Master
Type: Thesis
Country: China
Candidate: M J Li
Full Text: PDF
GTID: 2518306722471924
Subject: Master of Engineering
Abstract/Summary:
With the continuous development of data science, researchers need to analyze large amounts of data to find patterns and then propose solutions to specific problems. A data science workflow usually involves steps such as data preparation, environment preparation, data analysis, and result analysis. This thesis addresses three problems in current data science workflows. First, many researchers share the physical resources of a cluster, which makes maintaining each researcher's private development environment difficult; switching environments or installing new software also costs developers considerable time and effort. Second, when multiple researchers share cluster resources, they often generate large amounts of redundant data, which puts heavy storage pressure on the cluster. Third, with traditional transmission methods, both transmitting a single data file and distributing it to multiple nodes in the cluster are relatively slow, and when a network failure interrupts a transfer, the file usually has to be retransmitted from the beginning, which lowers transmission reliability.

To solve these problems, this thesis designs and implements a containerized distributed data science workflow platform that keeps researchers from spending large amounts of time on environment preparation. It also proposes optimization methods for data storage and data transmission within the platform, which reduce the generation of redundant data and the storage pressure on the cluster, and improve the reliability and speed of transmitting a single file and distributing it to multiple nodes. The specific work and contributions are as follows:

1. Proposed storage optimization strategies and methods for a containerized data science workflow platform. In the context of containerized data science workflows, the data storage method is optimized to reduce redundant data and relieve storage pressure on the cluster. Specifically, this thesis creates a dynamic merge layer based on OverlayFS: the folders a user needs are merged and mounted into the container as file volumes. Exploiting OverlayFS copy-on-write, when multiple users analyze the same dataset, each user receives an independent writable layer, which reduces the generation of redundant data. In addition, the storage of low-frequency (rarely accessed) files is optimized by storing them with erasure coding, which, while still guaranteeing high reliability, trades efficient recovery for a smaller storage footprint, further reducing the storage pressure on the cluster. Sketches of both ideas follow this item.
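To illustrate the merge-layer idea, the following is a minimal sketch of how a per-user OverlayFS mount could be assembled. The directory layout, the direct call to mount, and the function name prepare_user_workspace are hypothetical illustrations, not taken from the thesis.

```python
import subprocess
from pathlib import Path

def prepare_user_workspace(user: str, shared_datasets: list[str]) -> Path:
    """Merge read-only shared dataset folders into one view and give
    `user` a private copy-on-write layer on top (hypothetical layout)."""
    base = Path("/srv/workspaces") / user
    upper = base / "upper"    # per-user writable layer (copy-on-write target)
    work = base / "work"      # OverlayFS internal scratch directory
    merged = base / "merged"  # unified view to mount into the container
    for d in (upper, work, merged):
        d.mkdir(parents=True, exist_ok=True)

    # Shared datasets become read-only lower layers; writes land in `upper`,
    # so users analyzing the same dataset never duplicate its contents.
    # Requires root (CAP_SYS_ADMIN) on a Linux host.
    lowerdir = ":".join(shared_datasets)
    subprocess.run(
        ["mount", "-t", "overlay", "overlay",
         "-o", f"lowerdir={lowerdir},upperdir={upper},workdir={work}",
         str(merged)],
        check=True,
    )
    return merged  # bind-mount this path into the user's container

# Example: two users share one dataset but write independently.
# prepare_user_workspace("alice", ["/srv/datasets/imagenet"])
# prepare_user_workspace("bob",   ["/srv/datasets/imagenet"])
```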
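The cold-file tradeoff can be made concrete with Reed-Solomon coding, one common erasure-coding scheme; the thesis does not name the exact code or library used, so the reedsolo Python package and the parity parameters below are assumptions for illustration only.

```python
# pip install reedsolo  (illustrative choice; not named in the thesis)
from reedsolo import RSCodec

# 32 parity bytes per 255-byte codeword: up to 16 corrupted bytes per
# codeword are recoverable at unknown positions.
rsc = RSCodec(32)

cold_file = b"infrequently accessed dataset contents " * 100
encoded = rsc.encode(cold_file)  # data + parity, spread across nodes

# Storage cost versus 3-way replication for comparable durability:
#   replication: 3.00x the original size
#   RS(223 data + 32 parity): 255/223 ~= 1.14x the original size
print(f"replicated:    {3 * len(cold_file)} bytes")
print(f"erasure coded: {len(encoded)} bytes")

# Simulate damage and recover; decoding is the slow path that is
# deliberately traded away for the smaller footprint.
damaged = bytearray(encoded)
damaged[0:16] = b"\x00" * 16
recovered = rsc.decode(damaged)[0]  # reedsolo >= 1.0 returns a tuple
assert bytes(recovered) == cold_file
```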
2. Proposed data transmission optimization strategies and methods for the containerized data science workflow platform. In the context of containerized data science workflows, both the speed and the reliability of data transmission are optimized. For speed, this thesis transmits a single file as fragments over multiple threads; in addition, when a user wants to analyze the same dataset with several different algorithms, the platform quickly finds the relatively idle machines in the cluster and uses a dynamic-programming-optimized transmission strategy to deliver the required data files to the designated nodes, so that multiple tasks can be started quickly and files are transmitted and distributed efficiently. For reliability, the fragmented transmission method allows a transfer interrupted by a network failure or other problem to resume from the point of interruption instead of restarting, which improves the reliability of data transmission. Both mechanisms are sketched below, after item 3.

3. Designed and implemented a containerized distributed data science workflow platform. Centered on the three major stages of "data and environment preparation - algorithm preparation - data analysis" and following the idea of a data science workflow, this thesis builds a distributed data science workflow platform that supports data management, automatic environment construction, and iterative data analysis. Functional tests and optimization tests of the platform verify the feasibility of the storage and transmission optimization methods above and the effectiveness of the containerized distributed data science workflow platform.
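To make the multi-threaded fragmented transfer concrete, here is a minimal sketch that downloads one file as parallel byte-range fragments. The thesis does not specify its transfer protocol, so the use of HTTP Range requests, the requests library, and the fragment size are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

CHUNK = 8 * 1024 * 1024  # 8 MiB fragments (illustrative choice)

def fetch_range(url: str, path: str, start: int, end: int) -> None:
    """Download bytes [start, end] and write them at the right offset."""
    headers = {"Range": f"bytes={start}-{end}"}
    resp = requests.get(url, headers=headers, timeout=60)
    resp.raise_for_status()
    with open(path, "r+b") as f:
        f.seek(start)
        f.write(resp.content)

def parallel_download(url: str, path: str, workers: int = 8) -> None:
    # Assumes the server reports Content-Length and honors Range requests.
    size = int(requests.head(url, timeout=60).headers["Content-Length"])
    # Pre-size the file so every thread can write its fragment in place.
    with open(path, "wb") as f:
        f.truncate(size)
    ranges = [(s, min(s + CHUNK, size) - 1) for s in range(0, size, CHUNK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_range, url, path, s, e) for s, e in ranges]
        for fut in futures:
            fut.result()  # re-raise any fragment failure

# parallel_download("https://example.org/dataset.bin", "dataset.bin")
```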
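For the reliability side, fragmented transfer makes resumption cheap: after an interruption, only the bytes that never reached disk need to be requested again. Below is a minimal single-stream resume sketch, again assuming an HTTP server that honors Range requests; the function name and details are illustrative, not the thesis's implementation.

```python
import os

import requests

def resume_download(url: str, path: str) -> None:
    """Continue a download from however many bytes already reached disk."""
    done = os.path.getsize(path) if os.path.exists(path) else 0
    total = int(requests.head(url, timeout=60).headers["Content-Length"])
    if done >= total:
        return  # already complete
    headers = {"Range": f"bytes={done}-"}  # request only the missing tail
    with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
        resp.raise_for_status()  # expect 206 Partial Content from the server
        with open(path, "ab") as f:  # append after the interruption point
            for block in resp.iter_content(chunk_size=1 << 20):
                f.write(block)

# Re-running resume_download after a network failure picks up where the
# previous attempt stopped instead of retransmitting the whole file.
```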
Keywords/Search Tags:Container Cloud, Data Science Workflow, Distributed System, Storage Optimization, Data Transmission