
Research on Spark-Based Time Series Reorganization of Astronomical Catalogs without Preloading Original Files

Posted on: 2021-02-17    Degree: Master    Type: Thesis
Country: China    Candidate: L Y Fu    Full Text: PDF
GTID: 2480306317467704    Subject: Computer application technology
Abstract/Summary:
Time-series data reorganization is the most fundamental data-processing step for time-domain analysis in time-domain astronomy. Because observation equipment in time-series astronomy samples frequently along the time axis, it produces massive volumes of time-series data. Reorganizing such data with traditional scientific computing methods is usually time-consuming, inefficient, and constrained by storage space. Meanwhile, with the rise of distributed computing, frameworks such as Spark are well suited to astronomical big data, but distributed processing normally requires transferring the original data into a dedicated file system, which incurs large space and time costs. To address this problem, this thesis builds on previous work and on the characteristics of the Spark computing framework and, driven by the requirements of time-domain astronomy in the era of scientific big data, carries out theoretical and technical research on data indexing, storage I/O optimization, and cross-identification algorithm optimization. It solves the batch-oriented time-series reorganization of observation catalog data while balancing execution efficiency, identification accuracy, and storage space. The design focuses on two aspects.

The first is the optimized fetching of the original star catalog files. In terms of data-access optimization, pointers into the raw data files are used instead of loading all data into a dedicated file system, avoiding the space pressure of large-scale data transfer. Indexing the raw data and optimizing the storage layout improve the spatial locality of the cross-identification computation, reduce data transmission, and keep cluster resources evenly utilized when large-scale time-series data requests are generated, so that execution efficiency and storage space are optimized together.

The second is the optimization of the Spark cross-identification computation. At the algorithm level, a HEALPix-based data filtering strategy is proposed to narrow the range of the identification computation, edge data are handled by overlapping transmission with computation, and shuffle optimization on the Spark platform reduces time consumption. Together these measures ensure both the accuracy and the efficiency of cross-identification, making it possible to identify large batches of stars efficiently.

In addition, the research combines theory with practice. Through the development of the time-series reorganization of AST3 telescope star catalog data, the various research results of the design are verified in an integrated way. The experimental results show that the design achieves large-scale time-series reorganization of star catalog data, effectively improves the generation efficiency of astronomical time-series data products, and promotes the rapid development of time-domain astronomy research in China.
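To make the "no preloading" fetch idea concrete, the following is a minimal PySpark sketch, assuming the original catalogs are plain-text files reachable from every executor (for example over a shared POSIX mount). The file names, the index layout, and the parsing step are illustrative assumptions, not the thesis implementation.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalog-fetch-sketch").getOrCreate()
sc = spark.sparkContext

# Lightweight index: (file path, byte offset, number of bytes) for each catalog block.
# In practice this index would be built once from the layout of the raw catalog files.
block_index = [
    ("/data/ast3/catalog_20170101.cat", 0,       1 << 20),
    ("/data/ast3/catalog_20170101.cat", 1 << 20, 1 << 20),
    ("/data/ast3/catalog_20170102.cat", 0,       1 << 20),
]

def read_block(entry):
    """Open the original file on the executor and read only the indexed byte range,
    instead of first copying whole files into a dedicated file system."""
    path, offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read(length)
    # Split the chunk into catalog rows (row format is an assumption).
    return chunk.decode("utf-8", errors="ignore").splitlines()

# Each index entry becomes one task; only the referenced byte ranges are transferred.
rows = sc.parallelize(block_index, numSlices=len(block_index)).flatMap(read_block)
print(rows.count())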
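The HEALPix filtering strategy can be sketched in the same spirit: restrict the catalog to the pixel containing a target plus its neighbours before any fine positional matching. The NSIDE value, the column names, and the Parquet layout below are assumptions for illustration only.

import healpy as hp
import numpy as np
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("healpix-filter-sketch").getOrCreate()

NSIDE = 1024  # assumed pixelisation resolution

def candidate_pixels(ra_deg, dec_deg):
    """Pixel containing the target plus its 8 neighbours; only these pixels need to be
    searched, which bounds the cross-identification workload for each source."""
    centre = hp.ang2pix(NSIDE, ra_deg, dec_deg, nest=True, lonlat=True)
    neighbours = hp.get_all_neighbours(NSIDE, ra_deg, dec_deg, nest=True, lonlat=True)
    return [int(p) for p in np.append(neighbours, centre) if p >= 0]

catalog = spark.read.parquet("/data/ast3/catalog_indexed.parquet")  # assumed layout with "hpx" column

# Restrict the catalog to the pixels around one target position before matching.
pixels = candidate_pixels(ra_deg=83.822, dec_deg=-5.391)
nearby = catalog.filter(F.col("hpx").isin(pixels))
print(nearby.count())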
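One common way to cut Spark shuffle cost in the matching step, sketched below, is to partition both catalogs on the HEALPix pixel id so that candidate pairs are co-located and the positional join happens within partitions. The column names, partition count, and file paths are assumptions; the thesis's specific shuffle optimization may differ.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-sketch").getOrCreate()

ref = spark.read.parquet("/data/ast3/reference.parquet")     # assumed reference catalog
obs = spark.read.parquet("/data/ast3/observations.parquet")  # assumed nightly catalog

# Hash-partition both sides on the pixel id with the same partition count, so the
# equi-join on "hpx" needs no further wide shuffle; fine positional matching would
# then run inside each pixel group.
ref_p = ref.repartition(256, "hpx")
obs_p = obs.repartition(256, "hpx")
candidates = obs_p.join(ref_p, on="hpx", how="inner")
print(candidates.count())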
Keywords/Search Tags: time-series data, Spark, fetch optimization, cross-identification, AST3