
An Elastic Parallel Framework for Realistic Stream Benchmark Generation

Posted on: 2016-11-23
Degree: Master
Type: Thesis
Country: China
Candidate: L Gu
Full Text: PDF
GTID: 2308330461975931
Subject: Software engineering
Abstract/Summary:
The Internet brings better services to the public, but it also brings a data deluge. The value hidden in this data is enormous, which poses a huge challenge for data analysis: more efficient data mining algorithms, more widely applicable data analysis systems, and so on. Studying and developing an efficient and applicable big data analysis platform costs considerable manpower, material resources and time. However, because data volume is growing exponentially, it is hard to predict what users will demand four or five years from now, especially given the increasingly complex dependencies in real data. To test the applicability and stability of a system, it is therefore necessary to generate realistic data representing the next four or five years. Traditional data generators, however, generate data only according to a static distribution, which is not enough to test such systems. To test algorithms and systems effectively, the generated data should preserve as many properties of the real data as possible, e.g. the data distribution, the correlation between attributes, and the temporal dependency. The temporal dependency is vital because it reflects the influence of latent factors. For example, the announcement of an earthquake triggers retweeting in social networks, and the bankruptcy of Lehman Brothers led to huge fluctuations in stock prices and even across the financial sector.

For universal and realistic data generation, we provide an elastic distributed generation framework that can take any stream data set as input. The framework analyzes the distribution, attribute correlations and temporal dependency of the real data, preserves these properties as far as possible, and then generates realistic synthetic data streams at user-defined velocities. To improve the universality of the generation, we extract latent factors based on the temporal LDA model.

In this paper, we propose a new generation framework that can simulate realistic and fast data streams in an elastic manner. We name the framework Chronos.
Given a sample database with timestamps, Chronos can generate a realistic synthetic database while keeping the correlation between attributes and the dependency between snapshots. In addition, Chronos supports domain expansion and can preserve order statistics. To achieve a realistic database, we make the following contributions:
● A universal data generating framework: the framework defines a language for users and patterns of dependency among multiple tables. It is equipped with a user-extensible model that allows users to define more patterns.
● A data schema decomposition algorithm: the algorithm partitions the relation schema into groups, minimizing the correlation lost between groups and maximizing the correlation within each group.
● A new temporal dependency extraction and simulation method: the method is based on the standard LDA model and extracts the dependency between snapshots. In addition, it preserves and simulates this dependency while keeping the order statistics of the synthetic data consistent with those of the real data.
● An elastic generating paradigm: given the distributions in one snapshot, the algorithm can generate data quickly. According to the user-defined velocities for all the snapshots, multiple nodes generate data with excellent load balancing and minimal synchronization cost.
● A domain expansion algorithm: based on the distribution of the real data, the algorithm adds new attribute values while preserving that distribution.
● Various experiments: multiple real data sets are used to test every aspect of Chronos by comparing the synthetic data with the real data. In addition, the elastic generation and domain expansion functionalities are tested. Comparison experiments with other generators demonstrate the efficiency and applicability of Chronos.
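The elastic generating paradigm listed above can be illustrated with a minimal sketch. It is not Chronos's actual algorithm, which the abstract does not specify. It assumes that a snapshot is described by a single categorical distribution, that the user-defined velocity is a number of tuples per snapshot, and that load balancing means an even split of that quota across nodes. All function names (`split_quota`, `generate_on_node`, `generate_snapshot`) are hypothetical.

```python
import random


def split_quota(total, num_nodes):
    """Split `total` tuples across nodes so that shares differ by at most one."""
    base, extra = divmod(total, num_nodes)
    return [base + (1 if i < extra else 0) for i in range(num_nodes)]


def generate_on_node(distribution, count, seed):
    """Draw `count` values from a categorical distribution {value: probability}.

    Each node uses its own seeded generator, so nodes need no coordination
    while sampling (an assumption of this sketch).
    """
    rng = random.Random(seed)
    values = list(distribution)
    weights = [distribution[v] for v in values]
    return rng.choices(values, weights=weights, k=count)


def generate_snapshot(distribution, velocity, num_nodes, snapshot_id=0):
    """Generate `velocity` tuples for one snapshot using `num_nodes` workers.

    The workers are simulated sequentially here; in a real deployment each
    call to generate_on_node would run on a separate node.
    """
    quotas = split_quota(velocity, num_nodes)
    batches = [
        generate_on_node(distribution, quota, seed=snapshot_id * num_nodes + node)
        for node, quota in enumerate(quotas)
    ]
    return quotas, [value for batch in batches for value in batch]


if __name__ == "__main__":
    # Hypothetical snapshot distribution over three attribute values.
    dist = {"a": 0.5, "b": 0.3, "c": 0.2}
    quotas, data = generate_snapshot(dist, velocity=10, num_nodes=3)
    print(quotas, len(data))
```

In this sketch the only synchronization point is the initial quota split, which is one simple way to keep coordination cost low. Changing `velocity` or `num_nodes` between snapshots only changes the quotas, not the sampling code.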
Keywords/Search Tags:Chronos, Data Generator, Column Correlation, Temporal Dependency, Order Statistics, Domain Expansion, Elastic Generation