
Social Stream Generation For Database Benchmarking

Posted on: 2017-10-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C C Yu
GTID: 1318330512957588
Subject: Computer application technology
Abstract/Summary:
A social stream is a data stream that records a series of social entities and the dynamic relations among them. It can model changes of entity state in numerous applications, such as tweets and retweets published by users in social media, citations between scientific documents, and data transmissions among nodes in distributed systems. A social stream differs from traditional networks and data streams in that it combines graph and stream data: both the entities and the network structure change dynamically. Because it carries the complex characteristics of both graph and stream data, social stream data has great commercial and research value, and developing effective data management and data mining systems for social streams is a focus of both academia and industry. At present, a variety of techniques can be used to manage or process social streams, so it is vital to select an appropriate data generator for social data benchmarks. However, real-world social stream data is usually absent from existing benchmarks because of issues such as data privacy and the difficulty of transferring large-scale data. A social stream generator that can produce large-scale "realistic" data flexibly and efficiently is therefore of great significance.
This thesis explores techniques for generating social stream data for database benchmarks. The proposed method generates data that matches the characteristics of real data for different social stream types. To generate large-scale data with high throughput, we design and implement a system that generates social stream data in parallel. In addition, based on the proposed data generator, this thesis designs a benchmark for social media analysis. The main contributions are as follows:
1. A single-link social stream generation method is introduced, based on the human dynamic model and the temporal growing network model. A social item in a single-link social stream links to at most one historical item. The approach generates single-link social stream data chronologically through the iterative update of two buffer pools. One is the next-item pool, which stores the social items that each producer will post in the future. The other is the recent-item pool, which keeps the recent historical items within a specified window size. During the iterative update of the two pools, the human dynamic model generates social items without linkage information for each producer, and the temporal growing network model determines the linkage information of each item. Users can thus generate social stream data of a given size and a given data distribution by setting the configuration parameters. Experiments show that the proposed method generates realistic single-link social data continuously, with stable throughput and memory consumption.
2. A multi-link social stream generation method is proposed, based on the human dynamic model and the growing network model. Social items in a multi-link social stream can link to multiple historical items, which places new requirements on linkage generation. Building on the single-link approach, the same two buffer pools are used to generate multi-link social stream data chronologically. Both the extended temporal growing network model and the edge copying model are used to generate linkage information. Experiments show that data generated by the extended temporal growing network model better matches real data distributions, and that multi-link generation based on this model produces realistic data with stable throughput and memory consumption.
3. A parallel social stream generation system is implemented using the master-worker architecture. To generate large-scale social stream data with high throughput, this thesis designs and implements a parallel system for generating single-link and multi-link social streams, consisting of one master and several workers. Using the temporal growing network model to generate linkage information, each worker is responsible for generating the partial single-link or multi-link social stream created by the producers in its partition. The master merges the partial social streams from all workers into the global social stream. The implementation uses a parallel linkage generation method, an asynchronous model, and a delayed update strategy. Experiments show that the parallel generation system generates realistic single-link and multi-link social stream data, and that its throughput increases linearly with the number of workers.
4. Based on the social stream data generator, a benchmark for social media analysis is designed. Social media services have become some of the most popular services on the Internet, and social media data is a typical kind of social stream data. This thesis designs a benchmark for social media analysis called BSMA, which includes data feeding, a workload generator, and a performance testing tool. The data feeding part provides not only a real dataset from Sina microblog but also a social stream generator, BSMA-Gen, which generates data using the parallel generation system proposed in this thesis. The workload generator defines the social media data model and 24 query templates in four categories, and provides a parameter generator that produces parameter values for query tasks according to requirements. BSMA-Gen can serve as the data source for these queries, including the temporal and link-relation network queries on social streams. BSMA also provides a performance test tool: users first connect to the testing system with the tool, then configure and run tasks, after which the evaluation results are reported.
In summary, this thesis formally defines the social stream model and its related characteristics, and proposes the architecture, models, and generation algorithms for single-link and multi-link social streams. Users can configure the social stream generator to produce data of a specified type and with specified data distributions. To generate large-scale data with high throughput, this thesis designs and implements a parallel generation system. In addition, a benchmark for social media analysis is proposed on top of the social stream generator.
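The two-pool update loop described in contribution 1 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name, the parameters, and the `Item` record are hypothetical, a Pareto-distributed gap stands in for the human dynamic model, and a uniform choice among recent items stands in for the temporal growing network model. It assumes at least one producer.

```python
import heapq
import random
from collections import deque
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class Item:
    item_id: int
    producer: int
    time: float
    parent: Optional[int]  # id of the single linked historical item, or None


def generate_single_link(num_producers: int, num_items: int, window: float,
                         link_prob: float = 0.5, seed: int = 0) -> Iterator[Item]:
    """Yield items in chronological order using two buffer pools (illustrative)."""
    rng = random.Random(seed)

    def next_gap() -> float:
        # Assumption: a heavy-tailed gap as a stand-in for the human dynamic model.
        return rng.paretovariate(1.5)

    # Next-item pool: one pending (post_time, producer) entry per producer.
    next_pool = [(next_gap(), p) for p in range(num_producers)]
    heapq.heapify(next_pool)
    # Recent-item pool: items posted within the last `window` time units.
    recent = deque()

    for item_id in range(num_items):
        # The earliest pending item across all producers is emitted next.
        t, producer = heapq.heappop(next_pool)
        # Evict items that have fallen out of the time window.
        while recent and recent[0].time < t - window:
            recent.popleft()
        parent = None
        if recent and rng.random() < link_prob:
            # Assumption: uniform choice as a stand-in for the
            # temporal growing network model's linkage rule.
            parent = rng.choice(recent).item_id
        item = Item(item_id, producer, t, parent)
        recent.append(item)
        # Schedule this producer's next post back into the next-item pool.
        heapq.heappush(next_pool, (t + next_gap(), producer))
        yield item
```

In this sketch the next-item pool always holds exactly one entry per producer and the recent-item pool holds only items inside the window, so memory depends on the number of producers and the window size rather than on the total number of generated items.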
Keywords/Search Tags: Social Stream, Data Generator, Parallel Generation, Benchmark