Research On Data Mining-oriented Cloud Resource Deployment And Purchase Strategy

Posted on: 2013-06-28
Degree: Master
Type: Thesis
Country: China
Candidate: M M Ma
GTID: 2248330395960501
Subject: International Trade
Abstract/Summary:
Data mining is the process of extracting potentially useful information and knowledge from massive data sets to support users' decisions. As problem size and complexity keep growing, data mining tasks need high-quality computing resources and large amounts of storage to handle massive data from different regions and organizations. High-performance, high-capacity data mining systems cannot be widely adopted by small and medium enterprises because of technical or capital constraints. Cloud-based data mining services have therefore emerged as a new service mode that adapts better to growing data volumes and cross-regional business operations, and that offers better data sharing and timeliness at lower cost. With the widespread adoption of cloud-based CRM and ERP systems, some of the data of small and medium enterprises is already stored in the cloud, which lays a solid foundation for cloud-based data mining applications. However, deploying and running cloud-based data mining applications requires the collaboration of multiple distributed data centers, so inter-data-center data transfer and frequent task interaction are almost inevitable. Poorly chosen data layout and task scheduling policies can cause excessive cross-center data transfer and reduce the efficiency of the data mining application.
It is therefore important to study resource layout strategies for cloud computing. The main challenges of the cloud-based resource allocation problem are as follows: (1) some data sets must stay in a fixed location and cannot be moved because of ownership; (2) intermediate data generated during the execution of a data mining application also needs to be allocated; (3) a task can execute only after all the data it requires is available locally, so data must be transferred across data centers, which incurs transfer cost; (4) the execution time of a single task may vary with the processing efficiency of the data center; (5) the total completion time of the application depends not only on the execution time of each subtask but also on the sequential relationships among subtasks, so it may differ across workflow structures, which encode the input-output relationships among subtasks. Devising a data deployment and task scheduling strategy that reduces inter-data-center transfer and shortens the overall completion time of the application, and thereby optimizes both cost and efficiency, is a difficult problem. To address these issues, a dependency-based resource deployment strategy (comprising a data layout strategy and a task scheduling strategy) is proposed to reduce data transfer across data centers and shorten execution time. A critical path-based multi-instance combination purchase strategy is then proposed to reduce cost further. The main work includes: (1) A cloud-based data mining service model. (2) Modeling of the cloud-based resource deployment problem, which mainly includes data mining application modeling, cloud computing environment modeling, modeling of data transfer across data centers, and task completion time analysis. (3) A dependency-based cloud resource deployment strategy. ① Data dependency-based storage allocation strategy.
For initial data sets, the data deployment is optimized by considering the relationships among data sets, the size of each data set, and its location, with the aim of reducing the amount of cross-data-center data transfer. Intermediate data is deployed to the data center with the highest data-to-data-center association degree. ② Task scheduling strategy aimed at minimizing total execution time. When calculating the key indicator, total execution time, the thesis considers both single-task execution time and the specific workflow pattern rather than simply taking a sum or a mean. A task scheduling strategy that minimizes the total execution time is designed. ③ A data allocation and task scheduling strategy based on a genetic algorithm, which searches for the best solution, one that optimizes both cost and time. (4) A critical path-based multi-instance combination purchase method. The critical path method is used to distinguish tasks by their real-time requirements. For non-real-time tasks, combining multiple instances with different configurations and pricing models can further reduce the cost of cloud computing.
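To illustrate why total completion time depends on workflow structure rather than on a plain sum of task times, the following is a minimal sketch of the standard critical path method on a task dependency graph. It is not taken from the thesis: the function name critical_path, the example task names, and the durations are all hypothetical, and data transfer time between data centers is assumed to be zero for simplicity. Tasks with zero slack lie on the critical path; tasks with positive slack can be delayed without lengthening the workflow, which is the sense in which they could be treated as non-real-time.

```python
from collections import defaultdict, deque


def critical_path(durations, deps):
    """Return (makespan, slack) for a workflow DAG.

    durations: {task: execution time}         (hypothetical input format)
    deps:      {task: [predecessor tasks]}    (tasks absent from deps have none)
    """
    succs = defaultdict(list)
    indeg = {t: 0 for t in durations}
    for task, preds in deps.items():
        for p in preds:
            succs[p].append(task)
            indeg[task] += 1

    # Topological order (Kahn's algorithm).
    order = []
    queue = deque(t for t, d in indeg.items() if d == 0)
    while queue:
        t = queue.popleft()
        order.append(t)
        for s in succs[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                queue.append(s)
    if len(order) != len(durations):
        raise ValueError("workflow contains a cycle")

    # Forward pass: a task starts once all of its predecessors have finished.
    earliest_finish = {}
    for t in order:
        start = max((earliest_finish[p] for p in deps.get(t, [])), default=0)
        earliest_finish[t] = start + durations[t]
    makespan = max(earliest_finish.values())

    # Backward pass: latest finish that does not delay any successor.
    latest_finish = {}
    for t in reversed(order):
        latest_finish[t] = min(
            (latest_finish[s] - durations[s] for s in succs[t]),
            default=makespan,
        )

    slack = {t: latest_finish[t] - earliest_finish[t] for t in order}
    return makespan, slack


# Hypothetical five-task data mining workflow (times in arbitrary units).
durations = {"load": 2, "clean": 3, "features": 4, "cluster": 6, "report": 1}
deps = {
    "clean": ["load"],
    "features": ["load"],
    "cluster": ["clean", "features"],
    "report": ["cluster"],
}

makespan, slack = critical_path(durations, deps)
print(makespan)                # 13, although the task times sum to 16
print(slack)                   # only "clean" has positive slack (1)
```

In this example "clean" and "features" can run in parallel, so the completion time is governed by the longer branch. Under the stated assumptions, "clean" is the only task that could tolerate a slower or cheaper instance without delaying the workflow.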
Keywords/Search Tags: data mining, cloud computing, data allocation, task scheduling, multi-instance combination