
Research On Data Mining Execution Process Model In Grid Environments

Posted on: 2013-06-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 1228330395967902
Subject: Computer Science and Technology
Abstract/Summary:
With the development of information technology, large amounts of data are produced by different applications and accumulated at different locations in a distributed way. How to extract useful, hidden knowledge and patterns from the accumulated data is one of the most challenging issues. Grid technology enables collaboration and sharing among distributed and heterogeneous resources, so applying data mining in a Grid provides an effective way to extract knowledge from large-scale, geographically distributed data. Because data mining is a non-trivial process composed of many operations executed on large amounts of data, combining data mining with the Grid inevitably increases the complexity of data mining processes. In previous research, the data mining process has been treated as a set of independent black-box algorithms whose functionality and intermediate steps are hidden from applications. As a result, the execution processes of data mining are invisible to users and environments; data mining algorithms designed for centralized environments cannot be automatically transformed into processes that execute in distributed environments according to the distributed resources; and users cannot control data mining execution. Moreover, the separation between the interfaces for data mining services and those for Grid services makes it inconvenient for users to access data mining services in the Grid. Consequently, data mining cannot work as efficiently as expected in Grid environments.
The problem arises in a railway freight application system: given the Railway Freight Grid, how can distributed computational resources be used efficiently to extract knowledge from the freight data distributed across railway bureaus in order to support decision making?

In our approach, data mining algorithms are decomposed into execution process models composed of finer-grained data mining operators. The models are then optimized according to the distribution of data and computational resources in the Grid to obtain distributed data mining execution process models. The execution engines schedule the models and assign the tasks to different nodes in the Grid, and users obtain the data mining results via unified, Grid-compliant interfaces. In this thesis, the approach is applied on the Grid to the following data mining algorithms: association rule mining, sequential pattern mining, the CART classifier and the naive Bayesian classifier.

The major contributions of this thesis include:

· A data mining execution process model, composed of finer-grained data mining operators, that describes the execution process of data mining algorithms. Through the execution process models, users, applications and execution environments can see the intermediate steps and intermediate results. The data mining operators are evaluated on simulation data in experiments executed in a centralized environment, and the results show that the data mining execution process model exposes the execution of every step of a data mining algorithm.

· An optimization algorithm that transforms data mining execution process models into distributed ones that can execute in the Grid. The optimization algorithm is divided into three sub-processes: data localization, global optimization and local optimization. In every sub-process, data mining operators are optimized according to the type of operator and the distribution of data. The distributed data mining execution process models are evaluated on simulation data in the Grid, and the results show that the distributed models execute with shorter response times and use computational resources in a more balanced way than centralized processing.

· The DMEP engine, which provides a runtime environment for data mining execution process models in the Grid. Within the engine, two components are proposed: (a) a scheduling algorithm that assigns flow chains to Grid nodes, and (b) a WSRF-based model execution service and process control service that let users control the execution of flows. The response time of flow chains is evaluated on simulation data when distributed data mining execution processes are scheduled by the DMEP engine in the Grid. An application example on predicting major railway freight clients is described; it uses freight waybill data and is deployed on the Railway Freight Grid test bed.

· The interface specification for accessing data mining services in the Grid, defined by OGSA-WS-DAI-DM, which enables the seamless combination of data mining services and the Grid: users can access data mining services in the same way as they access other services provided by the Grid. An application example shows how to use WS-DAI-DM, and WS-DAI-DM has been submitted to the Open Grid Forum.

The conclusion and proposals for future work are given at the end of the thesis.
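The idea of decomposing a mining algorithm into operators, and of splitting those operators into per-node and global steps, can be pictured with a small sketch. The code below is only an illustration under stated assumptions, not the thesis's model or its optimization algorithm: the operator names (`count_itemsets`, `merge_counts`, `filter_frequent`), the bureau names and the toy transactions are all hypothetical, and the per-node step runs in a plain loop rather than on Grid nodes. It counts item pairs where each partition of data resides, merges the partial counts, and checks that the result matches centralized processing.

```python
from collections import Counter
from itertools import combinations


def count_itemsets(transactions, size):
    """Local operator (hypothetical): count itemsets of a given size in one partition."""
    counts = Counter()
    for transaction in transactions:
        for itemset in combinations(sorted(set(transaction)), size):
            counts[itemset] += 1
    return counts


def merge_counts(partial_counts):
    """Global operator (hypothetical): sum the partial counts from all partitions."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def filter_frequent(counts, min_count):
    """Global operator (hypothetical): keep itemsets that reach the support threshold."""
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


def run_distributed(partitions, size, min_count):
    """Count where the data lives, then merge and filter the partial results."""
    # In a real Grid each call would be a task on the node holding the partition.
    partials = [count_itemsets(tx, size) for tx in partitions.values()]
    return filter_frequent(merge_counts(partials), min_count)


def run_centralized(partitions, size, min_count):
    """Baseline: gather all transactions in one place before counting."""
    all_transactions = [t for tx in partitions.values() for t in tx]
    return filter_frequent(count_itemsets(all_transactions, size), min_count)


# Toy data: transactions held at two hypothetical railway bureaus.
partitions = {
    "bureau_a": [["coal", "steel"], ["coal", "grain"], ["coal", "steel", "grain"]],
    "bureau_b": [["coal", "steel"], ["grain"]],
}

frequent_pairs = run_distributed(partitions, size=2, min_count=3)
print(frequent_pairs)  # {('coal', 'steel'): 3}
```

Because counting is additive across partitions, the per-node counts can be merged without moving the raw transactions, which is the intuition behind processing data where it is stored.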
Keywords/Search Tags: Grid, distributed data mining, data mining operator, execution process model, optimization, execution engine, interface specification