Design And Implementation Of PMML-Based Data Mining Platform And Research Of Storage Model

Posted on: 2008-09-23
Degree: Master
Type: Thesis
Country: China
Candidate: Q Q Wang
GTID: 2178360212997005
Subject: Computer software and theory
Abstract/Summary:
As society enters the information age and computer networks and computer technology are applied ever more widely, databases in every industry are accumulating large amounts of data. The question of how to make use of these data, and how to extract useful information and knowledge from them to guide the production, sales, and decision-making of enterprises, gave rise to a new, widely used, and highly practical computer technology: data mining.

Designing and developing an intelligent, multi-strategy, multi-standard data mining platform is a necessary trend in data mining. As data volumes keep growing, many successful data mining systems have appeared, for example SAS Enterprise Miner, SPSS Clementine, and IBM Intelligent Miner. Over the last decade, data mining systems have been applied successfully in many business and research fields. They have not only guided the management and development of enterprises and brought them large economic benefits, but have also contributed greatly to research on data mining technology. Such systems are a bridge between data mining research and application, and they are important for popularizing data mining technology.

PMML (Predictive Model Markup Language) is an XML-based language used to define predictive models. It provides a fast and easy way to share models between companies and between different data mining applications. To standardize DBIN Miner and to share data and result models, the data mining platform in this thesis integrates the PMML standard.

This thesis mainly designs and implements DBIN Miner, a data mining platform based on the PMML standard.
In the course of researching and implementing the storage management module of the platform, the thesis also proposes the concept of a data mining storage model.

First, the thesis introduces data mining technology and systems. It summarizes the background and development of data mining technology and systems, as well as the state of data mining systems in China and abroad. It also analyzes the problems that data mining systems face and their future development trends.

Second, the thesis studies data mining standards, focusing on the design of a data mining system architecture based on the PMML standard. It discusses the standards and their categories: process standards, model definition standards, web standards, standard APIs, and grid service standards. Driven by the urgent demand for data mining platforms, PMML was the first of these standards to be applied in data mining systems, and it has become the most popular model management standard.

Third, the thesis designs and implements DBIN Miner, a data mining platform based on the PMML standard. Guided by a multi-standard data mining platform architecture, the platform is divided into three functional parts: an upper-level GUI, a middle-level algorithm model management module, and a bottom-level storage management module. Following the CRISP-DM process model, the system implements part of the flow from business data analysis to result model deployment. By using the PMML standard, the thesis develops an extensible, multi-strategy data mining platform that conforms, to a certain degree, to industry standards.

Finally, the thesis studies the data mining storage model.
For the storage management module of the data mining platform, the thesis addresses the long average running time of algorithms and their frequent database access, as well as the storage and sharing of algorithm models. It proposes intermediate-result storage based on a cache mechanism and a PMML-based model storage pattern, and it studies a cache strategy suited to the k-means algorithm. This strategy can improve the performance of the data mining platform and the efficiency of the algorithm.

The thesis proposes, for the first time, a cache-based storage model applied to the k-means algorithm. First, a Least Access (LA) strategy is proposed for k-means. When page replacement occurs, the pages with the fewest modified records are replaced and written back to the database. This strategy reduces the number of database I/O operations and thereby avoids unnecessary time cost.

Building on the LA strategy, the thesis then proposes a File Cache (FC) strategy for k-means. Compared with LA, FC adds a second-level cache consisting of multiple files. When page replacement occurs, if the data needed by the cache exist in the files, they are read from the files and written directly into the cache; otherwise they are read from the database. This strategy also reduces the time spent accessing and updating the database.

The cache-based replacement strategies were tested on an ordinary PC. The thesis analyzes the average running time of k-means with no cache, with the LA strategy, and with the FC strategy. It also measures the algorithm running time and the database access and update time under LA and FC, as well as the scalability of k-means under all three strategies.

For the UCI Abalone dataset (600 KB), the average-running-time experiment shows that the running time of the algorithm without a cache remains at 1800 s.
Once the cache size exceeds 50 KB, the average running time under the FC strategy is below 195.4 s, while the LA strategy always takes longer than FC. FC is therefore an improvement on LA, and the experiment demonstrates that FC has better performance and stability than LA.

The experiment on k-means running time and database access and update time for the same dataset shows that the database access time of the LA strategy begins to decrease once the cache size reaches 70% of the dataset size. Both the access time and the update time of the FC strategy, however, are always lower than those of LA, with an average I/O time of 140 s. The algorithm running time shows a decreasing trend under both strategies. Because FC uses external files as a second-level cache, it reduces the number of database accesses made by the algorithm and thus better resolves the problems raised by data mining storage management.

The scalability experiment shows that k-means without a cache can only handle datasets smaller than 2M and the LA strategy datasets smaller than 6M, while the improved FC strategy scales better and can handle datasets larger than 10M.

Therefore, the LA and FC strategies alleviate, to a certain degree, the problem of k-means frequently accessing and updating the database, and FC provides a more stable and reliable cache strategy for k-means.

The storage model also proposes standards for storing and sharing results based on PMML. The thesis gives PMML storage patterns for association rules and for the k-means algorithm.
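As a rough illustration of what storing a k-means result as PMML can look like, the sketch below writes cluster centroids into a minimal PMML `ClusteringModel` and reads them back. It is not the storage pattern defined in the thesis. The function names, the field names, the sample centroids, and the choice of PMML version 4.4 (with its namespace) are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: PMML 4.4 namespace; the thesis does not state which version it used.
PMML_NS = "http://www.dmg.org/PMML-4_4"


def kmeans_to_pmml(field_names, centroids, model_name="kmeans_demo"):
    """Serialize k-means centroids as a minimal PMML ClusteringModel string.

    Hypothetical helper for illustration only.
    """
    root = ET.Element("PMML", {"version": "4.4", "xmlns": PMML_NS})
    ET.SubElement(root, "Header", {"description": "k-means result model"})

    # Data dictionary: one continuous field per clustering dimension.
    dictionary = ET.SubElement(
        root, "DataDictionary", {"numberOfFields": str(len(field_names))}
    )
    for name in field_names:
        ET.SubElement(
            dictionary,
            "DataField",
            {"name": name, "optype": "continuous", "dataType": "double"},
        )

    model = ET.SubElement(
        root,
        "ClusteringModel",
        {
            "modelName": model_name,
            "functionName": "clustering",
            "modelClass": "centerBased",
            "numberOfClusters": str(len(centroids)),
        },
    )
    schema = ET.SubElement(model, "MiningSchema")
    for name in field_names:
        ET.SubElement(schema, "MiningField", {"name": name})

    # Distance measure used to assign records to the nearest centroid.
    measure = ET.SubElement(model, "ComparisonMeasure", {"kind": "distance"})
    ET.SubElement(measure, "squaredEuclidean")
    for name in field_names:
        ET.SubElement(model, "ClusteringField", {"field": name})

    # One Cluster element per centroid; coordinates go into a real-valued Array.
    for index, centroid in enumerate(centroids):
        cluster = ET.SubElement(model, "Cluster", {"name": f"cluster_{index}"})
        array = ET.SubElement(
            cluster, "Array", {"n": str(len(centroid)), "type": "real"}
        )
        array.text = " ".join(repr(float(value)) for value in centroid)

    return ET.tostring(root, encoding="unicode")


def centroids_from_pmml(pmml_text):
    """Read the centroids back from a document produced by kmeans_to_pmml."""
    root = ET.fromstring(pmml_text)
    arrays = root.findall(".//p:Cluster/p:Array", {"p": PMML_NS})
    return [[float(token) for token in array.text.split()] for array in arrays]
```

Because the model is plain XML, any other tool that understands the same PMML version could in principle load the stored centroids, which is the sharing benefit the abstract attributes to PMML.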
Many studies indicate that PMML-based model management is the most popular and most standardized approach.

In conclusion, this thesis designs and implements a data mining platform based on the PMML standard, proposes the concept of a data mining storage model, and studies both intermediate-result storage based on a cache mechanism and a PMML-based storage pattern for algorithm results. The model is expected to be applicable to more algorithm models.
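The two cache strategies summarized in the abstract can be sketched roughly as follows. This is a simplified model under stated assumptions, not the implementation evaluated in the thesis. The database and the file level are plain dictionaries standing in for a real database and real spill files. The victim page is the one with the fewest recorded modifications. Only modified pages are written back. The class and function names are hypothetical.

```python
class TwoLevelPageCache:
    """Toy page cache: LA-style eviction, optional FC-style file level."""

    def __init__(self, database, capacity, use_file_level=True):
        self.database = database          # dict: page_id -> list of records
        self.capacity = capacity          # max number of in-memory pages
        self.use_file_level = use_file_level
        self.pages = {}                   # in-memory pages
        self.modified = {}                # page_id -> count of modifications
        self.file_level = {}              # stand-in for second-level files
        self.db_reads = 0
        self.db_writes = 0

    def _evict(self):
        # LA-style choice: replace the page with the fewest modifications.
        victim = min(self.pages, key=lambda pid: self.modified[pid])
        data = self.pages.pop(victim)
        dirty = self.modified.pop(victim)
        if self.use_file_level:
            # FC-style: spill to the file level instead of the database.
            self.file_level[victim] = (data, dirty)
        elif dirty:
            # LA only: write the modified page back to the database.
            self.database[victim] = data
            self.db_writes += 1

    def _load(self, page_id):
        if page_id in self.pages:
            return
        if len(self.pages) >= self.capacity:
            self._evict()
        if page_id in self.file_level:
            # Second-level hit: no database access needed.
            data, dirty = self.file_level.pop(page_id)
        else:
            data, dirty = list(self.database[page_id]), 0
            self.db_reads += 1
        self.pages[page_id] = data
        self.modified[page_id] = dirty

    def read(self, page_id, index):
        self._load(page_id)
        return self.pages[page_id][index]

    def write(self, page_id, index, value):
        self._load(page_id)
        self.pages[page_id][index] = value
        self.modified[page_id] += 1

    def flush(self):
        # Write every modified page (memory and file level) to the database.
        for pid, data in self.pages.items():
            if self.modified[pid]:
                self.database[pid] = data
                self.db_writes += 1
        for pid, (data, dirty) in self.file_level.items():
            if dirty:
                self.database[pid] = data
                self.db_writes += 1
        self.pages, self.modified, self.file_level = {}, {}, {}


def run_trace(use_file_level):
    """Replay a tiny access trace; return (db_reads, db_writes) before flush."""
    database = {0: ["x", "x"], 1: ["y", "y"], 2: ["z", "z"]}
    cache = TwoLevelPageCache(database, capacity=2, use_file_level=use_file_level)
    cache.write(0, 0, "a")
    cache.write(1, 0, "b")
    cache.write(1, 1, "c")
    cache.read(2, 0)      # forces eviction of page 0 (fewest modifications)
    cache.read(0, 0)      # page 0 is needed again
    return cache.db_reads, cache.db_writes
```

On this small trace, `run_trace(True)` performs 3 database reads and no writes before the final flush, while `run_trace(False)` performs 4 reads and 1 write. The second-level cache absorbs both the write-back and the re-read of page 0. This mirrors, in miniature, why the abstract reports lower database access and update time for FC than for LA.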
Keywords/Search Tags: Implementation