
Research On Data Mining In The Scientific Data Grid

Posted on: 2007-05-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Tong
GTID: 1118360185954186
Subject: Computer system architecture

Abstract/Summary:
With the emergence and development of grid computing, it has become possible to share data and collaborate at large scale across organizations and regions. Modern scientific problems are becoming more and more complex, which has given rise to a new model of scientific collaboration and to large science projects, i.e., the informatization of scientific research (e-Science). To share resources and results, and to collaborate on large-scale modern scientific research, it is necessary to establish allied virtual research groups over the Internet based on grid computing. Using data mining technologies, this dissertation aims to improve the service level of the Scientific Data Grid and the Scientific Database, building on their existing large-scale data storage and powerful computing capabilities. The main research contents and contributions are as follows.

(1) Based on a detailed analysis of the data mining properties of the Scientific Data Grid, a scientific data mining system is proposed. The system consists of three main components: the Scientific Data Mining Architecture (SDMA), the Scientific Data Mining Toolkit (SDMK), and the Scientific Data Mining Service (SDMS). SDMA describes the multi-dimensional model architecture of data mining applications; SDMK provides a large number of data preprocessing and data mining algorithms; SDMS presents a data mining scheme that addresses problems in the grid environment in the form of a grid service. Compared with traditional data mining systems, the proposed system is better suited to the environment of the Scientific Data Grid and the Scientific Database. It has already been applied in several real database applications. Beyond simple query and search functions, the proposed system can also perform more advanced functions such as data statistics, data analysis, and knowledge discovery. As a result, the service level of the database is improved.

(2) Clustering in data mining is a discovery process that groups a set of data such that intra-cluster similarity is maximized and inter-cluster similarity is minimized. The discovered clusters can be used to explain the characteristics of the data distribution. Existing mining algorithms based on Boolean association rules cannot deal with quantitative and categorical data directly. To address this problem, a novel association rule mining algorithm is proposed. The algorithm first divides all transactions of a database into different clusters, and then projects these clusters onto the domains of the quantitative attributes to form meaningful intervals, which may overlap. Experimental results show that the proposed approach not only finds quantitative association rules efficiently, but also finds important association rules that previous algorithms may miss.

(3) User identification and session identification in traditional user access pattern mining systems are complex and inaccurate. To address this, the dissertation proposes a filter-based user access pattern mining system, which identifies users and sessions accurately and provides high-quality data for the mining algorithms. It also describes the implementation and deployment of the log filter, and proposes a web access pattern mining algorithm. The proposed system is now in use in the Scientific Database, where it outperforms the previous systems.
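The cluster-then-project idea in contribution (2) can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm is used or how intervals are derived from a cluster, so everything below is an assumption made for illustration: a tiny deterministic k-means stands in for the clustering step, each cluster is projected onto an attribute as a plain [min, max] range, and the function names (`kmeans`, `project_intervals`) and the sample transactions are hypothetical.

```python
from collections import defaultdict


def kmeans(points, k, iters=20):
    """Tiny deterministic k-means (an assumed stand-in for the clustering step).

    The first k points seed the centroids, so results are reproducible.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def project_intervals(points, labels, attr_names):
    """Project each cluster onto every attribute axis as a (min, max) interval."""
    by_cluster = defaultdict(list)
    for p, lab in zip(points, labels):
        by_cluster[lab].append(p)
    return {
        c: {name: (min(col), max(col)) for name, col in zip(attr_names, zip(*members))}
        for c, members in by_cluster.items()
    }


# Hypothetical transactions with two quantitative attributes: (age, income in thousands).
transactions = [(23, 30), (30, 32), (27, 90), (45, 85), (48, 95), (50, 88)]
labels = kmeans(transactions, k=2)
intervals = project_intervals(transactions, labels, ["age", "income"])

for cluster_id in sorted(intervals):
    print(cluster_id, intervals[cluster_id])
```

On this sample data the two clusters project onto age as 23 to 30 and 27 to 50, so the age intervals overlap even though the clusters are well separated on income. That is the kind of overlapping interval the abstract mentions. A rule miner would then treat each interval as an item (for example "age in [27, 50]") and apply an ordinary Boolean association rule algorithm.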
Keywords/Search Tags: scientific data grid, scientific data mining system, grid service, multi-dimension model, data preprocessing, quantitative association rules, clustering, classifying, sequence pattern, filter, access pattern