Research On Aggregation For Complex Query Based On Data Cube

Posted on:2008-03-07

Degree:Master

Type:Thesis

Country:China

Candidate:R F Wang

Full Text:PDF

GTID:2178360215983330

Subject:Computer software and theory

Abstract/Summary:

Data mining is a new multidisciplinary field that emerged in the late 1980s and has attracted a great deal of attention in the information industry in recent years. The term vividly describes the process of finding a small set of precious nuggets in a large amount of raw material, such as databases, data warehouses, and other kinds of information repositories. It is also commonly treated as a synonym for Knowledge Discovery in Databases (KDD). One of the major features of data mining is the massive storage of the data objects used in mining tasks, as in data warehouses and very large transaction databases. It is well known that the multidimensional attributes in a database characterize its data objects. We categorize data mining tasks into two classes, single-dimensional and multi-dimensional data mining, according to the number of attributes used by the mining technique. The former analyzes data objects along only one dimension of their tuples; traditional association rule mining, for example, uses only the product ID to find the products purchased together by one customer. The latter, by contrast, uses multiple dimensional attributes in order to mine more interesting knowledge and information from the database. Multi-dimensional data mining is therefore a development of data mining, and many data analysis tools and support systems are based on it, such as On-line Analytical Processing (OLAP), On-line Analytical Mining (OLAM), and multi-dimensional data mining (MDDM).
Moreover, more and more data mining tasks are being extended from single-dimensional to multi-dimensional data, including multidimensional association rules, multidimensional clustering, and multidimensional outlier detection. The data cube, the model for multidimensional data, has gained increasing attention among multi-dimensional data techniques and has become one of the most important research focuses, because it not only models the data but also facilitates aggregation across many sets of dimensions in data cube queries. The core of the data cube technique is the efficient computation of aggregations at multiple granularities, which are referred to as group-bys in SQL terms. Queries based on the data cube are the main functions of support systems, which analyze data across different dimensions and hierarchies. To our knowledge, much research has been done on aggregation for data cube queries, and many algorithms have been produced accordingly, such as the array-based cube algorithm and the BUC algorithm. In our study, however, we find that most of these methods focus on simple queries based on the data cube (simple cube queries for short), which contain only one sub-query, and rarely address complex queries based on the data cube (complex cube queries for short), which involve multiple sub-query tasks. Because information is a decisive weapon in competition, users increasingly turn to complex cube queries in order to retrieve more information. Research on and implementation of complex cube queries are therefore the next step in the development of the data cube. Unfortunately, there has been little research on complex queries so far. In [21], an extension of standard SQL is proposed to describe complex cube queries. The only algorithm we can find for distributive and algebraic complex cube queries is introduced in [22]; it is derived from the algorithm for distributive or algebraic simple cube queries.
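
To make the idea of aggregating over every group-by concrete, here is a minimal Python sketch that computes all cuboids of a tiny cube with SUM as the measure. The fact table, the dimension names, and the function `compute_cube` are hypothetical illustrations; this is a naive full scan per group-by, not the array-based or BUC algorithm mentioned above.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (product, region, year, sales)
ROWS = [
    ("pen", "east", 2006, 10),
    ("pen", "west", 2006, 5),
    ("ink", "east", 2007, 7),
    ("pen", "east", 2007, 3),
]
DIMENSIONS = ("product", "region", "year")


def compute_cube(rows, dimensions):
    """Return {group-by dimensions: {group key: SUM(sales)}} for every subset of dimensions."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for idx in combinations(range(len(dimensions)), k):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in idx)  # project the row onto the chosen dimensions
                cuboid[key] += row[-1]            # the last column is the measure
            cube[tuple(dimensions[i] for i in idx)] = dict(cuboid)
    return cube


cube = compute_cube(ROWS, DIMENSIONS)
print(len(cube))           # 8 cuboids, one per subset of 3 dimensions
print(cube[()])            # {(): 25}, the grand total
print(cube[("product",)])  # {('pen',): 18, ('ink',): 7}
```

Each of the 2^n group-bys rescans the raw rows here, which is exactly the cost that dedicated cube algorithms try to avoid by sharing work between cuboids.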
So far, there are no efficient techniques for holistic complex cube queries, let alone aggregation techniques tailored to the specific characteristics of such queries. Furthermore, the algorithm described in the literature is not efficient enough, because it does not consider the differing features of the various kinds of complex cube queries. Moreover, most research focuses on aggregation over all granularities, and little addresses aggregation over only part of them. Although aggregating over all granularities has the advantage of exposing every hierarchy of the data, it also has disadvantages, such as being time-consuming and leaving users no alternative. Part-granularity aggregation, by contrast, takes the user's choices into account and is becoming the trend in data cube development. In addition, data mining over multiple databases has become more and more popular in recent years; in this setting, local knowledge can be mined from distributed databases and fully analyzed afterwards. In line with this development in data mining and database techniques, a new idea is put forward: aggregation over multiple data cubes for multi-database mining. In this thesis, we focus on aggregation methods for complex cube queries, especially holistic ones, based on a thorough analysis of the dependent aggregate characteristics of holistic complex cube queries. We also present frameworks for the further development of part-granularity aggregation and multi-cube aggregation.
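
The difficulty with holistic measures can be illustrated with a short Python sketch. It assumes the standard classification in which SUM is distributive and MEDIAN is holistic; the data and the helper `group` are hypothetical. A coarser SUM can be rolled up from finer cells, while a coarser MEDIAN cannot be derived from the medians of finer cells.

```python
from collections import defaultdict
from statistics import median

# Hypothetical fact rows: (product, region, sales)
ROWS = [
    ("pen", "east", 1),
    ("pen", "east", 2),
    ("pen", "east", 9),
    ("pen", "west", 4),
    ("pen", "west", 100),
]


def group(rows, key_fn):
    """Collect the sales values of each group into a list."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row[2])
    return groups


fine = group(ROWS, lambda r: (r[0], r[1]))  # cuboid (product, region)
coarse = group(ROWS, lambda r: r[0])        # cuboid (product)

# SUM is distributive: rolling up the finer cells gives the exact coarse value.
sum_from_cells = sum(sum(values) for values in fine.values())
sum_from_rows = sum(coarse["pen"])
print(sum_from_cells, sum_from_rows)  # 116 116

# MEDIAN is holistic: the median of cell medians is not the median of the rows.
median_of_cell_medians = median(median(values) for values in fine.values())
median_from_rows = median(coarse["pen"])
print(median_of_cell_medians, median_from_rows)  # 27.0 4
```

This is why a holistic cuboid generally has to go back to the underlying values rather than reuse already aggregated cells.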
Both are studied in the same setting of complex queries.
The main contents of this thesis are as follows:
1) An efficient algorithm, PDIC, is presented to compute aggregations at multiple granularities for holistic complex cube queries, based on careful study of different examples. The algorithm rests on three strategies for computing holistic complex cube queries: the part-distributive aggregate property (PDAP), iceberg-query techniques, and caching with overlapping-reuse techniques. Extensive experiments on synthetic and real data sets are conducted to evaluate PDIC, and the results show that the method is promising and efficient.
2) Three methods, all-caching, part-caching, and anti-caching, are proposed to optimize the computation of the three types of complex cube queries on the basis of all-overlapping reuse, part-overlapping reuse, and anti-overlapping reuse. Their main advantage is that they can minimize memory usage and improve the efficiency of complex cube queries through cache reuse. Furthermore, they fit all three types of complex cube queries.
3) New schemes for the further development of complex cube queries are proposed, covering part-granularity aggregation and aggregation over multiple data cubes. Instead of observing all granularities in the cube, many users may be interested in only some of them and pay attention to the information hidden in certain cuboids. Compared with traditional all-granularity aggregation, part-granularity aggregation can shorten computation time, reduce the waiting cost of queries, and still satisfy users' requests. In this thesis, we present a method and a tentative plan in order to explore a feasible approach to part-granularity aggregation. Further research and experiments will be carried out in the future.
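
Two of the ideas named above, iceberg-query filtering and aggregation over only the granularities a user asks for, can be sketched in a few lines of Python. This is not the PDIC algorithm, whose details the abstract does not give. The function `iceberg_cuboids`, the data, and the threshold are hypothetical, and the filter is applied after counting rather than used for pruning during computation.

```python
from collections import Counter

# Hypothetical fact rows: (product, region, year)
ROWS = [
    ("pen", "east", 2006),
    ("pen", "east", 2007),
    ("pen", "west", 2006),
    ("ink", "east", 2006),
]
DIMENSIONS = ("product", "region", "year")


def iceberg_cuboids(rows, dimensions, wanted, min_count):
    """Compute COUNT(*) for the requested group-bys only, keeping cells with count >= min_count."""
    result = {}
    for dims in wanted:
        idx = [dimensions.index(d) for d in dims]
        counts = Counter(tuple(row[i] for i in idx) for row in rows)
        # Iceberg condition: drop cells below the support threshold.
        result[dims] = {key: c for key, c in counts.items() if c >= min_count}
    return result


selected = iceberg_cuboids(
    ROWS,
    DIMENSIONS,
    wanted=[("product",), ("product", "region")],  # 2 of the 8 possible cuboids
    min_count=2,
)
print(selected[("product",)])           # {('pen',): 3}
print(selected[("product", "region")])  # {('pen', 'east'): 2}
```

Only the two requested cuboids are materialized, and cells below the threshold never reach the result, which is the saving that part-granularity aggregation and iceberg conditions aim for.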
In the scheme for multi-cube aggregation, parallel aggregation is proposed to match the distributed storage of the data sets. Practical methods and implementations will also be realized in the future.

Keywords/Search Tags:

multi-dimensional mining, data cube, complex query, granular computation

Related items

1	Massive Data Aggregation And Parallel Implementation With Complex Constraints
2	Research On Key Methods Of Efficient Multi-dimensional Online Analytical Processing Query
3	WSN Multidimensional Complex Query Processing Analysis And Verification
4	Research On Multi-dimensional Association Rules Mining In Distributed Environments Based On Advanced Sql Query
5	Research On Methods Of Data Mining Based On Granular Computing
6	Efficient Data-Cube Computation And Application In OLAP MINING
7	Research On The Efficient Materialization And Fast Query Of Condensed Data Cube
8	Associations Mining Research Based On Granular Computing
9	Research On Multi-Dimension Query Analysis Algorithm
10	Research On Multi-dimensional Association Rules Mining