Research On Data Mining Algorithms For Scientific Computational Time Varying Data Sets

Posted on: 2010-06-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Q Wu
Full Text: PDF
GTID: 1118360278476503
Subject: Computational Mathematics
Abstract/Summary:
The development of parallel computers enables high-resolution numerical simulations that output very large time-varying data set streams. A data set stream consists of a series of subsets, each holding the discrete numerical solution of a physical problem at a particular time point. The total volume may reach hundreds of gigabytes (GB) or even terabytes (TB). With data sets of this scale, how to carry out physical analysis quickly, turn data into knowledge, discover new physical phenomena, reveal new physical laws, and explore new physical mechanisms has become a fundamental problem in scientific computing research.

Alongside traditional visualization analysis, data mining algorithms can improve the efficiency of data analysis. They can quickly identify important physical moments and local regions of interest and find correlations between physical variables, and they are becoming a key supporting technology for data analysis. However, existing commercial data mining algorithms are not well suited to data analysis in numerical computing, since they usually depend on association rules among database attributes. New data mining algorithms for time-varying scientific data sets are therefore needed.

In scientific data analysis, data mining can achieve at least three goals: first, comparing the similarity of any two adjacent subsets; second, identifying the sub-regions and time steps that may contain a wealth of knowledge; third, determining the relation between any two physical variables. These three aspects matter for physical data analysis because they allow rapid identification of hidden time steps or sub-regions that contain important physical characteristics, and they reveal linear or nonlinear relations between physical variables. At the same time, data analysis becomes faster, more efficient, and less difficult.

In information theory, the quantity of information is measured by Shannon entropy. Shannon entropy does not depend on properties such as dimension, location, or unit of measurement, and it can quantify the intrinsic characteristics of data. It can therefore describe the information contained in scientific data sets, which is essential for identifying potentially useful moments or sub-regions. Information theory can thus serve as a basis for data mining.

Based on information theory, this dissertation addresses the three goals of data mining in scientific data analysis and studies the reduction of time-varying data set streams, change detection, and nonlinear relation detection. The main contributions are summarized as follows:

(1) The feasibility of applying information measures to data mining for scientific computing data sets is analyzed. A new algorithm for constructing non-uniform histograms is proposed, which generates the probability distribution of a scientific data set by iteration.

(2) A data reduction algorithm for time-varying data sets is proposed. The algorithm uses an information-based measure to quantify the relevance between subsets and stores only the weakly related subsets, which embed important physical characteristics. Applied to data from a laser plasma interaction simulation, the algorithm gives satisfactory results.

(3) A change detection data mining algorithm for time-varying data sets is proposed, based on the concept of interaction information distance. The algorithm can identify important time steps or sub-regions and reduce the data analysis or visualization workload. Applied to a Gaussian sequence and to laser plasma interaction simulation data, it gives satisfactory results.

(4) A nonlinear relation detection algorithm is proposed based on information redundancy. The amplitude adjusted Fourier transform (AAFT) is used to generate surrogate data, and information redundancy serves as the test statistic. Experiments on several common time series validate its effectiveness in mining the nonlinearity of one or two physical variables.
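
The information measures behind contributions (1) to (3) can be illustrated with a minimal Python sketch. It is not the dissertation's algorithm: it assumes NumPy, uses plain uniform histograms in place of the iteratively built non-uniform histogram, and uses ordinary mutual information as a stand-in for the relevance measure between adjacent subsets. The function names, the bin count, and the synthetic Gaussian fields are all hypothetical choices made for illustration.

```python
import numpy as np


def shannon_entropy(values, bins=16):
    """Shannon entropy in bits, estimated from a uniform histogram.

    Assumption: uniform bins; the dissertation builds a non-uniform histogram.
    """
    counts, _ = np.histogram(np.ravel(values), bins=bins)
    p = counts[counts > 0] / counts.sum()  # drop empty bins, normalise
    return float(-np.sum(p * np.log2(p)))


def mutual_information(a, b, bins=16):
    """Mutual information in bits between two fields on the same grid."""
    joint, _, _ = np.histogram2d(np.ravel(a), np.ravel(b), bins=bins)
    pxy = joint / joint.sum()              # joint probability table
    px = pxy.sum(axis=1, keepdims=True)    # marginal of a, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of b, shape (1, bins)
    nz = pxy > 0                           # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))


# Synthetic "time steps" (hypothetical data, not from the dissertation).
rng = np.random.default_rng(0)
step0 = rng.normal(size=(64, 64))
step1 = step0 + 0.05 * rng.normal(size=(64, 64))  # almost unchanged field
step2 = rng.normal(size=(64, 64))                 # unrelated field

mi_similar = mutual_information(step0, step1)
mi_changed = mutual_information(step0, step2)
print("MI(step0, step1) =", mi_similar)
print("MI(step0, step2) =", mi_changed)
```

Under these assumptions, a pair of adjacent subsets with low mutual information is a candidate for a time step where the field changed substantially, which is the kind of subset a reduction or change detection scheme would want to keep.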
Keywords/Search Tags: Scientific computing, data mining, information theory