
Research On The Key Problems Of Canonical Correlation Analysis For Multidimensional Data Streams

Posted on: 2015-08-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W P Li
GTID: 1318330518472861
Subject: Computer application technology
Abstract/Summary:
Data streams are a special kind of data generated in many application fields, such as sensor monitoring, moving-object tracking, network log analysis, and stock trading. In a data stream environment, data arrive continuously and quickly, and it is not possible to store every record. A single-pass algorithm is therefore a basic requirement for many mining tasks, which makes data stream research challenging. In recent years, data streams have attracted increasing attention because of both their wide use and the research challenges they pose, and they have become a hotspot in the field of data mining. Previous studies have shown that data streams are usually correlated with each other and that low-dimensional structures are usually hidden in the observations. As a powerful multivariate statistical method, canonical correlation analysis (CCA) can not only detect the correlation between two sets of data but also extract their low-dimensional features. Research on CCA for multidimensional data streams (MDS) is one of the hottest and most advanced topics in the data stream field. Early studies of CCA for MDS produced numerous meaningful results based on different techniques, such as low-rank approximation, unequal probability sampling, singular value decomposition (SVD), and graphics processing units (GPUs). Although these achievements greatly promoted the development and application of CCA for MDS, they cannot meet the needs of some emerging fields, for instance quickly tracking the correlation among MDS in a real-time environment, extracting the low-dimensional features of MDS in the context of a dynamic data field, quickly solving CCA for MDS in big data scenarios, and applying CCA to privacy preservation for MDS. Therefore, extending the models and applications of CCA for MDS is not only a theoretical question in academic study but also a data mining problem in existing practical applications.

In this thesis, we mainly study the following aspects.

Firstly, traditional CCA methods cannot satisfy practical requirements because they are inefficient: they cannot maintain previous states or be continuously updated, which greatly hinders tracking the correlation and extracting the low-dimensional features in a real-time environment. To address this problem, we propose a novel CCA algorithm that rapidly tracks the correlations and low-dimensional features of MDS. The proposed algorithm uses the continuous-updating and parallel-solving capabilities of rank-two modifications to calculate the feature subspaces of the covariance matrix, and it implements fast real-time tracking of the correlation of MDS. Furthermore, it maintains the states of previous steps and has a low complexity that is independent of the size of the problem. Simulation results show that the proposed algorithm achieves better reliability, higher computational efficiency, and higher accuracy.

Secondly, traditional CCA methods ignore the effect of the data field when extracting low-dimensional features from MDS. As a result, they fail to reveal some peculiar properties of the low-dimensional features that arise from the interaction of data in the context of a data field. To address this problem, we propose a novel approach to solving CCA in the context of a dynamic data field, based on the Enzymatic Numerical P System (ENPS). The proposed approach takes the action of the data field into account when formulating and deriving the equation of the new CCA model. The features extracted by the new model have good distribution characteristics, which give the model a better capacity for identifying class boundaries. Moreover, in order to handle the data stream quickly, we further propose a novel ENPS that improves the performance of computing the potential of the data field. The novel ENPS is revised from the classical ENPS by introducing the character variables and evolutionary programs of the Transition P System, which, like the ENPS, is a recent finding in the field of nature-inspired computation. Compared with the classical ENPS, the revised one offers higher controllability of processes. Its advantage stems from its maximum parallelism, by which the potential of the data field is calculated in only three steps, and the time of each step does not depend on the data size. This advantage sharply raises the efficiency of the new CCA model that takes the data field into account.

Thirdly, traditional CCA methods can no longer meet the demands of big data, which is characterized by petabyte-scale size and by sparse values. Without doubt, a data stream is a typical form of big data. To solve CCA for MDS effectively in a big data environment, we propose a novel CCA approach that introduces cloud theory. First, we propose a distributed architecture based on cloud computing as a basic platform to store and process the big data. Then, we generate clouds in parallel on the distributed architecture with a multidimensional backward cloud generator (MBCG); here a cloud is a synopsis of data, a concept that comes from cloud theory. All clouds are transferred to a central node and combined into a center cloud by a cloud combination operation. A type of virtual data sample, called cloud drops, is created from the center cloud. Finally, CCA is computed on the cloud drops. Because the cloud drops are smaller than the original big data, the execution efficiency of CCA is improved significantly. Two issues in this process deserve mention. On the one hand, in order to improve the efficiency of cloud generation, we propose a heuristic cloud generation strategy to improve the MBCG. The two key parts of the strategy are the incremental updating of clouds and the diversity measure of clouds. For the former, we derive an expression for updating the clouds; for the latter, we propose two types of diversity measures, based on the chordal metric and on subspaces respectively. On the other hand, to overcome the disadvantage of the traditional cloud combination operation, which can only combine one pair of clouds at a time, we propose a cloud combining approach that combines multiple clouds at once. Experimental results show that the proposed CCA approach trades system resources for a certain level of accuracy and a faster processing speed, and that the sparse-values characteristic can be revealed by the correlations.

Fourthly, CCA has been used in many new frontiers throughout its long history, but few reports on its application to privacy preservation for data streams have been published. Existing methods of personal privacy preservation fail to consider the implicit relationships among trajectories with different privacy demands, an omission that may reduce the quality of the trajectories. To satisfy the personalized privacy-preserving demands of different people, we propose a CCA-based algorithm for personalized trajectory privacy preservation. In the proposed algorithm, the trajectories that data producers consider insensitive are published directly, and privacy protection is imposed only on the trajectories that data producers consider sensitive. To this end, a latent variable is first obtained from both the insensitive and the sensitive trajectories. Based on this latent variable, many trajectories are generated to replace the sensitive trajectories. The advantage of this algorithm is that it not only respects the privacy-preserving wishes of data producers but also yields high-quality trajectories.
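
For readers unfamiliar with the traditional (batch) CCA that the abstract contrasts its methods with, the following is a minimal sketch of the textbook formulation: whiten each data set and take the SVD of the whitened cross-covariance. It is not the thesis's streaming, ENPS-based, or cloud-based algorithm. The function name `batch_cca`, the parameter `k`, and the small ridge term `reg` are assumptions introduced here for illustration only.

```python
import numpy as np


def batch_cca(X, Y, k, reg=1e-8):
    """Textbook batch CCA (illustrative sketch, not the thesis's method).

    X: (n, p) array, Y: (n, q) array, rows are paired observations.
    Returns projection matrices A (p, k), B (q, k) and the top-k
    canonical correlations.
    """
    # Center both views.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Sample covariance blocks; `reg` is a hypothetical ridge term
    # added only to keep the inverse square roots well defined.
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)

    # Singular values of the whitened cross-covariance are the
    # canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    A = Kx @ U[:, :k]
    B = Ky @ Vt.T[:, :k]
    return A, B, s[:k]
```

This batch form recomputes every covariance matrix and decomposition from all stored samples, which is the cost that single-pass, incrementally updated approaches aim to avoid.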
Keywords/Search Tags: Data stream, Canonical correlation analysis (CCA), Big data, Privacy preservation, Membrane computing, Cloud theory