Research On Data Stream Dimensionality Reduction Algorithm

Posted on: 2017-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: Y Shan
Full Text: PDF
GTID: 2308330488497131
Subject: Computer software and theory
Abstract/Summary:
With the development of modern information technology, many applications place greater demands on real-time data processing. Meanwhile, the data stream, a newer and more realistic data model, has been widely adopted in many application areas. Stream data is usually high-dimensional and contains many redundant features, which greatly reduces the efficiency of machine learning and data mining algorithms. Dimensionality reduction, an important means of processing high-dimensional data, can effectively eliminate redundant features and thereby improve the efficiency and performance of mining algorithms. However, traditional dimensionality reduction algorithms cannot meet the demands of real-time processing, so finding an effective real-time dimensionality reduction algorithm has become a hot research topic. Based on the characteristics of data streams, this thesis proposes two dimensionality reduction algorithms, one linear and one nonlinear, to meet different requirements.

On the one hand, this thesis studies the classical dimensionality reduction algorithm PCA in depth and analyses its shortcomings: limited processing efficiency and support for only a single data type. Then, taking into account the real-time and unbounded nature of data streams, it proposes a data stream dimensionality reduction algorithm based on principal component analysis (SPCA). The algorithm adapts to dynamic changes in the speed of the data stream. It not only eliminates redundancy in the stream efficiently but can also process data with mixed attributes.

On the other hand, building on SPCA, this thesis further improves the correlation coefficient matrix formula of principal component analysis and then parallelizes both the improved formula and the linear projection in a distributed fashion. The result is a new data stream dimensionality reduction algorithm, DPSPCA (Distributed Parallel SPCA), which can easily be used in distributed or parallel environments. Experiments on DPSPCA are conducted on the distributed stream processing platform Storm, and the results show that DPSPCA effectively increases processing speed.

Last but not least, to handle the diversity of data streams and to overcome the limitation of SPCA (it can only process linear data), this thesis proposes a data stream dimensionality reduction algorithm based on kernel principal component analysis (SKPCA). Similar to SPCA, SKPCA adapts the kernel principal component analysis algorithm KPCA to the demands of data streams: it sets a threshold to determine the speed of the dynamic data stream and uses different methods to calculate the kernel matrix accordingly.

In summary, this thesis has both theoretical and practical value for dimensionality reduction algorithms on data streams. The proposed algorithms not only reduce the dimensionality of data attributes effectively but also lower space requirements and increase processing efficiency. With the use of a stream processing platform, the efficiency of these algorithms can be increased further.
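The abstract does not give the improved correlation coefficient matrix formula or the SPCA/DPSPCA procedures themselves, so the following is only a minimal Python sketch of the general idea under stated assumptions: a correlation matrix can be maintained from running sums that are updated batch by batch and merged across workers, after which PCA projects data onto the leading eigenvectors. The class name `RunningCorrelation` and all of its methods are hypothetical illustrations, not the thesis's algorithm, and the sketch assumes purely numeric features with non-zero variance.

```python
import numpy as np


class RunningCorrelation:
    """Mergeable running sums for a correlation matrix (hypothetical sketch)."""

    def __init__(self, dim):
        self.n = 0                      # number of samples seen so far
        self.s = np.zeros(dim)          # per-feature sum
        self.ss = np.zeros((dim, dim))  # sum of outer products x x^T

    def update(self, batch):
        """Absorb a batch of stream samples, shape (m, dim)."""
        batch = np.atleast_2d(np.asarray(batch, dtype=float))
        self.n += batch.shape[0]
        self.s += batch.sum(axis=0)
        self.ss += batch.T @ batch

    def merge(self, other):
        """Combine statistics computed independently on another worker."""
        self.n += other.n
        self.s += other.s
        self.ss += other.ss

    def _mean_std(self):
        mean = self.s / self.n
        var = np.diag(self.ss) / self.n - mean ** 2
        return mean, np.sqrt(var)  # assumes no constant feature

    def correlation(self):
        """Correlation coefficient matrix from the running sums."""
        mean, std = self._mean_std()
        cov = self.ss / self.n - np.outer(mean, mean)
        return cov / np.outer(std, std)

    def components(self, k):
        """Top-k eigenvectors of the correlation matrix, as columns."""
        eigvals, eigvecs = np.linalg.eigh(self.correlation())
        order = np.argsort(eigvals)[::-1][:k]
        return eigvecs[:, order]

    def project(self, batch, k):
        """Standardize a batch and project it onto k principal components."""
        mean, std = self._mean_std()
        z = (np.atleast_2d(np.asarray(batch, dtype=float)) - mean) / std
        return z @ self.components(k)
```

Because the running sums simply add, each worker in a distributed setting could call `update` on its own partition of the stream and a coordinator could call `merge` before extracting components. Raw sums of products can lose precision when feature means are large relative to their spread, so a production version would likely need a more numerically stable update.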
Keywords/Search Tags: Data Stream, Dimensionality Reduction, Linear Principal Component Analysis, Nonlinear Kernel Principal Component Analysis, Storm Platform