Research On Dimensionality Reduction And Clustering Algorithm Of Commercial Data Streams

Posted on: 2012-06-18
Degree: Master
Type: Thesis
Country: China
Candidate: Z Mei
GTID: 2189330332483136
Subject: Management Science and Engineering
Abstract/Summary:
Since the late 20th century, data streams have been widely used in business as a new and more realistic data model. Data streams are massive, unbounded, fast-changing, and subject to concept drift; they demand rapid response, and random access to them is costly. In addition, they contain valuable enterprise information, such as operating patterns, management requirements, influencing factors, and variation trends, and they better reflect dynamic changes in business operations, service content, and service targets. At the same time, these unbounded and variable data streams pose challenges to computer storage space, computing speed, and communication capacity. Data mining has achieved a great deal on static data sets, but extending it to dynamic data streams, especially commercial ones, remains a great challenge.
In a dynamic data stream environment, rapid data growth and higher dimensionality sharply degrade the performance of existing algorithms designed for small, low-dimensional data, and similarity measures defined for low-dimensional spaces no longer hold. This thesis uses the sliding window as a uniform management model for data streams. First, for dimensionality reduction of data streams, it thoroughly reviews high-dimensional reduction from two aspects, feature extraction and feature selection, and analyzes the six latest research trends in dimensionality reduction. For data clustering, it compares clustering algorithms for both traditional static data and dynamic data streams.
Building on the review of previous research in Chapter II, two high-dimensional reduction methods are then designed. The first is based on rough set theory and compresses the data along two aspects: transactions (records) and dimensions. On the one hand, it compresses the transactions while preserving the dimensional characteristics, which increases the ability to distinguish between transactions. On the other hand, by hypothesis testing on the correlations between dimensions, it effectively removes the dimensions that have no influence on the decision result. The second is a commercial data processing method based on rough set equivalence classes, which exploits the relative independence between condition attributes in the decision table to perform the reduction. It is a new dimensionality reduction algorithm; a sample analysis on part of a customer evaluation table shows that the algorithm reduces dimensionality effectively while preserving the original information.
Finally, data stream clustering under limited resources is investigated, and PDStream, an improved clustering algorithm for dynamic data streams based on principal component analysis and density, is designed. It uses a two-stage model for clustering: summary data are used to carry out a simple second-stage clustering and to update the clustering results. Experiments show that PDStream handles massive data well and produces high-quality clusters. Applied to a commercial setting following the data mining life cycle, it achieved the anticipated effect.
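To make the equivalence-class idea concrete, the following is a minimal sketch of textbook rough-set attribute reduction on a decision table. It is not the thesis's algorithm, which the abstract does not specify. The function names (`partition`, `positive_region_size`, `reduce_attributes`) and the small customer evaluation table are hypothetical and chosen only for illustration. A condition attribute is dropped when removing it leaves the positive region (the rows whose equivalence class maps to a single decision value) unchanged.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes: rows agreeing on all attrs."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(i)
    return list(classes.values())

def positive_region_size(rows, cond_attrs, decision):
    """Count rows whose equivalence class carries exactly one decision value."""
    size = 0
    for block in partition(rows, cond_attrs):
        if len({rows[i][decision] for i in block}) == 1:
            size += len(block)
    return size

def reduce_attributes(rows, cond_attrs, decision):
    """Greedy backward elimination (an assumed strategy, one of several possible):
    drop an attribute if the positive region stays the same without it."""
    target = positive_region_size(rows, cond_attrs, decision)
    kept = list(cond_attrs)
    for attr in cond_attrs:
        trial = [a for a in kept if a != attr]
        if positive_region_size(rows, trial, decision) == target:
            kept = trial
    return kept

# Hypothetical customer evaluation table; "satisfied" is the decision attribute.
rows = [
    {"price": "high", "service": "good", "delivery": "fast", "satisfied": "yes"},
    {"price": "high", "service": "poor", "delivery": "fast", "satisfied": "no"},
    {"price": "low",  "service": "good", "delivery": "slow", "satisfied": "yes"},
    {"price": "low",  "service": "poor", "delivery": "slow", "satisfied": "no"},
    {"price": "low",  "service": "good", "delivery": "fast", "satisfied": "yes"},
]
reduct = reduce_attributes(rows, ["price", "service", "delivery"], "satisfied")
print(reduct)
```

In this toy table the decision depends only on `service`, so the other two attributes are removed without losing the ability to determine the decision. The greedy order can affect which reduct is found on real data, so the result should be read as one valid reduct rather than a unique answer.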
Keywords/Search Tags: high dimensional data streams, dimension reduction, rough set, density, data streams clustering