
Research On Mass Data Processing And Data Mining Key Technologies

Posted on: 2016-05-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Liu
Full Text: PDF
GTID: 1318330542474116
Subject: Computer application technology
Abstract/Summary:
Data mining refers to the process of using algorithms to search for hidden information in large amounts of data. The rapid development of information technology and the Internet has produced large amounts of data. How to store and process these data effectively, and how to mine the hidden knowledge in them, is an important problem. This thesis studies preprocessing, storage, and deduplication methods for large-scale data and, on this basis, applies data mining techniques to extract the knowledge implied in the data. The work covers the following aspects:

1. To address the problems posed by massive data, this work builds on data preprocessing technology, compares data compression, incremental backup, and data deduplication, and focuses on deduplication, proposing a K-Means-based duplicate data deletion and adaptive optimization method. First, a consistent hashing algorithm is used in the distributed storage system and combined with the Bloom filter structure used in single-node query systems, which improves the efficiency of distributed index lookup. At the same time, by improving the Rabin-fingerprint-based chunking algorithm and applying a suffix-based adaptive data block optimization method, the chunk selection method achieves better adaptability and data transmission performance. In addition, a K-Means-based duplicate data deletion method is proposed that identifies duplicate data accurately and improves the efficiency of duplicate detection and deletion.

2. Applying clustering to the data that remain after preprocessing and redundancy elimination, we propose the Feature Weighting and Non-negative Matrix Factorization Multi-view Clustering (FWNMF-MC) algorithm. FWNMF-MC takes feature weights and high dimensionality into account in the multi-view clustering process: it automatically assigns different weights according to the characteristics of each feature and the importance of each view. The feature matrix is factorized into a basis matrix and a coefficient matrix, whose product helps map the high-dimensional space to a low-dimensional space. At the same time, the consistency among views in the low-dimensional space is maximized so that the clustering structure found in every view is used efficiently. Experiments show that, compared with current algorithms, FWNMF-MC achieves better clustering results and is suitable for handling massive data.

3. Applying association rules to the data that remain after preprocessing and redundancy elimination, we propose Association Rules Mining based on Particle Swarm Optimization (ARM-PSO). ARM-PSO is based on a particle swarm optimization strategy: the optimal threshold for each particle is found first, and the data are then converted to binary values in order to find suitable minimum support and confidence thresholds. The experimental results show that ARM-PSO can quickly and objectively give appropriate minimum support and confidence values while maintaining mining efficiency, obtains high-quality association rules, and is suitable for association rule mining on massive data sets.
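The minimum support and confidence thresholds that ARM-PSO searches for apply to two standard quantities from association rule mining. The sketch below only illustrates those quantities and a threshold filter over single-item rules; it does not implement the particle swarm search or the binary encoding described in the thesis. The function names, the `SAMPLE` transactions, and the threshold values are hypothetical and chosen purely for illustration.

```python
from itertools import permutations

# Hypothetical toy transactions (not from the thesis); each is a set of items.
SAMPLE = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    joint = support(frozenset(antecedent) | frozenset(consequent), transactions)
    return joint / base


def rules_above(transactions, min_support, min_confidence):
    """List single-item rules a -> b that meet both thresholds.

    Returns tuples (a, b, support, confidence), ordered by item name.
    """
    items = sorted(set().union(*transactions))
    found = []
    for a, b in permutations(items, 2):
        s = support({a, b}, transactions)
        c = confidence({a}, {b}, transactions)
        if s >= min_support and c >= min_confidence:
            found.append((a, b, s, c))
    return found
```

With these sample transactions, `rules_above(SAMPLE, 0.5, 0.9)` keeps only the rule `butter -> bread`, since every transaction containing butter also contains bread. Choosing the two thresholds by hand, as here, is the step the abstract says ARM-PSO automates.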
Keywords/Search Tags:data mining, data deduplication, multi-view clustering, particle swarm optimization, association rule