
Research On Uncertain Data Clustering Algorithm And Its Parallelization

Posted on: 2020-05-19
Degree: Master
Type: Thesis
Country: China
Candidate: S Y He
GTID: 2428330590471601
Subject: Electronic and communication engineering
Abstract/Summary:
Large amounts of data are generated by data transmission on the Internet, data collection in sensor networks, and transaction records in the financial sector, and uncertain data accounts for a large share of it. Because uncertainty is always present in real-world data, the study of uncertain data has received growing attention in recent years. This uncertainty affects the final clustering results and cannot be ignored, so handling uncertain data effectively has become a research hotspot. Research on uncertain data clustering generally assumes that the uncertain data obeys a certain distribution, so that a probability density function or probability distribution function representing the data can be obtained. However, this assumption is difficult to keep consistent with the actual distribution of the data, which degrades both clustering quality and computational efficiency. In addition, existing density-based algorithms are sensitive to their initial parameters and cannot find clusters of arbitrary density when the uncertain data has uneven density. Most existing algorithms can also run only on a single machine and therefore cannot meet the needs of big data processing.

In view of these shortcomings, the main work of this thesis is as follows. First, the thesis improves the traditional hierarchical density-based clustering algorithm OPTICS (Ordering Points To Identify the Clustering Structure) and proposes UD-OPTICS (Uncertain Data OPTICS), an uncertain data clustering algorithm based on interval numbers. The improved algorithm uses interval number theory, combined with statistical information about the uncertain data, to represent the uncertain data more accurately. The concepts of interval core distance and interval reachable distance, both with low computational complexity, are proposed. A distance formula between interval numbers is studied and used to compute these two distances, which in turn measure the similarity between uncertain data objects and order the objects to identify the cluster structure. The experimental results show that, relative to the algorithm used for comparison, the clustering quality of the improved algorithm increases by 15.33% on average, and by 23.91% on the dataset with uneven density.

Second, to address the problem that the serial UD-OPTICS algorithm cannot meet the needs of big data clustering, the thesis combines UD-OPTICS with the Hadoop platform and proposes HUD-OPTICS, an efficient parallel uncertain data clustering algorithm. HUD-OPTICS uses the MapReduce model to implement parallel computing and uses an improved PRBP data partitioning method to split the dataset into balanced partitions with a minimum number of boundary points, which supports load balancing across nodes and efficient operation of the algorithm. Experiments on a Hadoop platform built for this work show that the HUD-OPTICS algorithm can meet the needs of clustering uncertain big data.
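
The abstract does not give the thesis's actual formulas, so the following Python sketch only illustrates the general idea of interval-based core and reachable distances. Everything in it is an assumption made for illustration: the interval is built as mean plus or minus one standard deviation, the interval distance is an endpoint-based Euclidean form, the neighborhood radius is treated as unbounded, and the helper names (`to_interval`, `interval_distance`, `core_distance`, `reachability_distance`) are hypothetical rather than taken from UD-OPTICS.

```python
import math
from statistics import mean, pstdev


def to_interval(samples):
    """Summarize one dimension's observations as [mean - std, mean + std].

    Assumption: the thesis may build its intervals differently.
    """
    m, s = mean(samples), pstdev(samples)
    return (m - s, m + s)


def interval_distance(a, b):
    """Distance between two interval vectors.

    Assumed form: Euclidean distance over lower and upper endpoints,
    halved so that it reduces to the ordinary Euclidean distance when
    every interval collapses to a single point.
    """
    return math.sqrt(sum(((al - bl) ** 2 + (ah - bh) ** 2) / 2
                         for (al, ah), (bl, bh) in zip(a, b)))


def core_distance(idx, objects, min_pts):
    """Distance from objects[idx] to its min_pts-th nearest other object.

    The neighborhood radius (eps in OPTICS) is treated as unbounded here.
    Returns infinity when there are fewer than min_pts other objects.
    """
    dists = sorted(interval_distance(objects[idx], o)
                   for j, o in enumerate(objects) if j != idx)
    return dists[min_pts - 1] if len(dists) >= min_pts else math.inf


def reachability_distance(p_idx, o_idx, objects, min_pts):
    """Reachability of object p from object o, as in OPTICS:
    max(core distance of o, distance between o and p)."""
    return max(core_distance(o_idx, objects, min_pts),
               interval_distance(objects[o_idx], objects[p_idx]))


if __name__ == "__main__":
    # Three one-dimensional uncertain objects, already given as intervals.
    objects = [[(0.0, 1.0)], [(1.0, 2.0)], [(5.0, 6.0)]]
    print(core_distance(0, objects, min_pts=1))
    print(reachability_distance(2, 0, objects, min_pts=1))
```

In an OPTICS-style ordering, objects would be visited in order of increasing reachability distance; that loop is omitted here because the abstract does not describe how UD-OPTICS implements it.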
Keywords/Search Tags: Uncertain data, clustering, OPTICS, parallelization, big data