Local Outlier Data Mining Based on Related Subspaces and Its Application

Posted on: 2015-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Li
GTID: 2268330428477817
Subject: Computer application technology
Abstract/Summary:
With the rapid development of technology, data volumes are exploding and mankind has entered the era of big data. As data grow in both size and dimensionality, obtaining the desired information from large, high-dimensional data efficiently and with high quality has become a hot research topic in data mining. This thesis studies outlier mining algorithms based on related subspaces and the MapReduce programming model. The main results are as follows:

1) A local outlier mining algorithm based on the related subspace is presented, which uses local sparsity differences and local density differences as its measure factors. The algorithm determines the local data set of each data object using k-NN, and generates global and local sparse factor matrices from the sparse factors of the attribute values, which effectively reflect how sparse each data object is locally. After the local sparse difference factor of each attribute dimension of a data object is computed, the object's subspace definition vector is derived from the local sparse factor matrix. In this way the algorithm can characterize arbitrary related subspaces of a data object, which are used to determine the object's local density difference, expressed as a Gaussian error function. As a result, the "curse of dimensionality" is significantly alleviated: outlier measurement in a related subspace is independent of the dataset's dimensionality, and an object's outlierness can be measured from the perspective of any relevant subspace. Otherwise, the object's local density difference is set to zero to indicate that it is a normal object. The data objects with the largest local density difference (outlier degree) are selected as local outliers. Finally, UCI and stellar spectral data sets are used to verify the effectiveness of the algorithm.

2) A parallel local outlier mining algorithm based on the related subspace and the MapReduce programming model is proposed. First, the parallelization of PLOF is analyzed and its MapReduce implementation is given. Then a parallel local outlier mining algorithm based on the MapReduce programming model is proposed, which adopts an LSH distribution strategy. Finally, artificial data sets and stellar spectral data sets are used to verify the effectiveness and scalability of the parallel algorithm.

3) Based on the above results, a visualization process for mining outliers in astronomical spectra based on the related subspace is designed and implemented with JDK development tools, and the implementation techniques are described in detail. This provides a new way of finding unknown special objects.
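The pipeline described in result 1) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's formulas, so every definition here is an assumption made for illustration: the "sparse factor" of an attribute is taken to be its standard deviation (over the whole data set for the global factor, over the k-NN neighbourhood for the local one), an attribute joins the related subspace when its local factor falls below a hypothetical `ratio` of the global one, and the score is the Gaussian error function of the object's normalized deviation inside that subspace. The function names `knn` and `outlier_scores` are likewise hypothetical.

```python
import math


def _std(values):
    """Population standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def knn(data, i, k):
    """Indices of the k nearest neighbours of data[i] (full-space Euclidean)."""
    ranked = sorted((math.dist(data[i], p), j)
                    for j, p in enumerate(data) if j != i)
    return [j for _, j in ranked[:k]]


def outlier_scores(data, k=5, ratio=0.5):
    """Score each object in [0, 1]; higher means more outlying.

    ASSUMED definitions (not taken from the thesis):
      - sparse factor of an attribute = its standard deviation
      - an attribute is "related" if local factor < ratio * global factor
      - score = erf(z / sqrt(2)), z = RMS deviation in the related subspace
    """
    dims = len(data[0])
    # Global sparse factor per attribute, computed over the whole data set.
    global_sf = [_std([p[a] for p in data]) for a in range(dims)]
    scores = []
    for i, obj in enumerate(data):
        nbrs = [data[j] for j in knn(data, i, k)]
        # Local sparse factor per attribute, computed over the neighbourhood.
        local_sf = [_std([p[a] for p in nbrs]) for a in range(dims)]
        # Subspace definition vector: True marks a related attribute.
        subspace = [local_sf[a] < ratio * global_sf[a] for a in range(dims)]
        if not any(subspace):
            scores.append(0.0)  # no related subspace: treated as normal
            continue
        z2 = 0.0
        for a in range(dims):
            if subspace[a]:
                mean_a = sum(p[a] for p in nbrs) / len(nbrs)
                # Small constant guards against a zero local spread.
                z2 += ((obj[a] - mean_a) / (local_sf[a] + 1e-12)) ** 2
        z = math.sqrt(z2 / sum(subspace))
        scores.append(math.erf(z / math.sqrt(2)))
    return scores
```

Calling `outlier_scores(points)` on a list of equal-length numeric tuples returns one score per object, and the objects with the largest scores would be reported as local outliers. Because the score only uses attributes flagged in the subspace definition vector, adding irrelevant attributes does not by itself dilute it, which is the property the abstract attributes to related subspaces.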
Keywords/Search Tags: Related local subspace, high dimensional big data, local outlier data sets, MapReduce, probability density