
Research Of Outlier Testing Methods In High-Dimensional Dataspace

Posted on: 2006-03-14
Degree: Master
Type: Thesis
Country: China
Candidate: W Li
Full Text: PDF
GTID: 2168360155952982
Subject: Computer software and theory
Abstract/Summary:
With the development of computers and networks, data is playing an increasingly important role in many areas. Data analysis and data testing in particular have become a research subject within information technology, namely data mining. Most research in data mining concentrates on mining normal patterns, and research on abnormal patterns is relatively scarce. However, the applications of outlier testing in e-business fraud detection, network intrusion detection, data cleaning and similar areas cannot be ignored. Several outlier testing methods have been proposed in recent years, for example distance-based, density-based, depth-based, deviation-based, statistics-based, and pattern- or expert-system-based methods. The disadvantage of these methods is that they are not suitable for high-dimensional dataspaces or very large datasets. For this reason I put forward two methods that can efficiently detect outliers in high-dimensional, large datasets. The first method is the Weighted Hypergraph-based Outlier Test (WHOT). It is based on weighted association rule mining using a weighted support and significance framework, and on multilevel hypergraph partitioning algorithms. The main idea of the algorithm is as follows. The traditional association rule mining model assumes that all items have the same significance, without taking into account their weights or attributes within a transaction or within the whole item space, but this is not always the case. I therefore design an association rule mining algorithm, WApriori, built on a definition of weighted support. The efficiency of the algorithm in time and space is satisfactory. A hypergraph G=(V,E) is constructed from the significant association rules mined from the dataset by the WApriori algorithm, where V is the vertex set corresponding to the record set of the dataset and E is the hyperedge set corresponding to the significant itemsets.
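The weighted support and hypergraph construction described above can be illustrated with a minimal Python sketch. The abstract does not state the exact definition of weighted support used by WApriori, so the sketch assumes one common formulation: a transaction's weight is the average weight of its items, and the weighted support of an itemset is the share of total transaction weight carried by the transactions that contain it. The function names, the item weights and the toy records are all hypothetical and serve only as an illustration.

```python
def transaction_weight(transaction, item_weights):
    """Average weight of the items in one transaction (assumed definition)."""
    if not transaction:
        return 0.0
    return sum(item_weights[item] for item in transaction) / len(transaction)


def weighted_support(itemset, transactions, item_weights):
    """Fraction of the total transaction weight held by transactions containing itemset."""
    itemset = frozenset(itemset)
    weights = [transaction_weight(t, item_weights) for t in transactions]
    total = sum(weights)
    if total == 0:
        return 0.0
    covered = sum(w for t, w in zip(transactions, weights) if itemset <= t)
    return covered / total


def build_hyperedges(significant_itemsets, transactions):
    """Map each significant itemset to the record indices (vertices) that contain it."""
    return {
        frozenset(s): {i for i, t in enumerate(transactions) if frozenset(s) <= t}
        for s in significant_itemsets
    }


# Hypothetical toy data: four records over three weighted items.
item_weights = {"a": 0.9, "b": 0.1, "c": 0.5}
transactions = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]

support_a = weighted_support({"a"}, transactions, item_weights)
edges = build_hyperedges([{"a", "b"}], transactions)
```

In this sketch each hyperedge is simply the set of record indices sharing a significant itemset; how the thesis weights the hyperedges and defines a window is not reproduced here.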
I define the meaning of a Window in the hypergraph and the way to calculate the weight of a hyperedge. The cluster set C is then obtained by applying the multilevel hypergraph partitioning algorithm hMETIS to the weighted hypergraph. After that I define the measures that decide whether a vertex in the hypergraph (or a record in the dataset) is an outlier: the weighted support of a vertex to a window (WS), the weighted belongingness of a vertex to a cluster (WB), and the deviation of size of a vertex to a window (WD). The user-supplied thresholds for these measures are WSt, WBt and WDt, and if the measures of a vertex in a window have the relation that WSWBt and WD
Keywords/Search Tags:Data mining, Clustering, Outlier, High dimension, Hypergraph-based Partitioning, Pattern-based Clustering