Research On Extended Knowledge Discovery In High-Dimension And Sparse Outliers Set

Posted on: 2008-10-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y F Jin
GTID: 1118360215490527
Subject: Computer application technology
Abstract/Summary:
Data are among the most valuable resources in today's information society. A great deal of useful knowledge is hidden in complex datasets, and discovering and using that knowledge has become a precondition for scientific decision-making. Data mining is the process of acquiring knowledge from large, even massive, datasets using intelligent computing techniques; it uncovers latent but useful knowledge by means of association rule mining, classification, and clustering.

Outliers are observations that lie an abnormal distance from other values in a random sample from a population. They differ markedly from routine observations, to the point that they are suspected of having been generated by a different mechanism. However, outliers are not the same as erroneous data: some outliers may carry very important information, and outliers are the primary objects of analysis in areas such as credit card fraud, disease diagnosis, network intrusion detection, telecommunications fraud analysis, fault detection, and disaster prediction. In any field of investigation, outliers may offer a new perspective that leads to new theories or new applications, so research on outliers is highly significant. Existing research generally focuses on outlier mining, whose aim is a more dependable dataset obtained by singling out the outliers; little of it goes on to analyze the outliers themselves.

The study of outliers should include both outlier mining and outlier analysis. The main contributions of this thesis are as follows. Building on existing outlier mining algorithms, some critical theoretical questions about high-dimensional, sparse outlier datasets are analyzed, such as the classification, origin, meaning, characteristics, and outlying trend of outliers. Based on rough set theory, several novel definitions, for example the key attribute subspace (KAS), are proposed. Accordingly, several novel algorithms are proposed, covering outlying reduction, KAS search, outlier clustering, missing-value processing, and outlying trend analysis.
In addition, a complete framework for characteristic description and extended knowledge discovery on high-dimensional, sparse outlier datasets is presented. As an innovative work, this thesis attempts to break new ground in both research methods and research ideas. The main results are outlined as follows:

① The theories and methods of outlier mining are comprehensively analyzed and summarized. A novel outlier detection algorithm based on the k-nearest neighbor algorithm is proposed, and a partition-based outlier mining algorithm is introduced. Several statistical methods for outlier detection are analyzed and designed in detail, such as a univariate outlier detection algorithm based on likelihood and an outlier detection method based on multivariate regression analysis. Techniques for handling outliers in clustering algorithms are discussed from the viewpoint of outlier mining. Outlier detection is similar to imbalanced classification and to association rule mining based on infrequent patterns, and these similarities are analyzed.

② Using rough set theory, the characteristics of the outlier object subspace are explored from the viewpoint of outlying partitions. Concepts including outlying partition similarity and outlying reduction are proposed; the purpose is to find a smaller attribute subset and to discover from it the causes and probability of the occurrence of an outlier dataset. A novel outlying reduction method based on a genetic algorithm (GA) is proposed, which solves the search for outlying reductions efficiently.

③ The theory and methods of KAS are discussed in detail, including its meaning, effect, and search methods. Based on KAS, missing values, general outliers, and noise are all treated as outlier objects: if the KAS of an outlier is not null, the outlier is considered a general outlier; if the KAS of an outlier does not exist, the outlier is noise. Concepts such as the outlier envelope, the outlier core, and the outlying status matrix of attribute values are proposed.
Based on these concepts, a series of KAS search algorithms are designed, including statistical KAS search for a single outlier, single-outlier KAS search based on maked attribute subspace, statistical KAS search for an outlier set, and KAS search for an outlier set based on the outlier core or on outlying attribute frequency. In addition, the performance of these algorithms is analyzed and tested experimentally.

④ Outlier clusters are defined on the basis of outlying shared attributes, and three principles of outlier clustering are proposed, concerning the number of clusters, the number of objects in a cluster, and the degree of similarity. Then, based on KAS or on the outlying adjacency graph, several outlier clustering algorithms are proposed, and their classification ability and performance are tested and compared experimentally. For outlier cluster analysis, key attribute subspace analysis approaches are proposed from three aspects: inside, outside, and single. An outlier analysis method based on outlying k-nearest neighbors is discussed, and knowledge can be discovered from the relationship between outlying k-nearest neighbors and outlier clusters.

⑤ Objects with missing values are studied as a special kind of outlier object. Based on the grey prediction model GM(1, 1), a grey interpolation reasoning approach for missing values in sequential data is proposed. The approach makes full use of all the information in the neighborhood of the missing point when each missing value is estimated, and an error correction model for the interpolated values is developed, which yields better interpolation performance.

⑥ The outlying trend in sequential data is analyzed. Definitions of the atomic outlier class and the outlying mutation class are proposed, and the general characteristics of each are identified. An approach for estimating the outlier probability of an object is put forward. In combination with KAS, the outlying frequency of attributes is predicted.
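As an illustration of the grey model GM(1, 1) mentioned in item ⑤, the following is a minimal sketch of the standard textbook model used to estimate a single missing value in a sequence. It is not the thesis's grey interpolation reasoning approach: it predicts forward from the values before the gap only and includes no error correction model. The function names `gm11_fit`, `gm11_predict`, and `fill_single_gap` are hypothetical, and the sketch assumes a positive-valued sequence with at least four observations before the gap.

```python
import math


def gm11_fit(x0):
    """Estimate the GM(1,1) parameters (a, b) from a positive sequence x0.

    Assumption: standard textbook GM(1,1), fitted by least squares on
    x0(k) = -a * z1(k) + b, where z1 is the mean of consecutive
    accumulated values.
    """
    if len(x0) < 4:
        raise ValueError("GM(1,1) needs at least 4 observations")
    # Accumulated generating operation (1-AGO): running sum of x0.
    x1, total = [], 0.0
    for v in x0:
        total += v
        x1.append(total)
    # Background values: mean of neighbouring accumulated values.
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, len(x1))]
    y = x0[1:]
    m = len(z)
    sz, sy = sum(z), sum(y)
    szz = sum(v * v for v in z)
    szy = sum(zi * yi for zi, yi in zip(z, y))
    det = m * szz - sz * sz
    if det == 0:
        raise ValueError("degenerate sequence, cannot fit GM(1,1)")
    # Closed-form least-squares solution of the 2x2 normal equations.
    a = (sz * sy - m * szy) / det
    b = (szz * sy - sz * szy) / det
    return a, b


def gm11_predict(x0, k):
    """Return the model value of x0 at 1-based position k (k >= 2)."""
    a, b = gm11_fit(x0)
    if abs(a) < 1e-12:
        # Development coefficient ~ 0: the model reduces to a constant.
        return b
    c = x0[0] - b / a

    def x1_hat(j):
        # Time response of the accumulated sequence.
        return c * math.exp(-a * (j - 1)) + b / a

    # Inverse accumulation recovers the original-scale value.
    return x1_hat(k) - x1_hat(k - 1)


def fill_single_gap(series):
    """Replace the single None in `series` with a forward GM(1,1) estimate.

    Simplification (not from the source): only values before the gap
    are used; values after it are left untouched and ignored.
    """
    idx = series.index(None)
    estimate = gm11_predict(series[:idx], idx + 1)
    return series[:idx] + [estimate] + series[idx + 1:]
```

For example, `fill_single_gap([2.0, 2.2, 2.42, 2.662, 2.9282, None, 3.543122])` fits the model to the five values before the gap and returns the list with the `None` replaced by the one-step-ahead prediction. An approach that uses the whole neighborhood of the missing point, as the abstract describes, would also draw on the values after the gap.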
Keywords/Search Tags: Outliers Mining and Analysis, Outlying Reduction, Key Attribute Subspace, Outliers Clustering, Grey Interpolation, Outlying Trend Analysis