
Research On Classification Algorithms For Weakly Usable Data

Posted on: 2015-10-14  Degree: Master  Type: Thesis
Country: China  Candidate: Y Chen  Full Text: PDF
GTID: 2298330422990904  Subject: Computer Science and Technology
Abstract/Summary:
With the development of science and technology, data volumes are increasing, especially in computer and Internet applications, where they grow exponentially. Massive data brings a wealth of information and knowledge, but also many quality issues, such as incompleteness, inconsistency, inaccuracy, and outdatedness, which seriously restrict the usability and value of the data. Data usability problems have attracted considerable interest, and many findings have been reported. However, these studies focus mainly on data cleaning and repair, whose drawbacks are a huge time cost and the fact that the data cannot be completely repaired. Because the repair targets can be inconsistent with the use targets, subsequent processing may be skewed and erroneous, and unreliable conclusions may be drawn, a problem that is more serious in data mining. Thus, the poor usability of the data needs to be tolerated, and analysis should be performed on it directly. Many data mining algorithms have been proposed, but they tend to assume that the data is highly usable and pay little attention to quality problems. To avoid the errors introduced by cleaning and repair, and to treat the quality of the data and the classification algorithm as a whole, this work focuses on how to classify directly on weakly usable data.

The paper mainly studies classification algorithms for incomplete and noisy data. For incomplete data, basic measures of completeness and entropy-based measures are given, both of which measure completeness at the attribute, tuple, category, and data set levels, and classification algorithms based on interval sets and information theory are proposed. In the interval-set approach, upper and lower bounds describe the extent of combinations of attributes, operations including union, intersection, and difference are used to derive classification rules, and the algorithm gives a corresponding confidence interval for each rule.
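The abstract does not define its completeness measures, so the following is only a minimal sketch of what a basic (non-entropy) measure at each of the four levels might look like. It assumes that completeness is the fraction of non-missing values and that missing values are encoded as `None`. The function names are hypothetical and are not taken from the thesis.

```python
from typing import Hashable, Optional, Sequence

Row = Sequence[Optional[object]]  # None marks a missing value (assumption)


def tuple_completeness(row: Row) -> float:
    """Fraction of non-missing values in a single tuple."""
    return sum(v is not None for v in row) / len(row)


def attribute_completeness(rows: Sequence[Row], j: int) -> float:
    """Fraction of tuples in which attribute j is present."""
    return sum(r[j] is not None for r in rows) / len(rows)


def dataset_completeness(rows: Sequence[Row]) -> float:
    """Fraction of non-missing cells over the whole data set."""
    total = sum(len(r) for r in rows)
    present = sum(v is not None for r in rows for v in r)
    return present / total


def category_completeness(
    rows: Sequence[Row], labels: Sequence[Hashable], category: Hashable
) -> float:
    """Data set completeness restricted to tuples of one category."""
    subset = [r for r, y in zip(rows, labels) if y == category]
    return dataset_completeness(subset)
```

An entropy-based variant would replace these simple ratios with an information-theoretic quantity; the abstract gives no formula for it, so it is not sketched here.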
The information-theory-based approach models classification as an ongoing process of reducing uncertainty: the algorithm first calculates the initial uncertainty of the categories, then uses the attributes to eliminate uncertainty, and finally assigns the instance to the class whose uncertainty is minimal. For noisy data, the noise generation mechanism is modeled and represented by a transition matrix, and algorithms are proposed for data that follow a Gaussian mixture distribution. When the transition matrix is known, the classification parameters can be obtained by solving equations on the number of instances, the mean vector, and the covariance matrix of the model; otherwise, the parameters can be acquired by an iterative EM algorithm, which yields the model that best matches the observed data. For each algorithm, experiments were conducted to verify its effectiveness and feasibility.
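To illustrate the known-transition-matrix case, here is a rough sketch of how per-class instance counts and mean vectors could be recovered by solving linear equations. It is not the thesis's actual derivation. It assumes that `T[i, j]` is the probability that an instance of true class `i` is observed with label `j`, that the noise does not depend on the features, and that the observed statistics equal their expected values. The function name is hypothetical, and the analogous covariance equations are omitted.

```python
import numpy as np


def recover_counts_and_means(T, n_obs, mean_obs):
    """Recover clean per-class counts and means from noisy-label statistics.

    T        : (k, k) array, T[i, j] = P(observed label j | true label i)
    n_obs    : (k,) number of instances observed with each label
    mean_obs : (k, d) mean feature vector of each observed-label group
    """
    T = np.asarray(T, dtype=float)
    n_obs = np.asarray(n_obs, dtype=float)
    mean_obs = np.asarray(mean_obs, dtype=float)

    # In expectation: n_obs[j] = sum_i T[i, j] * n_true[i]
    n_true = np.linalg.solve(T.T, n_obs)

    # In expectation, the feature sums mix the same way:
    # n_obs[j] * mean_obs[j] = sum_i T[i, j] * n_true[i] * mean_true[i]
    sums_obs = n_obs[:, None] * mean_obs
    sums_true = np.linalg.solve(T.T, sums_obs)

    mean_true = sums_true / n_true[:, None]
    return n_true, mean_true
```

On real samples the equalities hold only approximately, and `T` must be invertible for the systems to have a unique solution. When `T` is unknown, the abstract's alternative is an iterative EM procedure, which is not sketched here.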
Keywords/Search Tags: weak usable data, classification, interval set, information theory, label noise