Research On Classification Algorithms For Weakly Usable Data

Posted on:2015-10-14

Degree:Master

Type:Thesis

Country:China

Candidate:Y Chen

Full Text:PDF

GTID:2298330422990904

Subject:Computer Science and Technology

Abstract/Summary:

With the development of science and technology, data volumes keep increasing, and in computing and Internet applications they grow exponentially. Massive data brings a wealth of information and knowledge, but also many quality problems, such as incompleteness, inconsistency, inaccuracy, and outdatedness, which seriously restrict the usability and value of the data. Data usability problems have attracted considerable interest, and many findings have been reported. These studies, however, focused mainly on data cleaning and repair, whose drawbacks are a huge time cost and the fact that the data cannot be completely repaired. Because the repair targets can be inconsistent with the targets of use, subsequent processing may be skewed or erroneous and unreliable conclusions may be drawn, a problem that is more serious in data mining. The poor usability of the data therefore needs to be tolerated, and analysis should be performed on the data directly. Many data mining algorithms have been proposed, but they tend to assume that the data is highly usable and pay little attention to quality problems. To avoid the errors introduced by cleaning and repair, and to treat data quality and the classification algorithm as a whole, this work focuses on how to classify weakly usable data directly.

The thesis mainly studies classification algorithms for incomplete and noisy data. For incomplete data, basic measures of completeness and entropy-based measures were given, both of which measure completeness at the attribute, tuple, category, and data set levels, and classification algorithms based on interval sets and on information theory were proposed. In the interval-set approach, upper and lower bounds were used to describe the extent of each combination of attributes, the operations of union, intersection, and difference were used to derive classification rules, and the algorithm gives a corresponding confidence interval for each rule.
The information-theoretic approach models classification as an ongoing process of reducing uncertainty: the algorithm first calculates the initial uncertainty of each category, then uses the attributes to eliminate uncertainty, and finally assigns the instance to the class whose uncertainty is minimal.

For noisy data, the noise generation mechanism was modeled and represented by a transition matrix, and algorithms were proposed for data that follows a Gaussian mixture distribution. When the transition matrix is known, the classification parameters can be obtained by solving equations on the number of instances, the mean vector, and the covariance matrix of the model; otherwise the parameters can be estimated with an iterative EM algorithm, which yields the model that best matches the observed data. For each algorithm, experiments were conducted to verify its effectiveness and feasibility.
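
The known-transition-matrix case for noisy labels can be illustrated with a small sketch. It is not the thesis's own formulation, which this abstract does not reproduce. It assumes two classes, one-dimensional features, and a hypothetical matrix `T` with `T[i][j]` read as the probability that an instance of true class `i` is observed with label `j`. Under those assumptions the observed per-label counts and sums are linear mixtures of the clean ones, so the clean counts and means follow from solving two small linear systems. The function names `solve_2x2` and `recover_clean_stats` are invented for illustration.

```python
def solve_2x2(a, b, c, d, r1, r2):
    """Solve [[a, b], [c, d]] @ [x, y] = [r1, r2] with Cramer's rule."""
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("transition matrix is singular; classes cannot be separated")
    return (r1 * d - b * r2) / det, (a * r2 - c * r1) / det


def recover_clean_stats(T, obs_counts, obs_means):
    """Estimate clean class counts and means from noisy-label statistics.

    Assumed model (an illustration, not taken from the thesis):
      observed count  m_j           = sum_i T[i][j] * n_i
      observed sum    m_j * mean_j  = sum_i T[i][j] * n_i * mu_i
    where n_i and mu_i are the clean count and mean of true class i.
    """
    # Both systems share the coefficient matrix T transposed.
    a, b = T[0][0], T[1][0]
    c, d = T[0][1], T[1][1]

    # Clean instance counts per class.
    n0, n1 = solve_2x2(a, b, c, d, obs_counts[0], obs_counts[1])

    # Clean per-class feature sums, then divide by the counts to get means.
    s0, s1 = solve_2x2(a, b, c, d,
                       obs_counts[0] * obs_means[0],
                       obs_counts[1] * obs_means[1])
    return (n0, n1), (s0 / n0, s1 / n1)


if __name__ == "__main__":
    T = [[0.9, 0.1],   # true class 0 keeps its label 90% of the time
         [0.2, 0.8]]   # true class 1 is flipped to label 0 20% of the time
    counts, means = recover_clean_stats(T, obs_counts=(100, 50), obs_means=(0.4, 3.2))
    print(counts, means)
```

The same linear structure extends to mean vectors and covariance matrices in higher dimensions. When `T` is unknown, the abstract states that an iterative EM algorithm is used instead, which this sketch does not cover.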

Keywords/Search Tags:

weakly usable data, classification, interval set, information theory, label noise

Related items

1	Research On Multi-label Classification With Incomplete Label Information
2	Research On Multi Label Image Classification Method Based On Incomplete Label
3	Feature Selection Research For Multi-label And Weak-label Based On Fuzzy Entroy
4	Research On Multi-label Active Learning Under Weak Labeled Condition
5	Research On Classification Over Uncertain Data
6	Representation And Classification Of Interval-valued Data
7	Rough Set Model Based On Weak Similarity Relation In Incomplete Interval-valued Information System
8	Research On Multi-label Text Classification By Integrating Label Informatio
9	Research On Ordinal Classification And Regression Of Interval Valued Data
10	Multi-label Prediction Model Based On Ontology Database And Data Mining In Bio-medicine