
Research On Reduct Technology And Feature Selection Algorithm In Data Mining

Posted on: 2007-09-10
Degree: Master
Type: Thesis
Country: China
Candidate: H Liu
Full Text: PDF
GTID: 2178360182996200
Subject: Computer software and theory
Abstract/Summary:
Data mining, as a multidisciplinary subject, has developed rapidly in recent years. It draws on databases, statistics, artificial intelligence, machine learning, and other fields. Its major task is to extract valuable knowledge and obtain more usable information from data.

Knowledge is the basis of artificial intelligence, and symbolic reasoning is one of its main solution methods. In rough set theory, attribute reduct algorithms are based on the discernibility matrix, and the reduction of Boolean functions is their core operation. Boolean functions must also be reduced in digital circuit design. Boolean manipulation and Boolean reducts are thus applied in many fields.

After a study of existing algorithms for the reduct of Boolean functions, section 2 introduces an algorithm for the reduct of Boolean functions based on primes. Its main idea is to represent Boolean variables by primes and to denote a basic conjunction by an ordered pair of integers. The first element of the pair, the product of the primes corresponding to all un-negated variables in the conjunction, denotes the meet of those variables; the second element, the product of the primes corresponding to all complemented variables, denotes the meet of the complements. If the conjunction contains no un-negated variables, or no complemented variables, the corresponding element of the pair is assigned 1. Boolean calculation is thereby replaced by arithmetic manipulation, which saves memory and improves efficiency. (A minimal sketch of this encoding is given at the end of this summary.) The Boolean system is redefined, its laws are described by axioms and rules, and the new system is proved complete. The prime-based reduct algorithm is applied to rough sets, and the results indicate that it is efficient.

As data grow, especially in feature dimensionality, data mining suffers from the curse of dimensionality, which gave rise to feature selection. Real-world datasets contain many features that are irrelevant or redundant to the target concept. Such attributes considerably slow down data mining algorithms, and redundant features in particular worsen their predictive accuracy. Feature selection is therefore one of the important tasks in data mining: it can speed up a mining algorithm and improve its predictive accuracy.

After a survey of feature selection, two algorithms built on the χ² statistic are proposed: a feature selection algorithm based on the χ² statistic (FSBC), which uses χ² values directly, and a feature selection algorithm based on potential deviation (FSBPD), which transforms χ² values into significance levels of independence.

Section 4 first introduces the crosstab (contingency table) and discusses its relation to the Pearson χ² statistic, then presents FSBC. Its main idea is to compute the Pearson χ² statistic between each feature and the class via a crosstab; the larger the value, the higher the correlation. According to their correlation with the class, all features whose χ² value is not 0 are partitioned into three subsets: a strong relevant set, a relevant set, and a weak relevant set. The weakest feature in the strong relevant set defines the threshold of correlation, and the strongest feature in the weak relevant set defines the threshold of independence.
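This partition step can be illustrated with a short Python sketch. The names `chi2_stat` and `fsbc_partition`, and the tercile-style three-way split, are illustrative assumptions; the thesis derives the strong/relevant/weak boundaries from the data rather than from fixed ranks.

```python
import numpy as np

def chi2_stat(feature, labels):
    """Pearson chi-square statistic of a discrete feature against the
    class, computed from their crosstab (contingency table)."""
    f_vals = sorted(set(feature))
    c_vals = sorted(set(labels))
    table = np.zeros((len(f_vals), len(c_vals)))
    for f, c in zip(feature, labels):
        table[f_vals.index(f), c_vals.index(c)] += 1
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / table.sum()
    mask = expected > 0
    return float((((table - expected) ** 2)[mask] / expected[mask]).sum())

def fsbc_partition(X, y):
    """Rank features by chi-square with the class and split the nonzero
    ones into strong / relevant / weak sets (the tercile split is an
    assumed stand-in for the thesis's data-driven partition)."""
    scores = {j: chi2_stat(X[:, j], y) for j in range(X.shape[1])}
    ranked = [j for j, s in sorted(scores.items(), key=lambda t: -t[1])
              if s > 0]
    k = max(1, len(ranked) // 3)
    strong, relevant, weak = ranked[:k], ranked[k:2 * k], ranked[2 * k:]
    # The two thresholds described in the abstract: the weakest of the
    # strong correlations and the strongest of the weak correlations.
    corr_threshold = scores[strong[-1]]
    indep_threshold = scores[weak[0]] if weak else 0.0
    return strong, relevant, weak, corr_threshold, indep_threshold
```

Features whose χ² value is exactly 0 are independent of the class and fall outside all three sets.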
The three sets are treated differently. Redundant features are deleted from the strong relevant set, yielding a reduced strong relevant set in which the features are pairwise independent. The relevant set is then filtered: if a feature Fi in the relevant set is correlated with any feature in the reduced strong relevant set, Fi is deleted, which yields the reduced relevant set. The weak relevant set is deleted completely. The selected feature subset is the union of the reduced strong relevant set and the reduced relevant set. FSBC is based on the filter model, which is more efficient than the wrapper model and independent of any particular mining algorithm, so FSBC is general. The algorithm has no parameters that are difficult to choose, and its result is objective.

Section 5 proposes Correlation Probability (CP). The χ² value of each feature and the class is calculated via a crosstab and transformed into a significance level of independence, and CP is described by this significance level. This work also presents Potential Deviation (PD). The features of a subset are ordered in two lists, one by correlation with the class and one by correlation with a reference feature; the PD of a feature Fi is its position deviation between the two lists. According to the definition of PD, features are divided into positive PD, zero PD, and negative PD features. Positive PD features are more correlated with the class and less redundant with respect to the reference feature; zero PD features are as correlated with the class as they are redundant; negative PD features are more redundant. A feature selection algorithm based on PD (FSBPD) is presented, whose major task is to delete the negative PD features. Its features must be discrete, so all features are first discretized with k-means; because the main task is feature selection, the discretization is unsupervised and tries to keep the distribution of the discretized data the same as before. FSBPD can remove redundant features, but it cannot guarantee the removal of irrelevant features, because it does not deal with them. Like FSBC, FSBPD is parameter-free and based on the filter model.

At the end of sections 4 and 5, experimental results are presented. They indicate that the two feature selection algorithms can select good subsets, but some shortcomings remain to be studied further.

In FSBC, the comparison of χ² values is limited in that all degrees of freedom should be consistent; how to compare values while avoiding this limit deserves further consideration. The accuracy of the results also depends on the weakest strong correlation and the strongest weak correlation, and how to choose these two thresholds accurately still needs theoretical discussion.

FSBPD removes redundant features effectively, but it cannot always remove irrelevant features; how to decide a threshold for removing irrelevant features remains to be discussed.
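The PD mechanism can be made concrete with a minimal sketch, reusing the `chi2_stat` helper from the FSBC sketch above. The names `potential_deviation` and `fsbpd_select`, and the choice of the single most class-correlated feature as the reference, are illustrative assumptions; the thesis works with significance levels (CP) rather than the raw χ² ranking used here.

```python
def potential_deviation(X, y, ref, candidates):
    """Order candidates by correlation with the class and by redundancy
    (correlation) with the reference feature `ref`; the PD of a feature
    is the difference of its positions in the two lists, so a positive
    PD means 'more correlated with the class than redundant'."""
    by_class = sorted(candidates, key=lambda j: -chi2_stat(X[:, j], y))
    by_ref = sorted(candidates, key=lambda j: -chi2_stat(X[:, j], X[:, ref]))
    return {j: by_ref.index(j) - by_class.index(j) for j in candidates}

def fsbpd_select(X, y):
    """Delete the negative-PD features.  X is assumed already discrete
    (the thesis discretizes with unsupervised k-means first)."""
    candidates = list(range(X.shape[1]))
    ref = max(candidates, key=lambda j: chi2_stat(X[:, j], y))
    rest = [j for j in candidates if j != ref]
    pd_values = potential_deviation(X, y, ref, rest)
    return [ref] + [j for j in rest if pd_values[j] >= 0]
```

The features dropped here are exactly those ranked higher by redundancy with the reference than by correlation with the class, which matches the abstract's description of negative PD features.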
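Finally, here is the promised sketch of the prime encoding of section 2. Encoding a basic conjunction as an ordered pair of prime products follows the abstract directly; the divisibility-based contradiction and absorption tests, and the names used, are natural consequences of the encoding chosen for illustration, not the thesis's complete reduct algorithm.

```python
from math import gcd

# Each Boolean variable is assigned a distinct prime (an illustrative map).
PRIMES = {'a': 2, 'b': 3, 'c': 5, 'd': 7}

def encode(pos, neg):
    """Encode a basic conjunction as an ordered pair of integers: the
    first element is the product of the primes of the un-negated
    variables, the second the product of the primes of the complemented
    variables; a part with no variables is encoded as 1."""
    p, n = 1, 1
    for v in pos:
        p *= PRIMES[v]
    for v in neg:
        n *= PRIMES[v]
    return p, n

def is_zero(term):
    """A conjunction holding both a variable and its complement is
    identically false: its two components share a prime factor."""
    return gcd(term[0], term[1]) > 1

def absorbs(t1, t2):
    """t1 absorbs t2 (t1 + t2 = t1) iff every literal of t1 also occurs
    in t2, i.e. both components of t1 divide the components of t2."""
    return t2[0] % t1[0] == 0 and t2[1] % t1[1] == 0

def absorb_reduce(terms):
    """One simplification pass over a disjunction of conjunctions: drop
    false terms and terms absorbed by some other term."""
    terms = [t for t in terms if not is_zero(t)]
    return [t for t in terms
            if not any(u != t and absorbs(u, t) for u in terms)]

# a*b + a*b*c' reduces to a*b: (6, 1) divides (6, 5) componentwise.
print(absorb_reduce([encode("ab", ""), encode("ab", "c")]))  # -> [(6, 1)]
```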
Keywords/Search Tags: data mining, reduct, prime, feature selection, χ²