
Related Technologies Research On Feature Selection For Text Categorization

Posted on: 2010-10-12    Degree: Doctor    Type: Dissertation
Country: China    Candidate: B Wang    Full Text: PDF
GTID: 1118360305473651    Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technology, applications based on online and electronic texts have become popular and ubiquitous, including information retrieval of Web news, content-based spam email filtering, sentiment analysis of forums, topic detection of blogs, and so on. Consequently, content-based text categorization has been advocated as a way to organize and mine text information effectively. In practice, the intrinsic high dimensionality of text data may induce ineffective learning, so dimension reduction, an indispensable step in text categorization, has become a fertile research field.

Feature selection is one of the most important and frequently used data preprocessing techniques in data mining and pattern recognition. It reduces the number of features by removing irrelevant, redundant, or noisy ones, and brings immediate benefits to applications. Furthermore, supervised learning information (label information) plays an important role in text categorization. The essential characteristics of text categorization, including complex relationships among classes, unbalanced class distributions, a shortage of class labels, and uncertain label information, pose further challenges for feature selection.

This dissertation takes text categorization as its research background, puts emphasis on the above challenges of feature selection, and addresses several key problems under different supervised learning models. The main contributions of this dissertation can be summarized as follows:

(1) Under the supervised learning model, to resolve the problem caused by complex relationships among classes, we introduce a novel feature selection algorithm, FSRRH, for hierarchical text categorization. Assuming the classes are organized in a tree-like structure, different training sets are extracted at different levels of the hierarchy to counter the unbalanced class distribution. FSRRH employs the normalized Information Gain to choose feature subsets with different distinguishing abilities. To support redundancy removal between feature subsets, we also devise an adjustment to the Approximate Markov Blanket. Experimental results show that, compared with other hierarchical feature selection methods, FSRRH improves classification performance by alleviating the unbalance problem.

(2) Under the semi-supervised learning model, to address the problem caused by the lack of class labels, we present a semi-supervised feature selection algorithm, SFRSC. It makes full use of both the few labeled samples and the plentiful unlabeled samples, and determines the extension direction and extension range on the basis of the theory of Relevant Set Correlation. An integrative criterion is designed to measure the self-correlation within a class and the dispersion between classes. An empirical study of the algorithm in terms of efficiency and scalability is presented, which verifies the advantages of SFRSC in learning from the available training information, compared with other representative methods.

(3) In text categorization, feature selection algorithms may need to be re-designed as the supervised information evolves. After an analysis of the inherent link between feature selection under different supervised learning models, a feature selection method for multiple supervised learning models, FSM_HSIC, is investigated. Based on the theory of the Hilbert-Schmidt Independence Criterion (HSIC), the non-linear correlation in the low-dimensional input space is mapped into a linear correlation in a high-dimensional space with kernel functions.
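The abstract does not reproduce the criterion itself; for reference, the standard biased empirical HSIC estimator over n samples, which FSM_HSIC presumably instantiates, is

\[
\widehat{\mathrm{HSIC}}(X, Y) = \frac{1}{(n-1)^{2}} \operatorname{tr}(K H L H),
\qquad
H = I_{n} - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top},
\]

where $K_{ij} = k(x_i, x_j)$ is the Gram matrix over the samples and $L_{ij} = l(y_i, y_j)$ is the Gram matrix over the supervised information.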
Different constructions of the Gram matrix support the instantiation of FSM_HSIC, yielding practical algorithms under particular supervised learning models. We demonstrate the utility of FSM_HSIC by explaining some existing algorithms theoretically. Moreover, a novel interactive feature selection algorithm, FSI, is explored on the basis of FSM_HSIC. Empirical validation of the stability and convergence of FSI is presented: it discovers interactive features efficiently, which verifies that FSM_HSIC can guide the design of new algorithms.

(4) The research above all concerns precise text data. To deal with the problem caused by uncertain supervised information, we propose the feature selection algorithm FSUNT. First, the uncertainty is represented in terms of probability or fuzzy entropy; then the uncertain information is incorporated into HSIC-based feature selection. The experimental results show that FSUNT measures the correlation between features and uncertain class labels effectively and stably, compared with two representative algorithms.

In summary, this dissertation focuses on the essential character of feature selection, which is a data-driven and application-driven field. We study four pressing issues that feature selection is confronted with in text categorization under different supervised learning models, and propose algorithms and methods to resolve the corresponding problems. These works have academic and practical value for advancing the theory and practicability of feature selection.
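As a concrete illustration of the HSIC-based scoring that contributions (3) and (4) build on, the minimal Python sketch below ranks individual features by the empirical estimator given above. The linear kernel on feature values, the label-match kernel on classes, and the top-k selection are illustrative assumptions only; the sketch is not a reproduction of FSM_HSIC, FSI, or FSUNT.

import numpy as np

def hsic_score(x, y):
    """Biased empirical HSIC between one feature column x and labels y.

    Assumes a linear kernel on the feature values and a label-match
    (delta) kernel on the class labels; both are illustrative choices.
    """
    n = len(y)
    K = np.outer(x, x)                             # Gram matrix of the feature
    L = (y[:, None] == y[None, :]).astype(float)   # Gram matrix of the labels
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def select_features(X, y, k):
    """Return the indices of the k highest-scoring columns of X (n samples x d features)."""
    scores = np.array([hsic_score(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

# Usage: given a document-term matrix X and a NumPy array of class labels y,
# keep the 100 terms most dependent on the labels:
# selected = select_features(X, y, k=100)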
Keywords/Search Tags: Data mining, Pattern recognition, Text categorization, Feature selection, Supervised learning