Research On Feature Selection Methods In Information Filtering System

Posted on: 2009-09-02
Degree: Master
Type: Thesis
Country: China
Candidate: M F Wang
GTID: 2178360242494752
Subject: Computer software and theory
Abstract/Summary:
With the rapid development and spread of the Internet, the volume of electronic text has grown enormously. How to organize and process large amounts of document data, and how to find the information users want quickly, accurately, and completely, is a major challenge for information science and technology. As a key technology for organizing and processing large document collections, network information filtering can largely resolve the problem of information disorder and helps users find the information they need quickly. Recent research on information filtering has focused mostly on exploring and improving different classification algorithms. However, feature selection has always been both a foundational task and a bottleneck in network information filtering, so feature selection algorithms deserve study in their own right.

Common feature selection algorithms currently assume that features are conditionally independent and evaluate each feature in the feature set separately by means of an evaluation function. Because these methods do not analyze the relevance of features to categories or the redundancy within feature subsets, the subsets they select are sometimes redundant in their ability to discriminate between categories, which degrades the final classification.

This thesis studies the following aspects of feature selection in information filtering systems:

1. The strengths and weaknesses of commonly used feature selection methods are analyzed, and directions for improvement are identified. The thesis first gives a comprehensive analysis of feature selection technology, with emphasis on its general framework. Each of the commonly used feature selection methods has its own strengths and weaknesses.
We analyze these strengths and weaknesses in terms of computational complexity and classification performance, and point out their likely causes. In addition, we summarize the experimental conclusions reported in the related literature; these conclusions agree with our final experimental results.

2. A feature selection framework, FSBC (Feature Selection Based on Correlation), is proposed, starting from the definitions of feature relevance and redundancy. It splits feature selection into two steps: first, select the feature subset that is relevant to the categories; second, remove the redundant feature items from the selected subset through redundancy analysis, yielding the optimized feature subset.

For the relevance step, the thesis constructs an evaluation function based on the following principle: if a feature item t appears frequently in the documents of one category but rarely in those of other categories, then t represents that category well, should be given a higher weight, and should be selected as a feature word for distinguishing that category's documents from the others. The thesis also borrows the idea behind TF-IDF weighting and combines term frequency and document frequency as the basis for evaluating features.

For the redundancy step, the thesis adopts K-Means, a commonly used clustering algorithm, as the core algorithm for removing redundancy.
In this algorithm, the selection of the initial cluster centers and of the initial number of clusters is improved so that K-Means reduces the redundancy of the feature set effectively.

3. Finally, the proposed feature selection strategy is applied to a network information filtering platform, with satisfactory experimental results. The FSBC framework was applied to the Network Information Filtering System and compared experimentally with the Information Gain (IG) and CHI statistic methods. The experiments show that FSBC outperforms the other two methods in accuracy and recall, and that it performs especially well at higher feature dimensions.
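
The two-step idea described in this abstract (keep category-relevant features, then drop redundant ones by clustering) can be illustrated with a minimal Python sketch. The abstract gives neither the exact evaluation function nor the improved K-Means initialization, so everything specific here is an assumption made for illustration: the scoring formula in `relevance_scores`, the use of binary document-occurrence vectors to compare terms, the choice of the top-scored terms as initial cluster centers, and the names `relevance_scores`, `kmeans`, and `fsbc_sketch`.

```python
import math
from collections import Counter


def relevance_scores(docs, labels):
    """Score each term by its best category (hypothetical formula, not the thesis's).

    score(t, c) = tf_c(t) * (df_c(t) / n_c) * log((n_other + 1) / (df_other(t) + 1))
    High when t is frequent in category c and rare in documents of other categories.
    """
    cats = sorted(set(labels))
    n_docs = Counter(labels)             # number of documents per category
    tf = {c: Counter() for c in cats}    # term frequency per category
    df = {c: Counter() for c in cats}    # document frequency per category
    for tokens, c in zip(docs, labels):
        tf[c].update(tokens)
        df[c].update(set(tokens))

    scores = {}
    for c in cats:
        n_other = len(docs) - n_docs[c]
        for t in tf[c]:
            df_other = sum(df[o][t] for o in cats if o != c)
            s = tf[c][t] * (df[c][t] / n_docs[c]) * math.log((n_other + 1) / (df_other + 1))
            scores[t] = max(scores.get(t, 0.0), s)
    return scores


def kmeans(vectors, init_centers, iters=20):
    """Plain Lloyd's K-Means; returns one cluster index per vector."""
    centers = [list(c) for c in init_centers]
    assign = []
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        assign = [min(range(len(centers)), key=lambda j: math.dist(v, centers[j]))
                  for v in vectors]
        # Update step: move each non-empty cluster's center to the mean of its members.
        for j in range(len(centers)):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def fsbc_sketch(docs, labels, n_relevant, k):
    """Step 1: keep the n_relevant best-scored terms. Step 2: keep one term per cluster."""
    scores = relevance_scores(docs, labels)
    relevant = sorted(scores, key=lambda t: (-scores[t], t))[:n_relevant]
    k = min(k, len(relevant))

    # Assumption: a term is represented by its 0/1 occurrence in each document,
    # so terms that always co-occur get identical vectors and cluster together.
    doc_sets = [set(d) for d in docs]
    vecs = [[1.0 if t in d else 0.0 for d in doc_sets] for t in relevant]

    # Assumption: the k top-scored terms serve as initial centers.
    assign = kmeans(vecs, vecs[:k])

    best = {}
    for t, a in zip(relevant, assign):
        if a not in best or scores[t] > scores[best[a]]:
            best[a] = t
    return sorted(best.values())


docs = [
    ["goal", "match", "team"],
    ["goal", "team", "score"],
    ["stock", "market", "team"],
    ["stock", "market", "price"],
]
labels = ["sport", "sport", "finance", "finance"]
print(fsbc_sketch(docs, labels, n_relevant=4, k=2))  # ['goal', 'market']
```

In this toy corpus, "stock" and "market" always occur together, so they fall into the same cluster and only one of them survives, while "team" scores low in step 1 because it appears in both categories.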
Keywords/Search Tags: information filtering, feature selection, word segmentation, text categorization, clustering