
Research On Feature Selection And Weighting Methods Based On Terms Distribution

Posted on: 2018-06-07
Degree: Master
Type: Thesis
Country: China
Candidate: N Fan
Full Text: PDF
GTID: 2348330515469298
Subject: Software engineering
Abstract/Summary:
With the development of Internet information technology, the amount of information presented as electronic text has increased dramatically. Classifying text makes information retrieval more effective and less time-consuming, so automatic text classification has become one of the research hotspots. In text processing, each word (or term) is treated as an independent feature, which gives text data a very high dimensionality. High dimensionality has thus become a problem that must be faced in text classification. Feature selection eliminates irrelevant features and produces a lower-dimensional feature subset, which effectively mitigates the high-dimensionality problem and improves classification performance; research on feature selection algorithms is therefore of great significance. After an informative feature subset has been selected, every document must be represented in vector form, and the weight of each term must be computed with a specific term weighting scheme so that the document can be identified correctly. A good term weighting method assigns different weights to terms according to their discriminating ability, so that text vectors from different classes are more dispersed while vectors within the same class are more similar. This thesis therefore focuses on two parts of the text classification task: feature selection methods and term weighting schemes. The main contributions of this thesis are as follows:

First, this thesis studies existing supervised feature selection methods in depth. Existing methods are mainly based on term frequency (TF) or document frequency (DF) and, to some extent, cannot characterize how a term is distributed within a document. We therefore propose an improved feature selection method that integrates paragraph frequency with the term's distribution among classes, named feature selection based on term distribution among paragraphs and categories (FSPC). Comparative experiments are conducted on the Fudan corpus and the SogouCS corpus with support vector machine and naive Bayes classifiers. The comparative feature selection methods are chi-square, document frequency, information gain, and an intra- and inter-class comprehensive measurement. The experimental results show that FSPC outperforms the comparative methods.

Second, this thesis studies existing supervised term weighting schemes in depth and analyzes their shortcomings. To overcome them, the concept of variance is introduced at a fine-grained level to measure the degree of a term's distribution among different categories, and the inverse class frequency is introduced at a coarse-grained level to measure the term's global distribution. A novel term weighting scheme named term frequency-distribution difference (TF-DD) is proposed in this thesis. A comparative experiment is conducted on the WebKB dataset against term frequency-inverse document frequency (TF-IDF) and term frequency-inverse gravity moment (TF-IGM), using support vector machine and k-nearest neighbor classifiers. The experimental results show that TF-DD outperforms the comparative term weighting schemes.
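The abstract describes TF-DD only at a high level: a fine-grained variance over the term's distribution among categories, combined with a coarse-grained inverse class frequency. The sketch below is one plausible reading of that description, not the thesis's actual formula; the function name, its arguments, and the multiplicative combination are all assumptions made for illustration.

```python
import numpy as np

def tf_dd(tf_in_doc, tf_per_class, classes_with_term, n_classes):
    """Weight a term as TF x distribution difference (a TF-DD-style sketch).

    tf_in_doc         -- raw frequency of the term in the current document
    tf_per_class      -- total frequency of the term in each class
    classes_with_term -- number of classes containing the term at least once
    n_classes         -- total number of classes
    """
    # Fine-grained factor: variance of the term's normalized frequency
    # across classes. High variance means the term concentrates in a few
    # classes and is therefore more discriminative.
    p = np.asarray(tf_per_class, dtype=float)
    total = p.sum()
    variance = (p / total).var() if total > 0 else 0.0

    # Coarse-grained factor: inverse class frequency, analogous to IDF
    # but counted over classes rather than documents.
    icf = np.log(n_classes / classes_with_term)

    # Hypothetical multiplicative combination (the exact formula is not
    # given in this abstract).
    return tf_in_doc * (1.0 + variance) * (1.0 + icf)
```

Under this reading, a term that occurs mostly in one class (e.g. `tf_per_class=[10, 0, 0]`) receives a larger weight than a term spread evenly over all classes (e.g. `[4, 3, 3]`), matching the stated goal of dispersing vectors of different classes while keeping same-class vectors similar.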
Keywords/Search Tags: Text Classification, Feature Selection, Term Weighting, Term Distribution