
Research On Feature Selection Method Based On Text Category Relevance Degree And Latent Semantic Analysis

Posted on: 2019-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2428330563453731
Subject: Computer application technology
Abstract/Summary:
Text is one of the most important carriers of information, and the volume of text data is growing faster than expected. Automatic text classification has become an effective way to organize massive text collections. In processing text data, however, each word or phrase in a document is usually treated as an independent feature, so the text feature space can reach thousands or even tens of thousands of dimensions, and such high-dimensional data may contain a great deal of irrelevant information. This is the so-called "curse of dimensionality", an unavoidable problem in text classification. Feature selection removes irrelevant features from the original feature set to produce a reduced feature subset. Its purpose is to lower the dimension of the feature space and reduce the interference of noise, thereby improving the performance of subsequent text categorization tasks. This dissertation therefore focuses on feature selection, a key step in text categorization. The main research work is as follows:

1. A study of current feature selection methods shows that most of them consider only document frequency or only term frequency, and do not fully account for how the intra-class and inter-class distribution of a feature affects its importance. To address this, a new feature selection method based on category relevance degree is proposed, which takes the intra-class and inter-class distribution of features into consideration. It combines document frequency, term frequency, and inverse category frequency into a new evaluation function that measures the importance of a feature for text classification. Experiments on the Fudan University data set and the 20 Newsgroups data set use a support vector machine classifier and a Naive Bayes classifier. The proposed method is compared with four classical feature selection methods: document frequency, information gain, the chi-square statistic, and a term-frequency-based measure of feature importance. The experimental results show that the proposed method improves text classification performance more than the other four methods.

2. A new two-stage feature selection method is proposed by combining the category relevance degree method with latent semantic analysis (LSA). In the first stage, the category relevance degree method selects the most representative features of the original feature set to form a reduced feature subset. In the second stage, because most methods ignore the relationships between features, LSA is used to construct a new latent semantic space, which takes the semantic relationships between features into account and further reduces the dimension of the feature space. For comparison, each of the four classical feature selection methods (document frequency, information gain, the chi-square statistic, and the term-frequency-based measure of feature importance) is also combined with LSA, forming four classical two-stage feature selection methods. On the Fudan University data set and the 20 Newsgroups data set, a support vector machine classifier is used to compare the five two-stage methods. The experimental results show that the proposed method effectively reduces the dimension of the feature space and improves text classification performance.
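The two-stage idea described above can be illustrated with a minimal sketch. The abstract does not give the actual evaluation function, so `relevance_scores` below is a hypothetical stand-in that merely combines in-class document frequency, in-class term frequency, and inverse category frequency. The toy corpus, the function names, and the choice of a plain truncated SVD on raw counts for the LSA stage are likewise assumptions made for illustration, not the dissertation's implementation.

```python
import math
from collections import Counter

import numpy as np

# Toy corpus of (category, tokens) pairs, made up for illustration.
DOCS = [
    ("sports", "match goal team goal win".split()),
    ("sports", "team coach match season".split()),
    ("tech", "chip code compiler code".split()),
    ("tech", "code chip team release".split()),
]


def relevance_scores(docs):
    """Hypothetical stage-1 score (not the dissertation's formula).

    For each term, take the maximum over categories of
    in-class DF ratio * log(1 + in-class TF) * ICF, where
    ICF = log(1 + n_categories / n_categories_containing_term).
    """
    categories = sorted({cat for cat, _ in docs})
    docs_per_cat = Counter(cat for cat, _ in docs)
    df = {cat: Counter() for cat in categories}  # docs in class containing term
    tf = {cat: Counter() for cat in categories}  # term occurrences in class
    for cat, tokens in docs:
        tf[cat].update(tokens)
        df[cat].update(set(tokens))
    vocab = sorted({tok for _, tokens in docs for tok in tokens})
    scores = {}
    for term in vocab:
        cats_with_term = sum(1 for cat in categories if df[cat][term] > 0)
        icf = math.log(1 + len(categories) / cats_with_term)
        scores[term] = max(
            (df[cat][term] / docs_per_cat[cat]) * math.log(1 + tf[cat][term]) * icf
            for cat in categories
        )
    return scores


def select_top_k(scores, k):
    """Stage 1: keep the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


def lsa_project(docs, terms, n_components):
    """Stage 2: project documents onto the top right singular vectors
    of the document-term count matrix built from the selected terms."""
    index = {term: i for i, term in enumerate(terms)}
    counts = np.zeros((len(docs), len(terms)))
    for row, (_, tokens) in enumerate(docs):
        for tok in tokens:
            if tok in index:
                counts[row, index[tok]] += 1
    _, _, vt = np.linalg.svd(counts, full_matrices=False)
    return counts @ vt[:n_components].T


selected = select_top_k(relevance_scores(DOCS), k=4)
latent = lsa_project(DOCS, selected, n_components=2)
print(selected, latent.shape)
```

In this sketch a term confined to one category and frequent within it outranks a term spread across categories, which is the intuition behind weighting by intra-class and inter-class distribution. The reduced document vectors in `latent` are what a classifier such as a support vector machine would then be trained on.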
Keywords/Search Tags: Text Categorization, Feature Selection, Feature Space Reduction, Category Relevance Degree, Latent Semantic Analysis