
Improved Feature Selection Algorithm and Its Application in Text Categorization

Posted on: 2019-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Zhu
GTID: 2428330596458858
Subject: Engineering
Abstract/Summary:
With the development of information technology, text mining has emerged as a way to handle massive volumes of information. Text classification is both a basic algorithm and an important approach in text mining, and it can efficiently extract many kinds of basic attributes from text. The Vector Space Model and the Conditional Probability Model are the classification models commonly used in text classification, and both are built on keywords. The original keyword set consists of every word in the current language model. Without selection, the bag of words or feature vector representing a text becomes very high-dimensional and includes noise words that contribute nothing to classification, so the time complexity of the algorithm rises and its accuracy falls. It is therefore necessary to apply a feature selection algorithm that filters the keywords and keeps only the words that contribute sufficiently to classification, lowering the model dimensionality and raising classification efficiency and accuracy.

This research: 1) analyzed the core modules of Chinese text classification and implemented a complete classification system; 2) proposed the concepts of a "holistic selection method" and a "differentiated selection method" and, based on these concepts, an improved method that builds a feature selection algorithm by combination; 3) based on the improved method, proposed a new feature selection algorithm and tested and analyzed it experimentally using the implemented text classification system.

The main work is as follows. First, through a literature review, the key points and research status of Chinese text classification were studied, including text content extraction, Chinese word segmentation, feature word selection, feature weight calculation, and text classifiers. Second, the optimization direction of feature selection algorithms was examined, and the concepts of a "holistic selection method" and a "differentiated selection method" were proposed. Based on these concepts, this research proposed an improved method for constructing a feature selection algorithm by combination. Third, based on the combination method, this research proposed an improved feature selection algorithm that combines the DF algorithm, the IG algorithm, the CHI representation, and variance in order to lower the model dimensionality and raise the efficiency and accuracy of the classification algorithm. Finally, the improved algorithm was implemented in code as part of a complete Chinese text classification system. Through testing and analysis, the improved algorithm was compared with four classic feature selection algorithms to demonstrate its efficiency.
Keywords/Search Tags: Text Classification, Feature Selection Algorithm, Variance, CHI Algorithm
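
To make the feature selection step described in the abstract more concrete, the following is a minimal Python sketch of two of the classic scores it names: DF (document frequency) and CHI (the chi-square statistic between a term and a class). The abstract does not give the thesis's actual combination rule, so the chaining used here (filter by DF, then rank by the highest per-class CHI score) is only an illustrative assumption, as are the function names, the thresholds, and the toy corpus. The documents are assumed to be already segmented into words.

```python
from collections import Counter, defaultdict

# Hypothetical pre-segmented corpus: (tokens, class label) pairs.
DOCS = [
    (["goal", "match", "team"], "sports"),
    (["team", "goal", "win"], "sports"),
    (["stock", "market", "team"], "finance"),
    (["market", "stock", "price"], "finance"),
]


def document_frequency(docs):
    """Count how many documents each term appears in."""
    df = Counter()
    for tokens, _label in docs:
        df.update(set(tokens))
    return df


def chi_square(docs):
    """Return {term: {label: chi-square score}} using a 2x2 contingency table."""
    n = len(docs)
    label_counts = Counter(label for _tokens, label in docs)
    term_label = defaultdict(Counter)  # term -> label -> docs containing the term
    for tokens, label in docs:
        for term in set(tokens):
            term_label[term][label] += 1

    scores = {}
    for term, per_label in term_label.items():
        term_total = sum(per_label.values())
        scores[term] = {}
        for label, label_total in label_counts.items():
            a = per_label[label]    # docs in the class that contain the term
            b = term_total - a      # docs in other classes that contain the term
            c = label_total - a     # docs in the class that lack the term
            d = n - a - b - c       # docs in other classes that lack the term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            scores[term][label] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_features(docs, min_df=2, top_k=3):
    """Assumed combination: drop rare terms by DF, then rank by max per-class CHI."""
    df = document_frequency(docs)
    chi = chi_square(docs)
    candidates = [term for term in df if df[term] >= min_df]
    # Ties on the CHI score are broken alphabetically to keep the output stable.
    ranked = sorted(candidates, key=lambda t: (-max(chi[t].values()), t))
    return ranked[:top_k]


print(select_features(DOCS))
```

In this toy corpus "team" occurs in three of the four documents but is spread across both classes, so it passes the DF filter yet ranks below "goal", "market", and "stock", each of which occurs in only one class. This is the kind of noise-word filtering the abstract motivates.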