
Research On Feature Selection Algorithm Based On Segmented Term Frequency In Text Classification

Posted on: 2019-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y B Liu
Full Text: PDF
GTID: 2428330566467882
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of network technology, the number of electronic documents has increased dramatically, which makes automatic text classification very important for organizing these documents. The most intractable problem in text categorization is how to deal with the high-dimensional feature space. An excessive number of features affects not only the running time but also the accuracy of classification. As an important part of the text classification process, feature selection can effectively eliminate redundant features and reduce the dimension of the feature space. Therefore, studying feature selection has important practical significance for text classification. With the aim of improving the accuracy and stability of classification, this thesis introduces the basic theories and related technologies of text categorization, including text preprocessing, feature reduction, feature weighting, classifier construction and performance evaluation. On this basis, feature selection is studied in depth and two new feature selection algorithms are proposed.

(1) A novel feature selection approach based on document frequency of segmented term frequency (STF-DF) is proposed. An analysis of existing feature selection algorithms such as document frequency, information gain and the chi-square test shows that they determine document frequency only by whether a feature appears in a document, without considering how many times it appears. This is far from sufficient. Therefore, this thesis puts forward the concepts of segmented term frequency and document frequency of segmented term frequency, and proposes a novel feature selection approach based on the latter. The algorithm takes into account the contribution that the same feature term makes to classification at different frequencies. The experimental results show that the STF-DF method achieves high performance and is an effective feature selection algorithm.

(2) A novel feature selection approach based on inverse class frequency of segmented term frequency (STF-ICF) is proposed. When the traditional ICF method calculates feature importance, it assigns very low weights to terms that appear in all classes. In addition, ICF cannot distinguish between terms with the same class frequency. To address these shortcomings, this thesis proposes a novel feature selection approach based on inverse class frequency of segmented term frequency. Based on the concept of segmented term frequency, two new concepts (class frequency of segmented term frequency and weighted average class frequency) are introduced. Experimental results show that the STF-ICF algorithm achieves good classification performance in terms of micro-F1 and accuracy.
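The two observations that motivate the thesis can be illustrated with a small Python sketch. It is not the STF-DF or STF-ICF algorithm itself, whose definitions the abstract does not give. The segment boundaries in `SEGMENTS`, the function names and the toy corpus are all assumptions made for illustration, and the ICF formula used is one common textbook form, log(number of classes / class frequency), which may differ from the variant used in the thesis.

```python
import math

# Hypothetical segment boundaries (an assumption; the abstract does not specify them):
# a term occurring exactly once, 2 to 3 times, or 4 or more times in a document.
SEGMENTS = [(1, 1), (2, 3), (4, math.inf)]


def document_frequency(docs, term):
    """Classic DF: number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)


def segmented_document_frequency(docs, term, segments=SEGMENTS):
    """For each segment, count the documents whose term count falls inside it."""
    counts = [0] * len(segments)
    for doc in docs:
        tf = doc.count(term)  # term frequency within this document
        for i, (low, high) in enumerate(segments):
            if low <= tf <= high:
                counts[i] += 1
                break
    return counts


def inverse_class_frequency(class_docs, term):
    """Textbook ICF: log(number of classes / number of classes containing the term)."""
    cf = sum(
        1 for docs in class_docs.values() if any(term in doc for doc in docs)
    )
    return math.log(len(class_docs) / cf) if cf else 0.0


# Toy corpus of tokenized documents (illustrative data only).
sports = [
    ["goal", "match"],
    ["goal", "goal", "team"],
    ["goal", "goal", "goal", "goal"],
]
finance = [["stock", "market", "goal"]]
class_docs = {"sports": sports, "finance": finance}

print(document_frequency(sports, "goal"))            # 3
print(segmented_document_frequency(sports, "goal"))  # [1, 1, 1]
print(inverse_class_frequency(class_docs, "goal"))   # 0.0
```

Plain document frequency reports 3 for "goal" and hides the fact that the three documents use the term once, twice and four times, which the per-segment counts expose. Under the assumed ICF formula, "goal" receives a weight of zero because it occurs in both classes, even though it is far more frequent in the sports documents.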
Keywords/Search Tags:Text categorization, Feature selection, Segmented term frequency, Document frequency, Class frequency