
Research On Feature Selection Algorithm Based On Term Discrete Factor In Text Classification

Posted on: 2021-02-09
Degree: Master
Type: Thesis
Country: China
Candidate: S Han
GTID: 2428330626462952
Subject: Computer software and theory
Abstract/Summary:
Big Data is now widely used. How to process large amounts of text data and find useful information in it quickly and accurately is an urgent problem. Text classification can address this problem, but high-dimensional data reduces its efficiency. Feature selection is the most critical step in text classification: it reduces the dimensionality of the feature space and can improve classification accuracy. This thesis therefore studies feature selection algorithms for text classification.

The thesis describes the text classification process and its related techniques in detail: text preprocessing, text representation models, feature selection algorithms for reducing the dimensionality of the feature space, classification algorithms, and the evaluation indicators used to measure classification performance. The methods and models used in each step are introduced. To address the problem of high data dimensionality, the thesis analyzes existing feature selection algorithms in depth and proposes two new feature selection algorithms based on the distribution of terms. Experimental results show that both algorithms can effectively improve the accuracy of text classification.

(1) A feature selection algorithm (MTFS) based on the term positive rate is proposed. An analysis of commonly used feature selection algorithms shows that most of them do not comprehensively consider document frequency, word frequency, and the distribution of terms within and between classes. The MTFS algorithm proposed in this thesis therefore takes into account both the distribution of terms and the problem of highly sparse terms within a class. In the experiments, MTFS was compared with several popular feature selection algorithms on four common data sets, and the results show that MTFS performs better than the others.

(2) A feature selection algorithm (TIFS) based on word frequency importance is proposed. A comparison of earlier feature selection algorithms shows that many of them ignore an important factor: word frequency. Word frequency is the number of times a feature word appears in the text of the data set, and it is very important for feature selection in text classification. TIFS fully considers the importance of word frequency and introduces a word frequency importance factor and an inter-class aggregation factor as measures of effectiveness in feature selection. In the experiments, NB and SVM classifiers were used to compare TIFS with five popular feature selection algorithms on four data sets. The results show that TIFS can improve the performance of text classification and is an effective feature selection algorithm.
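
The distinction the abstract draws between word frequency and document frequency can be illustrated with a small sketch. This is not the MTFS or TIFS algorithm, whose formulas the abstract does not give; the function names `count_terms` and `top_k` and the toy corpus are assumptions made purely for illustration. It only shows that ranking terms by document frequency alone can hide a term that occurs many times in few documents.

```python
from collections import Counter


def count_terms(docs):
    """Return (word_freq, doc_freq) for a list of tokenized documents.

    word_freq counts every occurrence of a term in the data set;
    doc_freq counts the number of documents that contain the term.
    """
    word_freq = Counter()
    doc_freq = Counter()
    for tokens in docs:
        word_freq.update(tokens)       # every occurrence counts
        doc_freq.update(set(tokens))   # each document counts once per term
    return word_freq, doc_freq


def top_k(counts, k):
    """Pick the k highest-count terms, breaking ties alphabetically."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus: four short, already tokenized documents.
docs = [
    ["goal", "goal", "goal", "match"],
    ["match", "team"],
    ["code", "match"],
    ["code", "bug"],
]

word_freq, doc_freq = count_terms(docs)
by_word_freq = top_k(word_freq, 2)
by_doc_freq = top_k(doc_freq, 2)
```

In this toy corpus "goal" occurs three times but in only one document, so it is selected when ranking by word frequency and dropped when ranking by document frequency. A real feature selection score would also need class labels, which this sketch deliberately leaves out.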
Keywords/Search Tags:Text classification, Feature selection, Term discrete factor, Word frequency importance