Research On Feature Selection Algorithm Based On Term Discrete Factor In Text Classification

Posted on:2021-02-09

Degree:Master

Type:Thesis

Country:China

Candidate:S Han

Full Text:PDF

GTID:2428330626462952

Subject:Computer software and theory

Abstract/Summary:

Big data is now in widespread use. How to process large volumes of text data and find useful information in them quickly and accurately is an urgent problem. Text classification can address this problem, but high-dimensional data reduces its efficiency. Feature selection is the most critical step in text classification: it reduces the dimensionality of the feature space and can improve classification accuracy. This thesis therefore studies feature selection algorithms for text classification. It describes the text classification process and related technologies in detail, including text preprocessing, text representation models, feature selection algorithms for reducing the dimensionality of the feature space, classification algorithms, and the evaluation metrics used to assess classification performance. To address the problem of high data dimensionality, the thesis analyzes related feature selection algorithms in depth and proposes two feature selection algorithms based on the distribution of terms. Experimental results show that both algorithms effectively improve the accuracy of text classification.

(1) A feature selection algorithm (MTFS) based on the term positive rate is proposed. An analysis of commonly used feature selection algorithms shows that most of them do not comprehensively consider document frequency, word frequency, and the distribution of terms within and between classes. The proposed MTFS algorithm accordingly takes into account both the distribution of terms and the problem of highly sparse terms within a class. In the experiments, MTFS was compared with several popular feature selection algorithms on four common data sets, and the results show that MTFS performs relatively better than the others.

(2) A feature selection algorithm (TIFS) based on word frequency importance is proposed. A comparison of earlier feature selection algorithms shows that many of them ignore an important factor: word frequency, that is, the number of times a feature word appears in the texts of the data set. Word frequency is very important for feature selection in text classification. TIFS fully considers the importance of word frequency and introduces a word frequency importance factor and an inter-class aggregation factor to measure the effectiveness of feature selection. In the experiments, Naive Bayes (NB) and SVM classifiers were used to compare TIFS with five popular feature selection algorithms on four data sets. According to the experimental results, TIFS improves the performance of text classification and is an effective feature selection algorithm.
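
As a rough illustration of where feature selection sits in the pipeline described above, the following Python sketch scores terms with the chi-square statistic and keeps the top k. Chi-square is a widely used baseline and is swapped in here only for illustration: the abstract does not give the MTFS or TIFS formulas, so this is not the thesis's method. The toy corpus, the regex tokenizer, and the function names (tokenize, chi_square_scores, select_top_k) are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs (a stand-in for real preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def chi_square_scores(docs, labels):
    """Score each term by its maximum chi-square statistic over all classes.

    Uses document-level presence/absence only, so word frequency inside a
    document is ignored (the kind of gap the abstract says TIFS targets).
    """
    n = len(docs)
    doc_terms = [set(tokenize(d)) for d in docs]
    class_sizes = Counter(labels)
    df = Counter()           # term -> number of documents containing it
    df_in_class = Counter()  # (term, class) -> documents of that class containing it
    for terms, label in zip(doc_terms, labels):
        for t in terms:
            df[t] += 1
            df_in_class[(t, label)] += 1

    scores = {}
    for t in df:
        best = 0.0
        for cls, n_cls in class_sizes.items():
            a = df_in_class[(t, cls)]  # in class, contains term
            b = df[t] - a              # other classes, contains term
            c = n_cls - a              # in class, lacks term
            d = n - n_cls - b          # other classes, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[t] = best
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus, not one of the thesis's four data sets.
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda monday",
    "project meeting notes review",
]
labels = ["spam", "spam", "ham", "ham"]

scores = chi_square_scores(docs, labels)
selected = select_top_k(scores, 4)
print(selected)
```

The selected terms would then define the reduced feature space passed to a classifier such as NB or SVM. Terms that occur in every document of one class and in no document of the other receive the highest scores, while terms that appear only once rank lower.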

Keywords/Search Tags:

Text classification, Feature selection, Term discrete factor, Word frequency importance

Related items

1	Research On Feature Selection Algorithm Based On Segmented Term Frequency In Text Classification
2	Study On Feature Selection Reselected By Term Frequency In Text Classification
3	Research And Application Of Feature Selection Based On Term Frequency Reordering Of Document Level
4	Research And Application Of Feature Selection And Feature Weighting Algorithm Of Text Classification
5	Research On Text Classification Based On Feature Selection And Feature Weighting Algorithm
6	Research On Some Problems In Text Classification
7	Research And Application On Feature Selection Algorithms Based On Term Distributions In Text Categorization
8	Hybrid Text Feature Selection Method Based On Word Frequency And Word Position
9	Research On Improvement Of Chi-square Feature Selection And Word Vector Text Representation For News Classification
10	A Research On Feature Extraction Applied For Text Classification