
Research And Application Of Feature Selection Based On Term Frequency Reordering Of Document Level

Posted on: 2020-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Zhang
GTID: 2428330596479690
Subject: Computer technology
Abstract/Summary:
With the rapid development of information technology, the volume of text data keeps growing, and there is an urgent need to locate useful information efficiently and accurately. Text categorization, a core technology of information processing, has become a key means of meeting this need. In text classification, however, high-dimensional data gives rise to the curse of dimensionality. Feature selection is an effective way to mitigate this problem by reducing dimensionality, so this thesis focuses on feature selection for text classification.

The thesis begins with a brief overview of text categorization and a detailed description of its process, listing common methods for each step. It then proposes two new feature selection methods to address the curse of dimensionality in text classification.

(1) An improved feature selection algorithm based on NDM (TF-NDM). An analysis of common feature selection algorithms shows that most of them rely on document frequency and do not consider term frequency. This thesis therefore introduces a term frequency ratio on top of the document-frequency-based NDM algorithm, which already performs well, and takes both category information and term proportion into account. Experiments on five data sets show that TF-NDM effectively improves classification performance.

(2) A feature selection algorithm based on document specificity and term diversity (DSTD), obtained by further considering how document frequency and term frequency are distributed. At the macroscopic level, the algorithm unifies several ways of computing document frequency; at the microscopic level, it considers the distribution of terms from multiple angles. It introduces two new factors, document specificity and term diversity, and combines them so as to exploit the respective advantages of document frequency and term frequency. The effectiveness of DSTD is verified by a comparative analysis of seven algorithms on three data sets.

In summary, this thesis studies document frequency and term frequency in text data sets in depth. It proposes two feature selection algorithms, from different angles, to address the one-sidedness of feature ranking in feature selection. Both algorithms combine several aspects of the data to select representative features, and experiments show that they are effective.
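The idea behind TF-NDM, scoring a term by document frequency and then adjusting for term frequency, can be illustrated with a small sketch. The abstract does not give the thesis's formulas, so everything below is an assumption made for illustration. NDM is taken in its commonly cited form, |tpr - fpr| / min(tpr, fpr), with a small `eps` substituted when the minimum is zero. The `tf_ratio` weighting, the add-one smoothing, and all function names are hypothetical and are not the author's definition.

```python
from collections import Counter


def class_counts(docs, labels, positive):
    """Count document frequency (DF) and term frequency (TF) per class.

    DF counts each term once per document; TF counts every occurrence.
    """
    df_pos, df_neg, tf_pos, tf_neg = Counter(), Counter(), Counter(), Counter()
    for tokens, label in zip(docs, labels):
        df, tf = (df_pos, tf_pos) if label == positive else (df_neg, tf_neg)
        tf.update(tokens)       # every occurrence
        df.update(set(tokens))  # once per document
    return df_pos, df_neg, tf_pos, tf_neg


def ndm(tpr, fpr, eps=1e-6):
    """Normalized difference of the two class-wise document rates.

    Assumed form: |tpr - fpr| / min(tpr, fpr), with eps replacing a zero minimum.
    """
    return abs(tpr - fpr) / max(min(tpr, fpr), eps)


def tf_weighted_ndm(docs, labels, positive):
    """Score each term by NDM times a hypothetical term-frequency ratio."""
    n_pos = sum(1 for label in labels if label == positive)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must contain at least one document")

    df_pos, df_neg, tf_pos, tf_neg = class_counts(docs, labels, positive)
    scores = {}
    for term in set(df_pos) | set(df_neg):
        tpr = df_pos[term] / n_pos  # share of positive documents with the term
        fpr = df_neg[term] / n_neg  # share of negative documents with the term
        # Hypothetical weighting (not from the thesis): ratio of the larger to
        # the smaller class-wise term count, with add-one smoothing.
        hi = max(tf_pos[term], tf_neg[term])
        lo = min(tf_pos[term], tf_neg[term])
        tf_ratio = (hi + 1) / (lo + 1)
        scores[term] = ndm(tpr, fpr) * tf_ratio
    return scores


docs = [
    ["goal", "goal", "match"],
    ["goal", "team"],
    ["stock", "match"],
    ["stock", "stock", "market"],
]
labels = ["sport", "sport", "finance", "finance"]
scores = tf_weighted_ndm(docs, labels, positive="sport")
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy corpus, "match" appears in one document of each class and scores zero. "goal" and "team" both occur only in sport documents, but "goal" ranks higher because it appears in more of them and is repeated within a document. That repetition is the kind of information a purely document-frequency-based score ignores.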
Keywords/Search Tags:Text classification, Feature selection, Document frequency, Term diversity