Text Classification Analysis Based On Ultrahigh Dimensional Feature Screening

Posted on: 2022-06-19
Degree: Master
Type: Thesis
Country: China
Candidate: J X Tang
GTID: 2480306539953289
Subject: Applied Statistics
Abstract/Summary:
Text classification is a key technology in natural language processing. It aims to assign texts to predefined categories based on their feature words. The number of feature words is in the tens of thousands, while the number of documents (the sample size) is relatively small, so the data are ultrahigh dimensional. For classification problems on such data, traditional machine learning procedures struggle to complete the task efficiently because of the high dimension and the high computational cost. Statisticians therefore began to explore ways of reducing the dimension and proposed a series of ultrahigh dimensional feature screening procedures. These procedures reduce the dimension by removing feature words that are irrelevant to the categories and retaining the relevant feature words (variables or features), after which machine learning procedures are applied for feature selection and classification. This thesis studies the theory and application of feature screening in text classification. The main research contents are as follows.

(1) Starting from the difference between the conditional density function and the unconditional density function, this thesis proposes an ultrahigh dimensional feature screening procedure, named JS, based on the Jensen-Shannon divergence. Compared with existing ultrahigh dimensional feature screening procedures, the JS procedure has the following advantages. First, it does not require an underlying model. Second, it still performs better in screening when the variables follow heavy-tailed distributions or the data are unbalanced. Third, under some regularity conditions, it satisfies the sure screening property: as the sample size tends to infinity, the probability that all important variables are retained tends to 1. Finally, Monte Carlo simulation studies and a real-data analysis further demonstrate the effectiveness of the JS procedure.

(2) To reduce the correlation among features and further improve the performance of the classifier, we propose a text classification procedure, named JS-PCA, based on principal component analysis (PCA) and the JS screening procedure. The JS-PCA procedure uses PCA to simplify the feature space screened by the JS procedure, reducing the correlation among features, and then applies machine learning procedures to solve the text classification problem. The effectiveness of the proposed JS-PCA procedure in text classification is verified by an analysis of the Sohu news data set.
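
The two steps described above, screening by Jensen-Shannon divergence and then applying PCA to the screened features, can be illustrated with a minimal Python sketch. This is not the thesis's estimator, which the abstract does not specify. It assumes histogram estimates of the densities and uses a class-weighted sum of the JS divergences between each within-class histogram and the overall histogram as the screening score. The function names (js_divergence, js_screen_scores, pca_reduce), the bin count, the number of retained features d, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def js_screen_scores(X, y, n_bins=8):
    """Assumed screening score: for each feature, the class-weighted JS divergence
    between its within-class histogram and its overall histogram."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = counts / counts.sum()
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        col = X[:, j]
        edges = np.histogram_bin_edges(col, bins=n_bins)
        overall = np.histogram(col, bins=edges)[0] / len(col)
        for c, w in zip(classes, weights):
            sub = col[y == c]
            cond = np.histogram(sub, bins=edges)[0] / len(sub)
            scores[j] += w * js_divergence(cond, overall)
    return scores


def pca_reduce(X, n_components):
    """Project the centered data onto its leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


# Synthetic stand-in for a document-term matrix: many more features than samples,
# with only the first n_signal features related to the class label.
rng = np.random.default_rng(0)
n, p, n_signal = 200, 500, 5
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, :n_signal] += 3.0 * y[:, None]

scores = js_screen_scores(X, y)
d = 20  # number of features kept after screening (illustrative choice)
selected = np.argsort(scores)[::-1][:d]

# PCA on the screened features gives decorrelated inputs for a downstream classifier.
Z = pca_reduce(X[:, selected], n_components=3)
```

In this sketch the screening step ranks features one at a time, so its cost grows linearly with the number of features, and PCA is run only on the much smaller screened set. Any classifier can then be trained on Z. The abstract does not say which classifiers the thesis uses.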
Keywords/Search Tags: Text classification, Ultrahigh dimensional data, Feature screening, Jensen-Shannon divergence