Text Classification Analysis Based On Ultrahigh Dimensional Feature Screening

Posted on:2022-06-19

Degree:Master

Type:Thesis

Country:China

Candidate:J X Tang

GTID:2480306539953289

Subject:Applied Statistics

Abstract/Summary:

Text classification is a key technology in natural language processing that aims to assign texts to predefined categories based on their feature words. The number of feature words is typically in the tens of thousands, while the number of documents (the sample size) is relatively small, so text data has the characteristics of ultrahigh dimensional data. For classification problems on ultrahigh dimensional data, traditional machine learning procedures struggle to complete the task efficiently because of the high dimensionality and the high computational cost. Statisticians therefore began to explore ways of reducing the dimension and proposed a series of ultrahigh dimensional feature screening procedures. These procedures reduce the dimension by removing feature words (variables or features) that are irrelevant to the categories and retaining the relevant ones; machine learning procedures are then used for feature selection and classification. This thesis studies the theory and application of feature screening in text classification. The main research contents are as follows.

(1) From the perspective of the difference between the conditional density function and the unconditional density function, this thesis proposes an ultrahigh dimensional feature screening procedure, named JS, based on the Jensen-Shannon divergence. Compared with existing ultrahigh dimensional feature screening procedures, the JS procedure has the following advantages: first, it does not require an underlying model; second, it still achieves better screening performance when variables follow heavy-tailed distributions or the data are unbalanced; third, under some regularity conditions, it satisfies the sure screening property, that is, as the sample size tends to infinity, the probability that all important variables are retained tends to 1. Finally, Monte Carlo simulation studies and a real-data analysis further demonstrate the effectiveness of the JS procedure.

(2) To reduce the correlation among features and further improve classifier performance, we propose a text classification procedure, named JS-PCA, based on principal component analysis (PCA) and the JS screening procedure. The JS-PCA procedure applies PCA to the feature space screened by the JS procedure to reduce feature correlation, and machine learning procedures are then used to solve the text classification problem. The effectiveness of the proposed JS-PCA procedure is verified by an analysis of the Sohu news data set.
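The screening idea in (1) can be illustrated with a minimal sketch. The abstract does not give the exact JS statistic, so the utility below is an assumption made for illustration only: for each discrete feature (for example, a word count), it takes the class-weighted Jensen-Shannon divergence between the empirical conditional distribution given each category and the empirical unconditional distribution, and keeps the d highest-scoring features. The names js_utility and js_screen and the parameter d are hypothetical and do not come from the thesis.

```python
import math
from collections import Counter


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions stored as dicts value -> probability."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    keys = set(p) | set(q)
    # Mixture distribution; it is positive wherever p or q is positive.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def empirical_dist(values):
    """Empirical probability mass function of a list of discrete values."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}


def js_utility(feature, labels):
    """Assumed utility (not the thesis's exact statistic): class-weighted JS
    divergence between the conditional and unconditional distributions."""
    marginal = empirical_dist(feature)
    n = len(labels)
    score = 0.0
    for cls, count in Counter(labels).items():
        conditional = empirical_dist(
            [x for x, y in zip(feature, labels) if y == cls]
        )
        score += (count / n) * js_divergence(conditional, marginal)
    return score


def js_screen(columns, labels, d):
    """Rank features by utility and return the indices of the top d, plus all scores."""
    scores = [js_utility(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:d], scores
```

Under these assumptions, a feature whose distribution is the same in every category scores zero, while a feature whose distribution shifts across categories scores higher and is retained. In a JS-PCA style pipeline as described in (2), the retained columns would then be passed to PCA and a classifier; that step is omitted here.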

Keywords/Search Tags:

Text classification, Ultrahigh dimensional data, Feature screening, Jensen-Shannon divergence

Related items

1	Feature Screening Of Ultra-high Dimensional Classification Data With Exposure Variables
2	Feature Screening Ultrahigh Dimensional Longitudinal Data
3	Feature Screening For Ultrahigh-dimensional Categorical Data
4	Research And Application Of Variable Method For Ultrahigh Dimensional Data
5	Study On Ultrahigh Dimensional Feature Screening
6	Study On Ultrahigh Dimensional Feature Screening And Its Application
7	Research On Feature Screening Method For Ultrahigh Dimensional Discriminant Analysis Data
8	Feature Screening Ultrahigh Dimensional With Surrogate Data
9	Gini-Index Based Feature Screening For Ultrahigh Dimensional Categorical Data
10	Research On Several Feature Screening Problems Under Ultrahigh Dimensional Linear Models