The Research On Text Categorization Technology Based On Partial Least Square

Posted on:2007-07-27

Degree:Master

Type:Thesis

Country:China

Candidate:Y S Luo

Full Text:PDF

GTID:2178360185972806

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

With the explosive growth of the online electronic documents, the automated text categorization (or text classification, TC) is becoming more important in the applications of information retrieval (IR), information filter and content management in the last 10 years, and has become forward research area of IR and machine learning (ML). Text categorization is the procedure of automatically assigning predefined categories to free text documents, and the TC method based-learning has become mainstream technology. By employinging the statistical theory of partial least square regression (PLS) and kernel partial least square regression (KPLS), our works focus on the TC technique based on the learning approach.Effective dimensionality reduction could make the learning task more efficient and more accurate in text classification. Feature selection and feature extraction are common methods for dimensionality reduction. The advantage of the feature selection is that semantic information is obtained, but the performance in text classification is not excellence. Feature extraction is helpful in avoiding the problems of synonymy and polysemy, but the semantic interpretation of the features is difficult to give. We propose two-step feature selection method based LSC (Latent Semantic Classification Model): in the first stage, the LSC model is used to select features; in the second stage, the VIP (Variable Importance in Projection) is adopted to measure the importance of the features and the features are selected according to it. Experiments on Fudan University Chinese Text Classification Corpus showed that the new approach could capture the semantic information of the categories and performed better than those selected by others with several classical classification algorithms.LSC model which considers both text feature and classification information is virtually a linear model. So a nonlinear Kernel Latent Semantic Classification Model (KLSC) is proposed based on kernel method, and can also capture latent semantic structure information. Experiments showed that this model was effective.Both the LSC model and KLSC model are face to a key problem how to determine the number of the latent variable-pairs. The solution to this problem in them is by means of the threshold ε to control the number. Experiments showed that the more the feature dimensionality increased, the more sensitive micro-averaging F1 value and macro-averaging F1 value were, and that the relationship between threshold ε and the number of the latent variable-pair was linear in the LSC model but nonlinear in the KLSC model. We also found that about 20 concepts could express the semantic information of one category.

Keywords/Search Tags:

Text Classification, Latent Semantic Classification, Partial Least Square regression Regression, Kernel method, Kernel Partial Least Square Regression, Dimensionality Reduction, Feature Selection, Feature Extraction

PDF Full Text Request

Related items

1	The Research On Latent Semantic Classification Model
2	Support vector machine/regression feature selection with an application towards classification
3	Study On The Algorithm Of COD Monitoring Sensor Based On The Partial Least-square Regression And Turbidity Compensation
4	Automatic Text Multi-Classification Model Based On Class Latent Semantic
5	Research Of Partial Least Squares Regression Algorithm Based On Optimal Selection Of Latent Variables
6	Classification Of High Dimensional Data Based On Dimensionality Reduction And Kernel Learning
7	Face Recognition Method Of Identifying Projection And Regression Classification
8	Age Estimation Based On Face Image
9	Research On Ensemble Regression Learning Based On Classifier Selection And Multiple Kernel Selection Under Least Squares Framework
10	Research On Partial Least Squares Dimension Reduction Based Facial Age Estimation