
The Research On Text Categorization Technology Based On Partial Least Square

Posted on: 2007-07-27
Degree: Master
Type: Thesis
Country: China
Candidate: Y S Luo
Full Text: PDF
GTID: 2178360185972806
Subject: Computer software and theory
Abstract/Summary:
With the explosive growth of online electronic documents, automated text categorization (or text classification, TC) has become increasingly important over the last 10 years in information retrieval (IR), information filtering, and content management, and it has become a frontier research area in IR and machine learning (ML). Text categorization is the task of automatically assigning predefined categories to free-text documents, and learning-based TC methods have become the mainstream technology. Drawing on the statistical theory of partial least squares regression (PLS) and kernel partial least squares regression (KPLS), this work focuses on learning-based TC techniques.

Effective dimensionality reduction can make the learning task in text classification more efficient and more accurate. Feature selection and feature extraction are the common methods for dimensionality reduction. The advantage of feature selection is that semantic information is preserved, but its text classification performance is not excellent. Feature extraction helps avoid the problems of synonymy and polysemy, but the extracted features are difficult to interpret semantically. We propose a two-step feature selection method based on the LSC (Latent Semantic Classification) model: in the first stage, the LSC model is used to select features; in the second stage, VIP (Variable Importance in Projection) is used to measure the importance of the features, which are then selected according to it. Experiments on the Fudan University Chinese Text Classification Corpus showed that the new approach could capture the semantic information of the categories, and that the features it selected performed better than those selected by other methods under several classical classification algorithms.

The LSC model, which considers both text features and classification information, is essentially a linear model. A nonlinear Kernel Latent Semantic Classification (KLSC) model based on the kernel method is therefore proposed; it can also capture latent semantic structure information. Experiments showed that this model was effective.

Both the LSC model and the KLSC model face a key problem: how to determine the number of latent variable pairs. In both models, this number is controlled by means of a threshold ε. Experiments showed that the more the feature dimensionality increased, the more sensitive the micro-averaged F1 and macro-averaged F1 values were, and that the relationship between the threshold ε and the number of latent variable pairs was linear in the LSC model but nonlinear in the KLSC model. We also found that about 20 concepts could express the semantic information of one category.
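
As an illustration of the VIP (Variable Importance in Projection) score named in the abstract, the following is a minimal sketch, not the thesis's LSC implementation. It assumes a single binary response (one category versus the rest), fits an ordinary PLS1 regression with the NIPALS iteration, and uses the commonly cited VIP formula, in which each feature's squared weight on a component is weighted by the response variance that component explains. The function name `pls1_vip` and the toy data are hypothetical.

```python
import math


def pls1_vip(X, y, n_components):
    """Fit PLS1 by NIPALS and return one VIP score per feature (column of X).

    Assumption: a single response y; this is a generic PLS sketch,
    not the LSC model described in the thesis.
    """
    n, p = len(X), len(X[0])
    # Center every feature column and the response.
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    E = [[row[j] - col_means[j] for j in range(p)] for row in X]
    y_mean = sum(y) / n
    f = [v - y_mean for v in y]

    weights, explained = [], []
    for _ in range(n_components):
        # Weight vector: direction of maximal covariance between X and y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # no covariance left to model
            break
        w = [v / norm for v in w]
        # Score vector (the latent variable) and its squared length.
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        # X loadings and the regression coefficient of y on this score.
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y so the next component models what is left.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        weights.append(w)
        explained.append(q * q * tt)  # response variation explained by this component

    if not explained:
        return [0.0] * p
    total = sum(explained)
    return [
        math.sqrt(p * sum(s * w[j] ** 2 for s, w in zip(explained, weights)) / total)
        for j in range(p)
    ]


# Toy term-count matrix (hypothetical): 6 documents x 3 terms;
# y marks membership in one category.
X = [[3, 1, 0], [2, 0, 1], [3, 1, 1], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
y = [1, 1, 1, 0, 0, 0]
vip = pls1_vip(X, y, n_components=2)
# Rank terms from most to least important; a selector would keep the top ones.
ranked = sorted(range(len(vip)), key=lambda j: vip[j], reverse=True)
print(ranked, [round(v, 3) for v in vip])
```

Because each weight vector has unit length, the squared VIP scores sum to the number of features, so a score above 1 is often read as "more important than average". Any cut-off used for selection is a design choice and is not specified in the abstract.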
Keywords/Search Tags: Text Classification, Latent Semantic Classification, Partial Least Squares Regression, Kernel Method, Kernel Partial Least Squares Regression, Dimensionality Reduction, Feature Selection, Feature Extraction