
Text Categorization Based On Regularized Linear Models

Posted on: 2013-03-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W B Zheng
Full Text: PDF
GTID: 1228330395989248
Subject: Computer Science and Technology
Abstract/Summary:
Text is one of the most fundamental and important carriers of information. With the development of information technology, the volume of text information is growing explosively, so organizing and managing this massive amount of information, and retrieving the required information quickly, accurately, and comprehensively, has become a great challenge. Text categorization is a powerful approach to organizing and managing text information, and it is also an important underpinning of information retrieval and data mining.

This thesis first reviews related research on text categorization. Then, building on the regularized linear model and its recent developments, it focuses on several aspects: dimensionality reduction, text representation, fast learning of classifiers, and the model consistency between dimensionality reduction and classification. The major contributions of this dissertation are as follows:

1. A dimensionality reduction approach combining category information fusion with non-negative matrix factorization (NMF) is proposed. Because multi-label category information is difficult to exploit in traditional NMF-based dimensionality reduction, the thesis fuses the category information of documents into the factorization via a category coding and a dimensionality extension, which enforces the discriminability of the basis vectors. A non-negative matrix factorization algorithm is then developed in which the basis vectors are driven toward orthogonality to reduce redundant information. Through truncation and transformation of the factor matrices, dimensionality reduction is implemented, mapping documents from the high-dimensional term space into a low-dimensional semantic subspace spanned by the non-negative basis vectors. Experimental results show that the proposed approach retains good classification performance even at very low dimensionality.
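The category-fusion idea in contribution 1 can be sketched in a few lines: stack scaled label-indicator rows onto the term-document matrix (the "dimensionality extension"), factorize the augmented matrix with standard multiplicative NMF updates, and truncate the label rows from the learned basis. This is only an illustrative sketch under those assumptions, omitting the thesis's orthogonality-driving term; the function name `nmf_with_labels` and all parameters are hypothetical.

```python
import numpy as np

def nmf_with_labels(X, Y, k, alpha=1.0, iters=200, seed=0):
    """Illustrative sketch: fuse multi-label category information into NMF
    by stacking scaled label indicators onto the term-document matrix
    (the 'dimensionality extension'), then factorizing the result.
    X: (terms, docs) non-negative; Y: (labels, docs) 0/1 indicators.
    The orthogonality penalty described in the text is omitted here."""
    rng = np.random.default_rng(seed)
    A = np.vstack([X, alpha * Y])            # augmented (terms+labels, docs)
    m, n = A.shape
    W = rng.random((m, k)) + 1e-3            # non-negative basis
    H = rng.random((k, n)) + 1e-3            # non-negative document codes
    for _ in range(iters):                   # standard multiplicative updates
        H *= (W.T @ A) / (W.T @ W @ H + 1e-9)
        W *= (A @ H.T) / (W @ (H @ H.T) + 1e-9)
    W_terms = W[:X.shape[0]]                 # truncate label rows -> term-space basis
    return W_terms, H
```

The truncated basis `W_terms` spans the low-dimensional semantic subspace, while `H` gives each training document's non-negative representation in it.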
2. A novel method, non-negative sparse semantic coding, is presented for text categorization. It addresses the dense-representation problem, which is inconsistent with our common knowledge, as well as the issues that popular sparse coding methods are time-consuming and that their dictionaries may contain negative entries. An efficient algorithm is developed to construct a non-negative dictionary that contains as many discriminative semantic concepts and as little redundancy as possible. In the low-dimensional semantic subspace spanned by this dictionary, every document can then be represented in a non-negative sparse form consistent with our common knowledge. Experimental results show that the proposed approach achieves good performance and offers better interpretability.

3. A text categorization method based on the extreme learning machine (ELM) is presented. ELM is a fast-developing machine learning technique of recent years. Its model can generally be obtained analytically, which avoids the convergence difficulties of traditional methods and yields a very fast learning speed. To handle the problems that arise when ELM is applied to high-dimensional, sparse text data, the thesis constructs a regularized extreme learning machine (RELM), together with its analytical solution and a theoretical proof that guarantees the existence of that solution. A classification method is then derived from the structural features of the model. Experimental results show that the proposed method obtains competitive performance in most cases and learns faster than conventional popular algorithms such as back-propagation neural networks and support vector machines.

4. A regularized regression model based on a grouped structure is proposed. In general, a regression model with a lasso constraint keeps dimensionality reduction and classification consistent. However, correlation among text features can lead to an excessively sparse solution for such a model (i.e., some important discriminative features may be discarded). In this thesis, a grouped structure is constructed with a clustering algorithm according to the correlation of text features, and is then embedded into a logistic regression model in a between- and within-group sparse manner. Groups containing many important features can thus be selected even when the features within them are highly correlated, while the noise inside the selected groups is discarded during model fitting. Classification is then carried out within this model. Experimental results show that the proposed method achieves a good tradeoff between performance and sparsity in most scenarios.
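The non-negative sparse representation in contribution 2 can be illustrated with a generic solver: given a non-negative dictionary, compute a sparse non-negative code for a document by projected ISTA on an l1-penalized least-squares objective. This is a minimal sketch of the general technique, not the thesis's own dictionary-construction or coding algorithm; the name `nn_sparse_code` and the parameter values are hypothetical.

```python
import numpy as np

def nn_sparse_code(D, x, lam=0.1, iters=300):
    """Sketch of non-negative sparse coding: represent document x as a
    sparse, non-negative combination of dictionary atoms D (terms x k),
    via projected ISTA on  min_h 0.5*||x - D h||^2 + lam*||h||_1, h >= 0."""
    L = np.linalg.norm(D, 2) ** 2 + 1e-9      # Lipschitz constant of the gradient
    eta = 1.0 / L
    h = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = D.T @ (D @ h - x)
        # soft-threshold then project onto the non-negative orthant
        h = np.maximum(h - eta * grad - eta * lam, 0.0)
    return h
```

The resulting code `h` is non-negative and sparse, so each document is described by a few additively combined semantic concepts.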
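The closed-form flavor of the RELM in contribution 3 can be conveyed with a standard ridge-regularized ELM: random hidden-layer weights, a nonlinear activation, and output weights solved analytically, where the l2 term guarantees the system matrix is invertible (the existence of the solution, as discussed above). A sketch under those assumptions, not the thesis's exact formulation; `relm_train`/`relm_predict` and all parameter values are hypothetical.

```python
import numpy as np

def relm_train(X, T, n_hidden=50, lam=1e-2, seed=0):
    """Sketch of a regularized ELM: random input weights, tanh hidden
    layer, and a closed-form ridge solution for the output weights.
    X: (n, d) inputs; T: (n, c) one-hot targets."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))   # random, never trained
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)                            # hidden activations
    # (H^T H + lam*I) is positive definite, so the solution always exists
    beta = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ T)
    return W, b, beta

def relm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta                  # class scores
```

Because training reduces to one linear solve, learning is typically much faster than iterative back-propagation, which is the speed advantage the text describes.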
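The between- and within-group sparsity in contribution 4 can be sketched as a sparse-group-lasso-style logistic regression fit by proximal gradient: an l1 prox drives within-group sparsity, and a block soft-threshold over each feature cluster drives between-group sparsity. This is an illustrative sketch of the penalty structure only; the feature clustering step is assumed to have already produced `groups`, and the function name and parameters are hypothetical.

```python
import numpy as np

def sparse_group_logistic(X, y, groups, lam_g=0.05, lam_1=0.01,
                          eta=0.1, iters=500):
    """Sketch of between- and within-group sparse logistic regression
    (a sparse-group-lasso-style penalty) fit by proximal gradient.
    groups: list of feature-index arrays from a clustering step; y in {0,1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w = w - eta * (X.T @ (p - y)) / n                        # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - eta * lam_1, 0)  # within-group sparsity
        for g in groups:                                          # between-group sparsity
            norm = np.linalg.norm(w[g])
            if norm > 0:
                w[g] *= max(1.0 - eta * lam_g / norm, 0.0)        # block soft-threshold
    return w
```

Whole groups of noisy, correlated features can be shrunk to zero together, while the l1 term still prunes noise inside the groups that survive, which is the tradeoff between performance and sparsity described above.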
Keywords/Search Tags: Text Categorization, Regularization, Linear Model, Dimensionality Reduction, Non-negative Matrix, Semantic, Sparse Constraint, Multi-Label, Extreme Learning Machine