With the ever-increasing volume and rapid growth of online information, large-scale text categorization has been studied extensively and has progressed quickly. The large quantity of text and the large number of categories make the feature space highly dimensional, which raises the computational and space complexity of classification algorithms and limits their scalability. Effective dimension reduction of the feature space not only improves the efficiency and effectiveness of classification but also improves the generalization ability of the classifier; implementing dimension reduction is therefore essential.

In this paper, we study the key techniques of text categorization, discuss the necessity of dimension reduction for the feature space, and analyze the classical dimension-reduction methods. Because classical feature extraction algorithms cannot handle feature spaces of such high dimensionality, an approach to reducing the dimensionality of the feature space for large-scale text categorization is presented, using candid covariance-free incremental principal component analysis (CCIPCA) and independent component analysis (ICA). On this basis, a combined method of independent component analysis and information gain (ICA-IG) is adopted. Their effectiveness and feasibility are evaluated in comparative classification experiments.

The experimental results show that the iterative CCIPCA and ICA algorithms require less computational space, can effectively handle large-scale text categorization problems, and improve classification performance; ICA achieves the best performance among CCIPCA, ICA, and ICA-IG on the same data set.
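The appeal of CCIPCA for large-scale text data is that it estimates the leading principal components from a stream of samples without ever forming the covariance matrix, so memory stays proportional to the number of components times the feature dimension. The following is a minimal sketch of the standard CCIPCA update (incremental power iteration with deflation); the function name, the optional amnesic parameter, and the synthetic data in the usage are illustrative choices, not the paper's implementation.

```python
import numpy as np

def ccipca(samples, k, amnesia=0.0):
    """Sketch of candid covariance-free incremental PCA (CCIPCA).

    Estimates the top-k principal components of a stream of
    (mean-centered) sample vectors one sample at a time, keeping only
    a k x d matrix in memory. `amnesia` is the optional amnesic weight
    that down-weights old estimates (0.0 = plain averaging).
    """
    d = samples.shape[1]
    V = np.zeros((k, d))  # unnormalized eigenvector estimates
    for n, sample in enumerate(samples, start=1):
        u = sample.astype(float).copy()
        for i in range(min(k, n)):
            if n == i + 1:
                V[i] = u  # initialize the i-th component from the residual
            else:
                w_old = (n - 1 - amnesia) / n
                w_new = (1 + amnesia) / n
                norm = np.linalg.norm(V[i])
                # incremental power-iteration step toward the i-th eigenvector
                V[i] = w_old * V[i] + w_new * (u @ V[i] / norm) * u
            # deflate: remove the i-th direction before estimating the next
            norm = np.linalg.norm(V[i])
            u = u - (u @ V[i] / norm**2) * V[i]
    # return unit-length component estimates
    return V / np.linalg.norm(V, axis=1, keepdims=True)
```

On synthetic data whose first coordinate dominates the variance, the first returned component aligns closely with that axis after a few hundred samples, which is the behavior the streaming setting relies on.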