With the rapid development of communication technology and the Internet, the volume of available information is growing exponentially, and text, the most typical information carrier, is no exception. To manage and retrieve valuable information, research on automatic text categorization (TC) has become very important. Text categorization is the assignment of predefined categories to documents based on their content, and it is a core task of text mining. This paper describes the basic theory of text categorization, discusses the related techniques, constructs a text representation model based on the vector space model, and studies the currently available feature selection methods and categorization algorithms. The main work is as follows: (1) The whole process of text representation is discussed: word segmentation, building a stop-word list, feature selection, weight computation, and generating the vector space. (2) Four text categorization methods, Naive Bayes, KNN, SVM, and decision trees, are introduced and compared. (3) Three main components, namely word segmentation techniques, feature selection and extraction algorithms, and categorization algorithms, are analysed, and improved algorithms are proposed on that basis. The categorization ability of the system is then evaluated through experiments, whose results show that the improved algorithms are effective and that the categorization ability of the system is satisfactory. (4) Directions for future research on text categorization are discussed.
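The text representation steps listed in item (1) can be illustrated with a minimal sketch. It is not the paper's implementation and rests on several assumptions: whitespace splitting stands in for word segmentation, a document-frequency threshold stands in for feature selection, and TF-IDF is used as one common weighting scheme. The names `STOP_WORDS`, `tokenize`, `build_vocabulary`, `tfidf_vectors`, and the sample documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real system would load a much larger one.
STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}

def tokenize(text):
    """Assumed segmentation: lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_vocabulary(tokenized_docs, min_df=1):
    """Assumed feature selection: keep terms found in at least min_df documents."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    terms = sorted(t for t, n in df.items() if n >= min_df)
    return terms, df

def tfidf_vectors(docs, min_df=1):
    """Return (vocabulary, vectors); each vector holds tf * log(N / df) weights."""
    tokenized = [tokenize(d) for d in docs]
    vocab, df = build_vocabulary(tokenized, min_df)
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

# Illustrative documents, not data from the paper.
docs = [
    "the stock market fell sharply",
    "the football match ended in a draw",
    "the market reacted to the match result",
]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes one row in a term-by-weight matrix, which is the form a classifier such as Naive Bayes, KNN, or SVM would take as input. Raising `min_df` shrinks the vocabulary and is the simplest way to see the effect of feature selection on the vector dimension.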