| With the development of information technology, machine learning and pattern recognition have matured and are now applied in many areas; one important research direction is statistical natural language processing. With the rise of the Internet, the volume of electronic text written in natural language is growing explosively, and a central goal of natural language information processing is to acquire and manage this information effectively. These problems have prompted a great deal of research and many applications in natural language processing, among which text categorization, as a foundation of information retrieval, has received particular attention. Text categorization is mainly divided into two stages and is realized using natural language processing, machine learning, pattern recognition, and text mining techniques; the theoretical value of text classification research is therefore reflected in these technologies. Text classification can effectively improve online information retrieval: it improves the way information is acquired and is also an important aspect of content security. Classification performance has therefore become a focus of attention, and research on the text classification task and its engineering applications is of considerable significance. Text categorization and its related technologies have already been studied to some extent in existing work. 
The thesis first introduces the current state of text classification and the significance of the research. It then describes the text classification process and the technologies involved, covering Chinese word segmentation methods, feature selection methods, and text classification algorithms. Next, it presents the design of the text categorization system. To eliminate ambiguity in three-character overlapping ambiguous phrases and to handle stop words, the best-match word segmentation method is improved. Feature selection is based on KL divergence combined with TF-IDF weights, so that the selected features represent the text more accurately and lay a good foundation for classification. Finally, the thesis compares the Bayes algorithm, simple vector distance classification, and the KNN (k-nearest neighbors) algorithm in terms of classification results and time complexity, and selects the more practical algorithm. |
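
The feature selection step summarized above can be illustrated with a minimal sketch. The abstract does not give the thesis's exact scoring formula, so the version below is an assumption: each term t is scored by the KL divergence between the class distribution given the term, P(c|t), and the class prior, P(c), and the selected terms are then weighted with a plain TF-IDF. The function names `kl_feature_scores`, `select_top_k`, and `tfidf_vectors`, the add-one smoothing, and the toy English tokens are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def kl_feature_scores(docs, labels, smoothing=1.0):
    """Score each term t by KL(P(c|t) || P(c)) summed over classes c.

    docs is a list of token lists, labels the matching class labels.
    P(c|t) is estimated from per-class document frequencies with
    additive smoothing (an assumption, not taken from the thesis).
    """
    classes = sorted(set(labels))
    n_docs = len(docs)
    class_count = Counter(labels)

    # Per-term, per-class document frequency.
    doc_freq = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term][label] += 1

    scores = {}
    for term, per_class in doc_freq.items():
        total = sum(per_class.values()) + smoothing * len(classes)
        score = 0.0
        for c in classes:
            p_c_given_t = (per_class[c] + smoothing) / total
            p_c = class_count[c] / n_docs
            score += p_c_given_t * math.log(p_c_given_t / p_c)
        scores[term] = score
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def tfidf_vectors(docs, vocabulary):
    """Build sparse TF-IDF vectors restricted to the selected vocabulary."""
    vocabulary = set(vocabulary)
    n_docs = len(docs)
    doc_freq = Counter(
        t for tokens in docs for t in set(tokens) if t in vocabulary
    )
    vectors = []
    for tokens in docs:
        tf = Counter(t for t in tokens if t in vocabulary)
        vectors.append(
            {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
        )
    return vectors


# Toy corpus: already-segmented documents with two classes.
docs = [
    ["ball", "game", "the"],
    ["game", "team", "the"],
    ["stock", "market", "the"],
    ["market", "bank", "the"],
]
labels = ["sports", "sports", "finance", "finance"]

scores = kl_feature_scores(docs, labels)
vocabulary = select_top_k(scores, 2)
vectors = tfidf_vectors(docs, vocabulary)
```

In this toy corpus a term that occurs equally often in both classes (such as "the") receives a score of zero, while terms concentrated in one class score higher, which is the behavior one would want from a class-discriminative feature selector. The resulting sparse vectors could then be passed to any of the classifiers the thesis compares.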