With the arrival of the information age, information resources on the Internet are growing explosively. To avoid being overwhelmed by this mass of information, effective measures are needed to categorize and manage it. Text categorization plays an important role in making that information usable: it organizes and manages information effectively, which improves the efficiency of information retrieval.

This paper studies text classification and its related technologies and proposes an approach to Tibetan text categorization that requires no word segmentation. Unlike traditional text classification models, the approach is based on character-level N-Gram language modeling and avoids word segmentation, which saves considerable computing resources during pre-processing.

The paper first surveys recent research on text categorization worldwide. Second, it analyzes the commonly used text representation models and studies the N-Gram model for Tibetan text in more depth, discussing the selection of the parameter N and each function of the Tibetan text categorization system. Third, Chapter Four presents the classifier, the core function of the system, in detail. The paper also proposes a corpus-based Naive Bayes Multinomial classifier, which combines tightly with the N-Gram model and achieves good classification results. Finally, because the bigram feature set contains a large number of overlapped high-degree bigrams and biased high-degree bigrams, we put forward a novel feature reduction method, δ-OR, which raises the δ-degree overlapped bigrams to their corresponding trigrams. Experiments show that the δ-OR method not only reduces the feature set and removes redundant information but also improves the descriptive and discriminative ability of the features. At certain degrees of reduction, categorization performance also improves.
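To make the segmentation-free idea concrete, the following is a minimal sketch of character-level N-Gram features fed to a multinomial Naive Bayes classifier. It is an illustration under stated assumptions, not the paper's implementation: the names `char_ngrams` and `NGramMultinomialNB`, the Laplace smoothing parameter `alpha`, and the toy Latin-script training strings are all hypothetical, and the δ-OR reduction step is not shown because the abstract does not specify it.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of text (no word segmentation)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NGramMultinomialNB:
    """Multinomial Naive Bayes over character n-gram counts (hypothetical sketch)."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n          # n-gram order, the parameter N in the text
        self.alpha = alpha  # Laplace smoothing constant (assumed, not from the paper)

    def fit(self, texts, labels):
        self.num_docs = len(labels)
        self.class_docs = Counter(labels)        # documents per class, for the prior
        self.gram_counts = defaultdict(Counter)  # n-gram counts per class
        for text, label in zip(texts, labels):
            self.gram_counts[label].update(char_ngrams(text, self.n))
        self.vocab = set()
        for counts in self.gram_counts.values():
            self.vocab.update(counts)
        self.total = {lab: sum(c.values()) for lab, c in self.gram_counts.items()}
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior of each class for text."""
        scores = {}
        vocab_size = len(self.vocab)
        for label in self.class_docs:
            score = math.log(self.class_docs[label] / self.num_docs)
            denom = self.total[label] + self.alpha * vocab_size
            for gram in char_ngrams(text, self.n):
                if gram in self.vocab:  # n-grams unseen in training are skipped
                    count = self.gram_counts[label][gram]
                    score += math.log((count + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy data standing in for labeled documents (hypothetical, not Tibetan text).
train_texts = ["abab abab", "baba abab", "xyxy xyxy", "yxyx xyxy"]
train_labels = ["A", "A", "B", "B"]
clf = NGramMultinomialNB(n=2).fit(train_texts, train_labels)
print(clf.predict("abba"), clf.predict("xyx"))
```

Because `char_ngrams` slices the string directly, the same code applies to any Unicode text, which is what lets this kind of model skip word segmentation; how N should be chosen for Tibetan is the question the paper itself addresses.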