With the development of information technology and the appearance of countless web pages, a vast number of electronic documents have been produced. How to make good use of these documents has therefore become a vital problem in the field of information technology. Many techniques are used for handling and organizing text data, one of which is text categorization. The goal of text categorization is to classify a free-text document into a fixed number of predefined categories. Over the past decade, the theory of text categorization has changed considerably, and some text categorization systems have become available and been put to practical use. Until now, text categorization has been researched extensively for widely used languages but not for Mongolian. This is because research on Mongolian in the text classification domain started quite late, and the automatic segmentation of Mongolian words poses certain difficulties.
In machine learning theory, the construction of a classifier is the core issue in text categorization. Once a classifier has been implemented, it learns from corpora, whose quality is very important for system performance. Besides the quality of the corpora, two major factors affect system performance the most. One is the algorithm that the classifier uses (including parameter tuning), and the other is how the corpus data are preprocessed and how the features of a text are extracted. As in many other pattern classification problems, a successful text categorizer relies on the right model and the right features. In this thesis I explore a simple Mongolian text categorization system. In the preprocessing stage, the system focuses on Mongolian word stemming. In the feature selection stage, four feature selection methods are used to extract features. Two classification algorithms, the K-nearest neighbor method and the support vector machine, are chosen for constructing the classifier.
Compared with the K-nearest neighbor algorithm, a good classifier that adopts the support vector machine as its main algorithm is much more difficult to implement. I therefore use LibSVM 2.6 (an open-source support vector machine program) as a part of my system. After the steps mentioned above, I evaluate the performance of the system. This topic comes from the Inner Mongolian Natural Fund Project "Information retrieval technology research on Mongolian" (project approval number 200408020805).
The development of text categorization greatly promotes the retrieval and application of web information, personalized information push services, and new patterns of information acquisition. It therefore has important practical value.
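To illustrate the K-nearest neighbor step named above, the following is a minimal sketch and not the system built in this thesis. It assumes that documents have already been stemmed into token lists, that raw term frequencies serve as features, and that cosine similarity measures closeness; none of these choices is stated in the text. The names `cosine`, `knn_classify`, and `TRAINING_SET` are hypothetical, and the English tokens are placeholders standing in for Mongolian stems.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def knn_classify(tokens, training_set, k=3):
    """Return the majority label among the k most similar training documents."""
    query = Counter(tokens)
    scored = sorted(
        (
            (cosine(query, Counter(doc_tokens)), label)
            for doc_tokens, label in training_set
        ),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus of already-stemmed tokens (placeholders, not real data).
TRAINING_SET = [
    (["ball", "team", "goal"], "sport"),
    (["team", "match", "goal"], "sport"),
    (["bank", "loan", "rate"], "finance"),
    (["bank", "market", "rate"], "finance"),
]

predicted = knn_classify(["goal", "team", "match"], TRAINING_SET, k=3)
```

In this sketch the classifier has no training phase beyond storing the labeled documents, which is one reason the K-nearest neighbor method is simpler to implement than a support vector machine.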