Automatic Text Categorisation Research On Mongolian And Implementation Of The Tool

Posted on:2008-02-07

Degree:Master

Type:Thesis

Country:China

Candidate:D Su

Full Text:PDF

GTID:2178360215991526

Subject:Computer software and theory

Abstract/Summary:

With the development of information technology and the appearance of countless web pages, a vast number of electronic documents have become available, and making good use of them has become a vital problem in the information technology field. Many techniques exist for handling and organizing text data, one of which is text categorization. The goal of text categorization is to classify a free-text document into a fixed set of predefined categories. Over the past decade, the theory of text categorization has changed considerably, and some text categorization systems have become available and are used in practice. Until now, text categorization has been researched extensively for widely used languages, but not for Mongolian. This is because text classification research on Mongolian started quite late and the automatic segmentation of Mongolian words poses certain difficulties.
In machine learning theory, constructing a classifier is the core issue in text categorization. Once a classifier is implemented, it learns from corpora, whose quality is very important for system performance. Besides corpus quality, two major factors affect system performance most. One is the algorithm the classifier uses (including parameter tuning); the other is how the corpus data is preprocessed and how the features of a text are extracted. As in many other pattern classification problems, a successful text categorizer relies on the right model and the right features. In this thesis I explore a simple Mongolian text categorization system. In the preprocessing stage, the system focuses on Mongolian word stemming. In the feature selection stage, four feature selection methods are used to extract features. Two classification algorithms, K-nearest neighbor and support vector machine, are chosen for constructing the classifier.
Compared with the K-nearest neighbor algorithm, it is very difficult to implement a good classifier that uses a Support Vector Machine as its main algorithm, so I chose LibSVM 2.6 (an open-source Support Vector Machine program) as part of my system. After the steps mentioned above, I evaluate the performance of the system. This topic comes from the Inner Mongolian Natural Fund Project: information retrieval technology research on Mongolian (project approval number 200408020805). The development of text categorization greatly promotes the retrieval and application of web information, personalized information push services, and patterns of information acquisition, so it has important practical value.
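
As an illustration of the K-nearest neighbor step in a vector space model, the following is a minimal sketch and not the thesis implementation. It assumes documents have already been segmented and stemmed into tokens, uses TF-IDF weighting with cosine similarity (the abstract does not state which weighting or similarity measure was used), and omits feature selection. The function names and the toy English tokens are hypothetical and stand in for stemmed Mongolian words.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight)."""
    n = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(query_tokens, train_vectors, train_labels, idf, k=3):
    """Label a document by majority vote among its k most similar training documents."""
    # Terms never seen in training carry no weight and are dropped.
    query = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy corpus: hypothetical placeholder tokens, assumed to be already stemmed.
train_docs = [
    ["horse", "race", "festival", "win"],
    ["wrestling", "festival", "champion", "win"],
    ["bank", "loan", "interest", "rate"],
    ["market", "price", "bank", "trade"],
]
train_labels = ["sports", "sports", "economy", "economy"]

vectors, idf = tfidf_vectors(train_docs)
print(knn_classify(["festival", "horse", "champion"], vectors, train_labels, idf, k=3))
```

In a setup like this, the value of k and the term weighting are the tunable parts of the classifier, while stemming and feature selection would determine which tokens reach `tfidf_vectors` in the first place.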

Keywords/Search Tags:

Mongolian, Automatic Text Categorisation, Vector Space Model, K-Nearest Neighbour Classifier, Support Vector Machine

Related items

1	Automatic Classification Research On HTML Document And Implementation Of The Tool
2	The Research And Application Of Automatic Text Classifier Based On Support Vector Machine
3	Research On Some Issues In Support Vector Machines
4	Design And Implementation Of The Technical Text Categorization System
5	Automatic Classification Research On Chinese Web Document Orientation
6	Design And Implementation On The Text Classifier Based On Support Vector Machine
7	Research On Key Techniques For Gait Recognition
8	Research On Support Vector Machine Based Text Classification
9	Studies On Classifiers Based On Decision Boundaries From The Perspective Of Dividing Data Space
10	A Study On Chinese Text Automatic Categorization