Chinese Text Automatic Classification System: Chinese Word Segmentation And Classifier Design

Posted on:2005-12-20

Degree:Master

Type:Thesis

Country:China

Candidate:X G Yang

GTID:2208360125464036

Subject:Computer system architecture

Abstract/Summary:

In recent years, the rapid development of communication networks and inexpensive mass storage has made more and more information sources available in machine-readable form, and information processing has become increasingly important for obtaining useful information. Text categorization, the automated assignment of natural-language texts to predefined categories based on their content, is a task of increasing importance. To meet the requirements of practical, scalable systems that can process real text, we study a Chinese text categorization system from two aspects.

First, we study Chinese word segmentation. The relevant techniques include rough word segmentation, unknown-word recognition, part-of-speech tagging, and disambiguation. Combining the shortest-path method with the full-segmentation method, we present a rough segmentation model based on the N-shortest-paths unigram statistical method. For unknown-word recognition, we use different methods to recognize numeric phrases, reduplicated expressions, and names. For names in particular, we use the Viterbi algorithm to find the most probable sequence of context states in a sentence, and combine it with local statistics of the text to match recognition patterns for appellations, place names, and transliterated terms. Finally, for part-of-speech disambiguation, we use the CLAWS algorithm together with the fact that each word has a different probability of taking each part of speech, based on a hidden Markov model (HMM).

The second aspect is classifier design. This research focuses on feature extraction, text representation, and the implementation of a classifier based on the support vector machine (SVM). For feature extraction, following Claude Shannon's information theory, we delete the words that appear in the stop-word list from the high-frequency word list of each text category, which yields a characteristic-word list for every category. Based on these results, we design a function that extracts features and computes their term weights. For text representation, we use the Vector Space Model (VSM): features are selected as terms, and each text is represented as a vector D in an N-dimensional space. To implement the classifier, we use an improved linear support vector machine (LSVM) that exploits the particular properties of learning with text data. We add a slack term η to the optimal separating hyperplane, based on the number of training texts that were originally misclassified. The experiments show that the method performs satisfactorily for text categorization.
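
The vector space representation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's own weighting function, so the TF-IDF weight used here is an assumption made only for illustration, and the function names, sample documents, and stop-word set are all hypothetical. The input is assumed to be already segmented into words.

```python
import math
from collections import Counter


def build_vocabulary(docs, stop_words):
    """Collect feature terms: every token that is not a stop word, sorted."""
    terms = {tok for doc in docs for tok in doc if tok not in stop_words}
    return sorted(terms)


def tfidf_vector(doc, vocabulary, docs):
    """Represent one segmented document as an N-dimensional weight vector.

    The weight is term frequency times log inverse document frequency.
    This is an assumed stand-in, not the weighting function of the thesis.
    """
    counts = Counter(doc)
    n_docs = len(docs)
    vector = []
    for term in vocabulary:
        # Term frequency relative to the full document length.
        tf = counts[term] / len(doc) if doc else 0.0
        # Number of documents that contain the term.
        df = sum(1 for d in docs if term in d)
        idf = math.log(n_docs / df) if df else 0.0
        vector.append(tf * idf)
    return vector


# Hypothetical, already segmented documents.
docs = [
    ["股票", "市场", "的", "上涨"],
    ["球队", "的", "比赛", "胜利"],
    ["股票", "的", "交易", "市场"],
]
stop_words = {"的"}

vocab = build_vocabulary(docs, stop_words)
vectors = [tfidf_vector(d, vocab, docs) for d in docs]
```

Each entry of `vectors` has one component per term in `vocab`, so every text becomes a point in the same N-dimensional space. Vectors of this kind could then be passed to a linear classifier such as an SVM.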

Keywords/Search Tags:

Text Categorization, Text Segmentation, Classifier, Support Vector Machine

Related items

1	The Research And Implementation Of Automatic Text Categorization For Chinese Web Documents
2	Study On Text Categorization Method Based On Support Vector Machine
3	The Research And Application Of Automatic Text Classifier Based On Support Vector Machine
4	Study On Text Category Oriented Chinese Text Mining And Its Implementation
5	A Study On Chinese Text Automatic Categorization
6	A Study On Text Categorization Based On Machine Learning
7	The Research On Text Categorization Algorithm Based On Support Vector Machine
8	The Research And Implementation Of Chinese Text Categorization
9	Research On Parallel Text Classification Method Based On Support Vector Machine
10	Application For Web Text Categorization Based On Support Vector Machine