Automatic Chinese Text Classification System: Chinese Word Segmentation and Classifier Design

Posted on: 2005-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: X G Yang
Full Text: PDF
GTID: 2208360125464036
Subject: Computer system architecture
Abstract/Summary:
In recent years, the rapid development of communication networks and inexpensive mass storage has made more and more information sources available in machine-readable form, so information processing has become increasingly important for obtaining useful information. Text categorization, the automated assignment of natural language texts to predefined categories based on their content, is a task of increasing importance. To meet the realistic requirements of practical, scalable systems that can process real text, we study a Chinese text categorization system from the following two aspects. First, we study Chinese word segmentation, which includes coarse segmentation, unknown word recognition, and part-of-speech tagging with disambiguation. Combining the shortest-path and full-segmentation methods, we present a coarse segmentation model based on the N-shortest-paths method with unigram statistics. For unknown word recognition, we use different methods to recognize numeric phrases, reduplicated words, and names. For names in particular, we use the Viterbi algorithm to find the most probable sequence of context states in a sentence and combine it with local statistics of the text to match the recognition models for personal names, place names, and transliterated terms. Finally, for part-of-speech disambiguation we use the CLAWS algorithm, based on a hidden Markov model (HMM), together with the fact that each word has its own probability distribution over part-of-speech tags. The second aspect is classifier design. This research focuses on feature extraction, text representation, and the implementation of a classifier based on the support vector machine (SVM). For feature extraction, following Claude Shannon's information theory, we delete the words that appear in the stop-word list from the high-frequency word list of each text category.
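The filtering step described above can be sketched in a few lines. This is a minimal sketch, not the thesis's implementation: it assumes the texts are already segmented into word lists, it treats the `top_n` most frequent words of a category as that category's high-frequency list (the abstract does not state the actual threshold), and the names `category_word_lists`, `docs_by_category`, and `stop_words` are hypothetical.

```python
from collections import Counter


def category_word_lists(docs_by_category, stop_words, top_n=2):
    """Return, per category, its high-frequency words minus the stop words.

    docs_by_category maps a category name to a list of segmented texts,
    where each text is a list of word strings (assumed input format).
    """
    lists = {}
    for category, docs in docs_by_category.items():
        # Count every word occurrence across all texts of this category.
        counts = Counter(word for doc in docs for word in doc)
        # Assumption: "high-frequency" means the top_n most common words.
        high_freq = [word for word, _ in counts.most_common(top_n)]
        # Delete the words that also appear in the stop-word list.
        lists[category] = [w for w in high_freq if w not in stop_words]
    return lists


# Toy data with English stand-ins for segmented Chinese words.
example_docs = {
    "sports": [["the", "match", "the", "goal"], ["the", "match", "team"]],
    "finance": [["the", "stock", "market"], ["stock", "the", "bank"]],
}
example_lists = category_word_lists(example_docs, stop_words={"the"}, top_n=2)
```

In this toy data the stop word "the" is among the most frequent words of both categories, so removing it leaves only words that are specific to each category.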
This yields a category-word list for every text category. Based on these results, we design a function that extracts features and computes their term weights. For text representation, we use the Vector Space Model (VSM): features are taken as terms, and each text is represented as a vector D in an N-dimensional space. To implement the classifier, we use an improved linear support vector machine (LSVM) that exploits the particular properties of learning from text data. We add a slack term η to the optimal separating hyperplane, based on the number of training texts that are initially misclassified. Experiments show that the method is satisfactory for text categorization.
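To make the representation and classification steps concrete, the following is a minimal sketch of a vector space model feeding a linear SVM. It is not the thesis's improved LSVM and does not reproduce its slack term η: it uses the standard soft-margin formulation, in which the hinge loss plays the role of the slack variables, trained by plain subgradient descent. The raw term-frequency weights, the penalty `c`, the learning rate, and all function names are assumptions made for illustration.

```python
def to_vector(tokens, vocabulary):
    """Represent a segmented text as a term-frequency vector D.

    Words outside the vocabulary are ignored. Raw counts are an assumed
    weighting; the thesis designs its own term-weight function.
    """
    return [float(tokens.count(term)) for term in vocabulary]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def train_linear_svm(vectors, labels, c=1.0, learning_rate=0.1, epochs=200):
    """Minimise 0.5*||w||^2 + c * sum(max(0, 1 - y*(w.x + b))).

    Labels must be +1 or -1. Uses fixed-step subgradient descent.
    """
    dim = len(vectors[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the regulariser 0.5*||w||^2
        grad_b = 0.0
        for x, y in zip(vectors, labels):
            if y * (dot(w, x) + b) < 1.0:  # margin violated: hinge is active
                for i in range(dim):
                    grad_w[i] -= c * y * x[i]
                grad_b -= c * y
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if dot(w, x) + b >= 0 else -1


# Toy example: +1 = sports, -1 = finance (English stand-ins for Chinese words).
vocabulary = ["match", "goal", "stock", "bank"]
train_texts = [["match", "goal", "match"], ["goal", "match"],
               ["stock", "bank"], ["bank", "stock", "stock"]]
train_labels = [1, 1, -1, -1]
train_vectors = [to_vector(t, vocabulary) for t in train_texts]
weights, bias = train_linear_svm(train_vectors, train_labels)
```

A new text is classified by mapping it through `to_vector` with the same vocabulary and calling `predict(weights, bias, vector)`. A larger `c` penalises margin violations more heavily, which is the usual trade-off between a wide margin and few misclassified training texts.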
Keywords/Search Tags: Text Categorization, Text Segmentation, Classifier, Support Vector Machine