
Research On Feature Description And Classifier Construction Algorithm In Chinese Text Classification

Posted on: 2007-12-09  Degree: Master  Type: Thesis
Country: China  Candidate: L Liu
GTID: 2178360185974706  Subject: Computer software and theory
Abstract/Summary:
With the rapid development and spread of the Internet, the volume of electronic text is growing exponentially. An important line of research concerns how to extract knowledge and models from this large body of online documents. As a key technology for organizing and processing large amounts of document data, text classification can largely resolve the problem of information disorder and makes it easier for users to find the information they need quickly. Moreover, text classification has broad application prospects as the technical basis of information filtering, information retrieval, search engines, and similar systems.

This thesis focuses on two pivotal problems in text classification: text feature description methods and classifier construction algorithms. The main contributions are as follows:

1. A context-based text feature description method is proposed. Text feature description is the basic problem in text classification; it aims to represent documents by computable features. The most widely used method treats a text as a set of discrete words, the so-called "bag of words" model. Under this model, feature selection and weighting consider only the frequency of individual words and ignore the relations between words in context. In general, however, words that occur in the same context convey related meanings about the same topic. The "bag of words" model therefore loses context information that matters for improving classification precision. This thesis puts forward a new feature description method based on text context. First, a commonly used feature selection method is employed to obtain an initial set of feature words. Second, the dependence between words in a given context is computed using Mutual Information (MI). Then, words that have high dependence within the same context are extracted, and the weight of each feature is adjusted accordingly. The results show that the new method outperforms traditional methods.

2. An algorithm is designed for training text classifiers based on SVM active learning. Text classification algorithms are supervised, which means that classifier training requires human-labeled data with fixed classes. Generally, the more labeled data there is, the higher the classifier's accuracy. In practice, however, the training set often contains a great deal of redundant data that does not contribute to classification accuracy, while hand-labeled data is an expensive resource. One vital problem in text classification is therefore how to reduce the amount of labeled data while maintaining adequate accuracy. This thesis presents a new text categorization algorithm for performing active learning with support vector...
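
The context-based weighting outlined in item 1 (select initial features, measure word dependence with MI, then adjust feature weights) can be illustrated with a minimal Python sketch. This is not the thesis's actual algorithm, whose details are not given in the abstract. The sliding-window definition of "context", the use of pointwise mutual information, the function names `window_counts`, `pmi`, and `context_weights`, and the `threshold` and `boost` parameters are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def window_counts(docs, features, window=3):
    """Count feature words and unordered feature pairs over sliding windows.

    Assumption: a "context" is a window of `window` consecutive tokens.
    """
    word_n, pair_n, total = Counter(), Counter(), 0
    for tokens in docs:
        for i in range(max(1, len(tokens) - window + 1)):
            seen = sorted(set(tokens[i:i + window]) & features)
            total += 1
            word_n.update(seen)
            # `seen` is sorted, so every pair is stored as (smaller, larger).
            pair_n.update(combinations(seen, 2))
    return word_n, pair_n, total


def pmi(a, b, word_n, pair_n, total):
    """Pointwise mutual information (base 2) of two words sharing a window."""
    key = (a, b) if a < b else (b, a)
    joint = pair_n[key]
    if joint == 0:
        return float("-inf")  # the pair was never observed together
    return math.log2(joint * total / (word_n[a] * word_n[b]))


def context_weights(tokens, features, stats, threshold=0.5, boost=0.5):
    """Term frequency, increased for each strongly associated co-occurring feature.

    Hypothetical adjustment rule: weight = tf * (1 + boost * partners), where
    `partners` counts other features in the document with PMI >= threshold.
    """
    tf = Counter(t for t in tokens if t in features)
    weights = {}
    for word, freq in tf.items():
        partners = sum(
            1 for other in tf
            if other != word and pmi(word, other, *stats) >= threshold
        )
        weights[word] = freq * (1.0 + boost * partners)
    return weights


if __name__ == "__main__":
    corpus = [
        ["stock", "market", "rises"],
        ["stock", "market", "falls"],
        ["football", "match", "today"],
        ["football", "match", "result"],
    ]
    feats = {"stock", "market", "football", "match"}
    stats = window_counts(corpus, feats)
    print(context_weights(["stock", "market", "stock"], feats, stats))
```

In this sketch a feature's weight rises only when other features that tend to share its context also appear in the document, which is one simple way to reintroduce the context information that a plain bag-of-words weighting discards.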
Keywords/Search Tags: Text Classification, Feature Extraction, Machine Learning, Active Learning, Support Vector Machine