Research On High Performance Chinese Text Classification Based On Machine Learning

Posted on: 2010-09-15; Degree: Doctor; Type: Dissertation
Country: China; Candidate: C X Yang; Full Text: PDF
GTID: 1118360302973967; Subject: Computer application technology
Abstract/Summary:
Text classification is an important research topic in information processing. With the development of information technology, and particularly the maturing of machine-learning-based text classification in the 1990s, text classification has been widely applied in areas such as natural language understanding and processing, information organization and management, and content filtering. Its growing use in these areas has in turn promoted further research and made text classification technology a research topic in computer technology.

According to the way the classifier learns, machine-learning-based text classification can be divided into three kinds of methods: supervised, semi-supervised, and unsupervised classification. Supervised text classification is called text categorization (TC). The main goal of TC is, given a text and a training set whose documents carry predefined class labels, to decide which class the text belongs to. Unsupervised classification is called clustering. It groups a set of data so that similarity within each cluster is maximized and similarity between different clusters is minimized. Semi-supervised classification lies between supervised and unsupervised classification. It focuses on how to achieve good performance and generalization, and thus classify texts accurately, when there are not enough training samples or part of the data is missing.

Whatever the classification algorithm, for high-dimensional text, feature extraction and feature selection are important methods of dimensionality reduction, and they play an important role in both computational complexity and classification performance.
They, together with issues such as large-scale data, unstructured data, the curse of dimensionality, and data imbalance, are receiving attention from more and more researchers.

This dissertation focuses on Chinese text classification and concentrates on four areas: feature extraction, feature selection, text categorization, and clustering. First, we propose three algorithms: a feature extraction algorithm based on sentence elements, a balanced feature selection algorithm, and an algorithm for the lower-limit dimensionality of feature selection. Second, we propose a balanced feature indexing KNN classification algorithm and a feature compensation KNN algorithm, and then apply the balanced feature indexing KNN to nonlinear semi-supervised classification. Finally, we propose a graph-theoretic clustering algorithm (the WGC algorithm) based on the work of Hartuv and Shamir. The research is described further below:

1. Text feature extraction algorithm based on sentence elements. During text feature extraction, we often encounter terms that are unrelated to the subject. Because different sentence elements play different roles in expressing a subject, we use syntactic analysis to label sentence elements and then propose a feature extraction algorithm based on sentence elements. Experimental results show that the algorithm not only effectively filters out some terms that are unrelated to the subject, but also avoids the disadvantages of using a stop-word list and part-of-speech filtering.

2. Research on a balanced feature selection algorithm. To deal with the problems that the assumption about data classes does not hold in practice and that the data are skewed, we analyze the text classification objective function and then propose a balanced feature selection algorithm. We demonstrate the validity and effectiveness of the algorithm both theoretically and experimentally.
We also propose a method for calculating the lower-limit dimensionality when a feature selection function selects features from a given document set. A non-average dimensionality feature selection algorithm for the lower-limit dimension case is also proposed.

3. Research on a high-performance text classification algorithm. To improve speed by reducing the matching between unlabeled samples and unrelated vector sets, we use the feature sets as the classification index of unlabeled documents and propose a KNN classification algorithm based on feature space indexing. Experimental results show that increasing the dimensionality has little effect on classification time. In addition, to improve classification accuracy, we construct compensation feature sets, which contain features that are not included in the feature sets but have some classification ability, and propose a feature compensation KNN algorithm. Finally, combining balanced feature selection and robust path regularization, we realize nonlinear semi-supervised classification.

4. Clustering algorithm for weighted graphs based on minimum cut. Building on the work of Hartuv and Shamir, we propose a graph-theoretic clustering algorithm for weighted graphs based on minimum cut (WGC). The algorithm has the advantages of many existing algorithms: low polynomial complexity, provable properties, and automatic determination of the number of clusters during clustering.
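
The feature-space indexing idea in item 3 can be illustrated with a minimal sketch. This is not the dissertation's algorithm, whose details the abstract does not give; it only shows the general technique of using an inverted index from features to training documents so that an unlabeled document is compared only with training documents that share at least one feature with it. The function names (build_index, candidate_ids, knn_classify), the cosine similarity measure, the similarity-weighted vote, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math


def build_index(train_docs):
    """Map each feature to the ids of the training documents that contain it."""
    index = defaultdict(set)
    for doc_id, (features, _label) in enumerate(train_docs):
        for feature in features:
            index[feature].add(doc_id)
    return index


def candidate_ids(query, index):
    """Ids of training documents sharing at least one feature with the query."""
    ids = set()
    for feature in query:
        ids |= index.get(feature, set())
    return ids


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b[f] for f, w in a.items() if f in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, train_docs, index, k=3):
    """Label the query by a similarity-weighted vote of its k nearest candidates.

    Only indexed candidates are scored, so unrelated training documents
    are never compared with the query. Returns None if there are no candidates.
    """
    scored = sorted(
        ((cosine(query, train_docs[i][0]), i) for i in candidate_ids(query, index)),
        reverse=True,
    )[:k]
    votes = Counter()
    for similarity, i in scored:
        votes[train_docs[i][1]] += similarity
    return votes.most_common(1)[0][0] if votes else None


# Toy training set (hypothetical): (sparse feature vector, class label).
train = [
    ({"football": 1.0, "match": 1.0}, "sports"),
    ({"football": 1.0, "goal": 2.0}, "sports"),
    ({"stock": 1.0, "market": 1.0}, "finance"),
    ({"market": 1.0, "bank": 1.0}, "finance"),
]
index = build_index(train)
print(knn_classify({"goal": 1.0, "football": 1.0}, train, index))  # sports
```

In this sketch the number of similarity computations depends on how many training documents share features with the query, not on the size of the whole training set, which is the kind of saving the indexing step is meant to provide.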
Keywords/Search Tags: Feature Extraction, Feature Selection, Dimensionality Reduction, Text Categorization, Semi-supervised Learning, Text Clustering