
Semi-Supervised Classification Based on Positive and Unlabeled Samples

Posted on: 2010-11-23
Degree: Master
Type: Thesis
Country: China
Candidate: X Fan
Full Text: PDF
GTID: 2208360275996323
Subject: Computer application technology
Abstract/Summary:
Text categorization, or classification, is the automated assignment of text documents to pre-defined classes and an important way to organize and manage text. Traditional machine learning approaches to classification fall into supervised and unsupervised learning. Supervised learning has two major phases, training and testing: after some set of documents is manually labeled with pre-defined categories or classes, a learning algorithm is used to produce a classifier. Many supervised learning techniques have been proposed by researchers, e.g., the Rocchio algorithm, Naive Bayes, and support vector machines. Since labeling is done manually, it is labor-intensive and time-consuming. In unsupervised learning, the machine learns from inputs alone, receiving neither supervised target outputs nor rewards from its environment, but its accuracy is poor. Semi-supervised learning overcomes the shortcomings of both: with a small number of labeled documents and a large pool of unlabeled documents, it performs better. Machine learning theory assumes that texts are selected randomly and that the training and test data distributions are identical. In practice, this assumption may not hold in the semi-supervised setting: the unlabeled examples usually come from a different distribution and a complicated environment, so the small number of training examples is not adequate to support the search for a function in the hypothesis space. This thesis proposes more effective and robust techniques for learning from positive and unlabeled examples.

The main work of this thesis is as follows:

1. Label documents that involve latent class variables using a Bayesian latent semantic model combined with the iterative EM algorithm. A related clustering algorithm is introduced to initialize the training examples and the EM maximum likelihood estimation.

2. Use an entropy-based method with a Gaussian distribution to generate frequencies for unlabeled examples. Based on information entropy, we determine a feature's distribution and whether a document is suitable for the classifier. The entropy values show the relative discriminatory power of the word features: the larger a feature's entropy, the more likely it is to share a distribution with the training examples.

3. Propose two kinds of active learning strategies, integrated with our classifier, to determine whether a document among the unlabeled examples is suitable for the classifier, i.e., which unlabeled documents can be used as new training documents.

4. Present a novel, efficient method for BBS sentiment classification using a maximum entropy model. We studied semantic tendency identification, used the maximum entropy model to decide whether words carry semantic tendency, selected the words with polarity as features, and built a support vector machine classifier.
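As a rough illustration of the entropy-based feature weighting described in contribution 2, the sketch below computes the Shannon entropy of a word's occurrence distribution over classes. The corpus counts and function name are illustrative assumptions, not taken from the thesis.

```python
import math
from collections import Counter

def feature_entropy(class_counts):
    """Shannon entropy (base 2) of a word's occurrence distribution
    over classes. A word spread evenly across classes has high entropy;
    a word concentrated in one class has low entropy."""
    total = sum(class_counts.values())
    entropy = 0.0
    for count in class_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

# Hypothetical counts: a word appearing evenly vs. one concentrated
# in a single class.
even = feature_entropy(Counter({"tech": 10, "sports": 10}))
skewed = feature_entropy(Counter({"tech": 1, "sports": 19}))
```

A uniform distribution over two classes gives the maximum entropy of 1 bit, while a highly skewed distribution gives a value close to 0, which is the contrast the thesis uses to rank word features.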
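Contribution 3's question of which unlabeled documents to add to the training set can be sketched with uncertainty sampling, one common active-learning strategy. The thesis's actual selection criteria are not reproduced here, so the scoring scheme and names below are assumptions for illustration.

```python
def select_most_uncertain(unlabeled_scores, k=1):
    """Pick the k documents whose positive-class probability is
    closest to 0.5, i.e. where the current classifier is least
    certain. `unlabeled_scores` maps doc id -> P(positive)."""
    ranked = sorted(unlabeled_scores.items(),
                    key=lambda item: abs(item[1] - 0.5))
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical classifier outputs on four unlabeled documents.
scores = {"d1": 0.95, "d2": 0.52, "d3": 0.10, "d4": 0.48}
picked = select_most_uncertain(scores, k=2)
```

Documents the classifier already labels confidently (d1, d3) are skipped; the borderline ones are the candidates worth turning into new training documents.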
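Contribution 4 selects words with polarity as features before training an SVM. The minimal sketch below shows only the feature-selection step: the lexicon is a hypothetical stand-in for the polarity words the thesis derives with its maximum entropy model, and a plain polarity sum stands in for the SVM.

```python
# Hypothetical polarity lexicon; the thesis instead identifies
# semantic tendency with a maximum entropy model.
POLARITY = {"great": 1, "good": 1, "bad": -1, "terrible": -1}

def sentiment_features(post):
    """Keep only the words with known polarity as features."""
    return [w for w in post.lower().split() if w in POLARITY]

def classify(post):
    """Stand-in for the SVM: sign of the summed polarity scores."""
    score = sum(POLARITY[w] for w in sentiment_features(post))
    return "positive" if score > 0 else "negative"

label = classify("The service was great and the food was good")
```

Restricting the feature set to polarity-bearing words keeps the classifier focused on sentiment signal rather than topical vocabulary, which is the design choice the abstract describes.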
Keywords/Search Tags: text classification, semi-supervised, information entropy, active learning, sentiment classification, machine learning