
Semi-Supervised Classification Based on Positive and Unlabeled Samples

Posted on: 2010-11-23
Degree: Master
Type: Thesis
Country: China
Candidate: X Fan
Full Text: PDF
GTID: 2208360275996323
Subject: Computer application technology
Abstract/Summary:
Text categorization, or classification, is the automated assignment of text documents to pre-defined classes and an important way to organize and manage text. Traditional machine learning approaches to classification fall into supervised and unsupervised learning. Supervised learning has two major phases, training and testing: after some set of documents is manually labeled with pre-defined categories or classes, a learning algorithm is used to produce a classifier. Many supervised learning techniques have been proposed by researchers, e.g., the Rocchio algorithm, Naive Bayes, and support vector machines. Since labeling is done manually, it is labor-intensive and time-consuming. In unsupervised learning, the machine learns from inputs alone, receiving neither supervised target outputs nor rewards from its environment, but its accuracy is poor. Semi-supervised learning overcomes the shortcomings of both: with a small number of labeled documents and a large pool of unlabeled documents, it performs better. Machine learning theory assumes that texts are selected randomly and that the training and test data distributions are identical. In practice, this assumption may not hold in the semi-supervised setting: the unlabeled examples usually come from a different distribution and a complicated environment, so the small number of training examples is not adequate to support the search for a function in the hypothesis space. This thesis proposes more effective and robust techniques for learning from positive and unlabeled examples.

The main work of this thesis is as follows:

1. Label documents that involve latent class variables using a Bayesian latent semantic model combined with the iterative EM algorithm. A related clustering algorithm is introduced to initialize the training examples and the EM maximum likelihood estimation.

2. Use an entropy-based method with a Gaussian distribution to generate frequencies for unlabeled examples. Based on information entropy, we determine a feature's distribution and whether a document is suitable for the classifier. The entropy values show the relative discriminatory power of the word features: the larger a feature's entropy, the more likely it is to share a distribution with the training examples.

3. Propose two kinds of active learning strategies, integrated with our classifier, to determine whether a document among the unlabeled examples is suitable for the classifier, i.e., which unlabeled documents can be used as new training documents.

4. Present a novel, efficient method for BBS sentiment classification using a maximum entropy model. We studied semantic tendency identification, used the maximum entropy model to decide whether words carry semantic tendency, selected the words with polarity as features, and built a support vector machine classifier.
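As a rough illustration of the entropy-based feature weighting described in contribution 2, the sketch below computes the Shannon entropy of a word's occurrence distribution over classes. The corpus counts and function name are illustrative assumptions, not taken from the thesis.

```python
import math
from collections import Counter

def feature_entropy(class_counts):
    """Shannon entropy (base 2) of a word's occurrence distribution
    over classes. A word spread evenly across classes has high entropy;
    a word concentrated in one class has low entropy."""
    total = sum(class_counts.values())
    entropy = 0.0
    for count in class_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

# Hypothetical counts: a word appearing evenly vs. one concentrated
# in a single class.
even = feature_entropy(Counter({"tech": 10, "sports": 10}))
skewed = feature_entropy(Counter({"tech": 1, "sports": 19}))
```

A uniform distribution over two classes gives the maximum entropy of 1 bit, while a highly skewed distribution gives a value close to 0, which is the contrast the thesis uses to rank word features.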
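Contribution 3's question of which unlabeled documents to add to the training set can be sketched with uncertainty sampling, one common active-learning strategy. The thesis's actual selection criteria are not reproduced here, so the scoring scheme and names below are assumptions for illustration.

```python
def select_most_uncertain(unlabeled_scores, k=1):
    """Pick the k documents whose positive-class probability is
    closest to 0.5, i.e. where the current classifier is least
    certain. `unlabeled_scores` maps doc id -> P(positive)."""
    ranked = sorted(unlabeled_scores.items(),
                    key=lambda item: abs(item[1] - 0.5))
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical classifier outputs on four unlabeled documents.
scores = {"d1": 0.95, "d2": 0.52, "d3": 0.10, "d4": 0.48}
picked = select_most_uncertain(scores, k=2)
```

Documents the classifier already labels confidently (d1, d3) are skipped; the borderline ones are the candidates worth turning into new training documents.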
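Contribution 4 selects words with polarity as features before training an SVM. The minimal sketch below shows only the feature-selection step: the lexicon is a hypothetical stand-in for the polarity words the thesis derives with its maximum entropy model, and a plain polarity sum stands in for the SVM.

```python
# Hypothetical polarity lexicon; the thesis instead identifies
# semantic tendency with a maximum entropy model.
POLARITY = {"great": 1, "good": 1, "bad": -1, "terrible": -1}

def sentiment_features(post):
    """Keep only the words with known polarity as features."""
    return [w for w in post.lower().split() if w in POLARITY]

def classify(post):
    """Stand-in for the SVM: sign of the summed polarity scores."""
    score = sum(POLARITY[w] for w in sentiment_features(post))
    return "positive" if score > 0 else "negative"

label = classify("The service was great and the food was good")
```

Restricting the feature set to polarity-bearing words keeps the classifier focused on sentiment signal rather than topical vocabulary, which is the design choice the abstract describes.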
Keywords/Search Tags: text classification, semi-supervised, information entropy, active learning, sentiment classification, machine learning