The Research Of Minimum Label Problem In Marine Literature Classification

Posted on:2010-03-31

Degree:Master

Type:Thesis

Country:China

Candidate:Y H Jiang

Full Text:PDF

GTID:2178360275985950

Subject:Computer software and theory

Abstract/Summary:

Classifying marine literature efficiently is important for research in marine subjects. At present, text categorization based on supervised learning is a mature technology for this problem. However, supervised learning requires a large amount of labeling work: labeled data are costly to produce, while large numbers of unlabeled documents go unused. Semi-supervised machine learning can use a small amount of labeled data to extract useful information from a large amount of unlabeled data, and so reduces the manual labeling work. This thesis therefore applies semi-supervised machine learning to the minimum label problem in marine literature classification.

The thesis first describes the basic concepts of text categorization and machine learning, and then introduces, one by one, the three basic technologies of text categorization based on machine learning. It selects the most suitable classification algorithm based on the results of algorithm-comparison experiments. It then describes the semi-supervised machine learning problem and introduces the core algorithm of the thesis, co-training. Finally, it implements minimum-label marine literature classification based on co-training in C#.NET.

The main work and innovations of this thesis are as follows:

(1) The thesis presents the detailed workflow of marine literature classification based on co-training. It also designs six functional modules: text preprocessing, feature splitting, training, prediction, feature selection, and evaluation. The feature splitting module is specific to the co-training algorithm and is the key part of this work.

(2) The thesis splits the features into two views by attaching a tag to each feature, and then trains two different classifiers, one on each view. A series of experiments then determines the number of co-training iterations and the appropriate number of samples in the pool so that the classifier performs well.

(3) Finally, the thesis compares co-training with supervised learning. Experiments show that even with only 2 labeled samples in the training set, the F1 value and error rate of the classification system reach about 85.88% and 14.35%, respectively. These figures are close to the performance of a supervised classifier (90.20% and 9.13%) trained on more than 1500 labeled samples. The results show that applying co-training to marine literature classification significantly reduces manual labeling work while still performing well, which makes it well suited to practical applications.
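The co-training procedure summarized in the abstract (tag each feature with a view, train one classifier per view, and let the classifiers label samples drawn from a pool of unlabeled documents) can be illustrated with a minimal Python sketch. This is not the thesis's C#.NET implementation. The title/body view split, the Naive Bayes base learner, the toy documents, and the names `tag_views`, `fit_views`, `co_train`, `predict`, `pool_size`, and `min_conf` are all assumptions made for illustration; the abstract does not state how the views were chosen or which classifier was used.

```python
import math
from collections import Counter


def tag_views(title_tokens, body_tokens):
    """Tag every feature with its view (assumed rule: title vs. body words)."""
    return ([f"T:{t}" for t in title_tokens], [f"B:{t}" for t in body_tokens])


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (assumed base learner)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc)
        self.vocab = set().union(*self.counts.values())
        return self

    def predict_proba(self, doc):
        n_docs = sum(self.prior.values())
        log_p = {}
        for c in self.classes:
            total = sum(self.counts[c].values())
            log_p[c] = math.log(self.prior[c] / n_docs) + sum(
                math.log((self.counts[c][w] + 1) / (total + len(self.vocab)))
                for w in doc if w in self.vocab  # features unseen in training are ignored
            )
        top = max(log_p.values())
        norm = sum(math.exp(v - top) for v in log_p.values())
        return {c: math.exp(v - top) / norm for c, v in log_p.items()}


def fit_views(labeled):
    """Train one classifier per view on the current labeled set."""
    labels = [y for _, y in labeled]
    return [NaiveBayes().fit([doc[v] for doc, _ in labeled], labels) for v in (0, 1)]


def co_train(labeled, unlabeled, iterations=5, pool_size=4, min_conf=0.6):
    """Grow the labeled set with confident pseudo-labels from each view."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(iterations):
        pool, unlabeled = unlabeled[:pool_size], unlabeled[pool_size:]
        if not pool:
            break
        picked = {}  # pool index -> pseudo-label (view A wins if the views disagree)
        for view, clf in enumerate(fit_views(labeled)):
            best = {}  # class -> (confidence, pool index) of the most confident sample
            for i, doc in enumerate(pool):
                proba = clf.predict_proba(doc[view])
                label = max(proba, key=proba.get)
                if proba[label] >= min_conf and proba[label] > best.get(label, (0.0, -1))[0]:
                    best[label] = (proba[label], i)
            for label, (_, i) in best.items():
                picked.setdefault(i, label)
        labeled += [(pool[i], y) for i, y in picked.items()]
        unlabeled += [doc for i, doc in enumerate(pool) if i not in picked]
    return fit_views(labeled)


def predict(clfs, doc):
    """Combine both views by multiplying their class probabilities."""
    pa, pb = (clf.predict_proba(doc[v]) for v, clf in enumerate(clfs))
    return max(pa, key=lambda c: pa[c] * pb[c])


# Toy data (invented for illustration): two labeled and four unlabeled documents.
labeled = [
    (tag_views(["ocean", "tide"], ["salinity", "current", "wave"]), "marine"),
    (tag_views(["compiler", "design"], ["parser", "syntax", "code"]), "other"),
]
unlabeled = [
    tag_views(["ocean", "wave"], ["current", "plankton"]),
    tag_views(["compiler", "code"], ["syntax", "grammar"]),
    tag_views(["tide", "plankton"], ["salinity", "reef"]),
    tag_views(["parser", "grammar"], ["code", "lexer"]),
]
clfs = co_train(labeled, unlabeled)
print(predict(clfs, tag_views(["plankton"], ["reef"])))
```

In this sketch the words "plankton" and "reef" never occur in the two labeled documents; the classifiers can only associate them with a class through pseudo-labeled documents taken from the pool. The `iterations` and `pool_size` parameters correspond to the two quantities the thesis tunes experimentally, while `min_conf` is an extra guard added here so that a classifier with no evidence for a sample does not label it.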

Keywords/Search Tags:

marine literature, text categorization, machine learning, semi-supervised learning, co-training

Related items

1	The Research On Marine Literature Categorization And Labeling Minimization
2	Research On Short Text Categorization Based On Semi-Supervised Learning
3	Research On Text Categorization Method Oriented To Content Security
4	Research On High Performance Chinese Text Classification Based On Machine Learning
5	Research On Semi-supervised Text Categorization Method Based On EM Algorithm
6	A Study On Learning From Positive And Unlabeled Examples
7	Research On The Text Classification Based On The Semi-supervised Learning
8	Knowledge Discovery From Biomedical Literature Based On Semantic Resources And Semi-supervised Learning
9	Text Emotion Analysis Technology Based On Semi-Supervised Machine Learning
10	Research On Short Text Categorization Based On Phrase-Like Repeat And Semi-Supervised Learning