The Research Of Minimum Label Problem In Marine Literature Classification

Posted on: 2010-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Jiang
GTID: 2178360275985950
Subject: Computer software and theory
Abstract/Summary:
Efficient classification of marine literature is important for research in marine subjects. At present, text categorization based on supervised learning is a mature technology for this problem. However, supervised learning requires a large amount of manual labeling; labeled data are costly to obtain, while large amounts of unlabeled data go unused. Semi-supervised machine learning can use a small amount of labeled data to extract useful information from a large amount of unlabeled data, and so reduces the manual labeling work. This thesis therefore applies semi-supervised machine learning to the minimum label problem in marine literature classification.

The thesis first describes the basic concepts of text categorization and machine learning, and then introduces, one by one, the three basic technologies of text categorization based on machine learning. It selects the most suitable classification algorithm based on the results of algorithm comparison experiments. It then describes the semi-supervised machine learning problem and introduces the core algorithm of the thesis, co-training. Finally, it implements minimum-label marine literature classification based on co-training in C# on .NET.

The main work and innovations of this thesis are as follows:

(1) It presents the detailed workflow of marine literature classification based on co-training, and designs six functional modules: text preprocessing, feature splitting, training, prediction, feature selection, and evaluation. The feature splitting module is specific to the co-training algorithm and is the key part of this work.

(2) It splits the features into two views by attaching a tag to each feature, and trains two different classifiers, one on each view. It then determines the number of co-training iterations and the appropriate number of samples in the pool through a series of experiments, so as to achieve good classification performance.

(3) Finally, it compares co-training with supervised learning. Experiments show that even with only 2 labeled samples in the training set, the classification system reaches an F1 value of about 85.88% and an error rate of about 14.35%. These are close to the performance of a supervised classifier (90.20% and 9.13%, respectively) trained on more than 1500 labeled samples.

These results show that applying co-training to marine literature classification can significantly reduce manual work while maintaining good performance, making it well suited to practical applications.
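The co-training loop summarized above (tag features into two views, train one classifier per view, let each classifier label its most confident pool samples) can be illustrated with a minimal Python sketch. It is not the thesis's C# implementation. Several parts are assumptions made only for illustration: the tagging rule in `split_views` (token-length parity), the small from-scratch Naive Bayes classifier, the use of the whole unlabeled set as the pool, the values of `iterations` and `per_round`, and the toy documents.

```python
import math
from collections import Counter

def split_views(tokens):
    """Tag every token with a view prefix and return the two views.
    Hypothetical rule: even-length tokens go to view A, odd-length to view B."""
    view_a = ["A:" + t for t in tokens if len(t) % 2 == 0]
    view_b = ["B:" + t for t in tokens if len(t) % 2 == 1]
    return view_a, view_b

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (assumed base learner)."""
    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            self.counts[y].update(doc)
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        return self

    def predict_proba(self, doc):
        n = sum(self.prior.values())
        v = len(self.vocab) + 1  # one extra slot for unseen tokens
        logp = {}
        for c in self.classes:
            total = sum(self.counts[c].values())
            lp = math.log(self.prior[c] / n)
            for t in doc:
                lp += math.log((self.counts[c][t] + 1) / (total + v))
            logp[c] = lp
        m = max(logp.values())
        z = sum(math.exp(x - m) for x in logp.values())
        return {c: math.exp(logp[c] - m) / z for c in self.classes}

def train_pair(labeled):
    """Train one classifier per view on the current labeled set."""
    ys = [y for _, y in labeled]
    clf_a = NaiveBayes().fit([split_views(d)[0] for d, _ in labeled], ys)
    clf_b = NaiveBayes().fit([split_views(d)[1] for d, _ in labeled], ys)
    return clf_a, clf_b

def co_train(labeled, unlabeled, iterations=5, per_round=1):
    """labeled: list of (tokens, label); unlabeled: list of token lists."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(iterations):
        if not pool:
            break
        for view, clf in enumerate(train_pair(labeled)):
            scored = []
            for doc in pool:
                probs = clf.predict_proba(split_views(doc)[view])
                label = max(probs, key=probs.get)
                scored.append((probs[label], label, doc))
            # Move the most confident pool samples into the labeled set.
            scored.sort(key=lambda s: s[0], reverse=True)
            for _, label, doc in scored[:per_round]:
                labeled.append((doc, label))
                pool.remove(doc)
    return train_pair(labeled)

def predict(clf_a, clf_b, tokens):
    """Combine both views by multiplying their class probabilities."""
    a, b = split_views(tokens)
    pa, pb = clf_a.predict_proba(a), clf_b.predict_proba(b)
    return max(pa, key=lambda c: pa[c] * pb[c])

# Toy data: only two labeled samples, one per class.
labeled = [(["fish", "ocean", "reef", "coral"], "marine"),
           (["code", "linux", "disk", "cache"], "other")]
unlabeled = [["fish", "coral", "tide", "algae"],
             ["code", "cache", "file", "shell"],
             ["tide", "algae", "reef", "ocean"],
             ["file", "shell", "disk", "linux"]]
clf_a, clf_b = co_train(labeled, unlabeled)
print(predict(clf_a, clf_b, ["tide", "algae"]))
print(predict(clf_a, clf_b, ["file", "shell"]))
```

In this toy run, the words "tide", "algae", "file", and "shell" never occur in the two labeled samples; the classifiers can only learn them from pool documents that were labeled during co-training. A real system would tune `iterations` and the pool size experimentally, as the abstract describes.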
Keywords/Search Tags: marine literature, text categorization, machine learning, semi-supervised learning, co-training