The Research On Marine Literature Categorization And Labeling Minimization

Posted on: 2012-04-23
Degree: Master
Type: Thesis
Country: China
Candidate: Q H Wang
GTID: 2218330338964967
Subject: Computer software and theory
Abstract/Summary:
Automatic text categorization is an important research area in data mining and machine learning. In this thesis we apply automatic text categorization techniques to the problem of marine literature categorization. With the development of marine science and information technology, the volume of literature in the marine disciplines has grown large, and retrieving and categorizing it has become an urgent problem. Processing this information by hand is inefficient and time-consuming. In practice, labeled samples are scarce and difficult to obtain, whereas unlabeled samples are abundant and easy to obtain but are left unused. Unlabeled samples cannot be used directly to train a traditional classifier, but their data structure and distribution can be analyzed. If machine learning methods can make full use of this information, the performance of the categorization algorithm can be improved. We therefore use semi-supervised learning and active learning to solve this problem.

Traditional machine learning methods fall into two kinds: supervised and unsupervised. Supervised learning requires every sample in the training set to be labeled, whereas unsupervised learning requires only an unlabeled training set. Supervised learning gives better results, but it often requires hundreds or even thousands of labeled training samples. Unsupervised learning, in contrast, wastes valuable labeled resources and cannot ensure categorization accuracy. Semi-supervised learning is therefore a better choice. It studies how to combine a large number of unlabeled samples with a few labeled samples in order to enhance the classifier's generalization ability.
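To make the idea of combining a few labeled samples with many unlabeled ones concrete, here is a minimal co-training sketch (co-training is the semi-supervised method named in the keywords). It is an illustration under stated assumptions, not the thesis's implementation: the two views (hypothetical term counts from a document's title and from its abstract), the toy data, the nearest-centroid classifier, and the names `fit`, `predict`, and `co_train` are all invented for this example.

```python
from math import dist

def fit(samples, view):
    """Nearest-centroid model on one view.
    samples: list of ((view0_vector, view1_vector), label)."""
    groups = {}
    for views, label in samples:
        groups.setdefault(label, []).append(views[view])
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()}

def predict(model, vec):
    """Return (label, margin). The margin is the gap between the two
    nearest centroids; a larger gap means a more confident prediction.
    Assumes the model holds at least two classes."""
    ranked = sorted((dist(vec, c), label) for label, c in model.items())
    return ranked[0][1], ranked[1][0] - ranked[0][0]

def co_train(labeled, unlabeled, per_round=1, rounds=5):
    """Each view's classifier in turn labels its most confident
    unlabeled samples and adds them to the shared labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        for view in (0, 1):
            model = fit(labeled, view)
            # Most confident samples first, judged by this view only.
            pool.sort(key=lambda x: predict(model, x[view])[1], reverse=True)
            picked, pool = pool[:per_round], pool[per_round:]
            labeled += [(x, predict(model, x[view])[0]) for x in picked]
    return labeled

# Hypothetical features: (marine-term count, other-term count) per view.
seed = [(((3.0, 0.0), (6.0, 1.0)), "marine"),
        (((0.0, 3.0), (1.0, 6.0)), "other")]
unlabeled = [((4.0, 0.0), (3.0, 3.0)),   # clear title, ambiguous abstract
             ((1.0, 1.0), (0.0, 7.0)),   # ambiguous title, clear abstract
             ((2.0, 0.0), (5.0, 2.0)),
             ((0.0, 2.0), (2.0, 4.0))]

all_labeled = co_train(seed, unlabeled)
final = {views: label for views, label in all_labeled}
```

In this toy run each view resolves the samples that are ambiguous in the other view, which is the intuition behind using two views. The values of `per_round` and `rounds` are arbitrary here.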
By tuning the parameters of the co-training algorithm, we can obtain better results.

The semi-supervised method can use unlabeled samples to build the categorization model, but it sometimes selects the wrong samples. Active learning can lower bias and variance by selecting the right samples. It focuses on letting the learner use its own capacity to achieve effective categorization performance in fewer steps and at the lowest cost. Combining semi-supervised learning with active learning yields better categorization performance.

The purpose of this thesis is to construct a marine literature categorization system that minimizes the number of labeled samples. The main work of this thesis is as follows:

(1) Build a text categorization model that separates marine literature from non-marine literature, which is a binary text categorization problem.
(2) Build a fine-grained text categorization model that automatically classifies marine literature into subclasses, which is a multi-class text categorization problem.
(3) Based on the models above, build a complex text categorization system that minimizes the number of labeled samples. By introducing semi-supervised learning, we constructed a marine literature categorization system that uses only a few labeled samples; to improve categorization accuracy, we combine semi-supervised learning with active learning.

The implemented system will help improve the efficiency of marine literature search and retrieve more literature in marine fields, so that these resources can be fully used.
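The active-learning idea above, spending the labeling budget on the samples the classifier most needs, can be sketched with uncertainty sampling. This is one common selection strategy, and the abstract does not say which strategy the thesis uses. The nearest-centroid classifier, the toy feature vectors, the `oracle` function standing in for a human annotator, and the names `fit_centroids`, `margin`, and `active_learn` are assumptions made for illustration only.

```python
from math import dist

def fit_centroids(samples):
    """samples: list of (vector, label). Returns {label: mean vector}."""
    groups = {}
    for vec, label in samples:
        groups.setdefault(label, []).append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()}

def margin(cents, vec):
    """Gap between the two smallest centroid distances.
    A small gap means the classifier is uncertain about vec.
    Assumes at least two classes are present."""
    d = sorted(dist(vec, c) for c in cents.values())
    return d[1] - d[0]

def active_learn(labeled, pool, oracle, budget):
    """Repeatedly ask the oracle to label the most uncertain sample."""
    labeled, pool = list(labeled), list(pool)
    queried = []
    for _ in range(min(budget, len(pool))):
        cents = fit_centroids(labeled)
        pick = min(pool, key=lambda v: margin(cents, v))
        pool.remove(pick)
        labeled.append((pick, oracle(pick)))  # one unit of labeling cost
        queried.append(pick)
    return fit_centroids(labeled), queried

# Hypothetical features: (marine-term count, other-term count).
seed = [((5.0, 0.0), "marine"), ((0.0, 5.0), "other")]
pool = [(4.0, 1.0), (2.6, 2.4), (1.0, 4.0), (2.0, 3.0)]

def oracle(vec):
    """Stand-in for a human annotator (an assumption of this sketch)."""
    return "marine" if vec[0] > vec[1] else "other"

model, queried = active_learn(seed, pool, oracle, budget=2)
```

With a budget of two queries, the sketch asks about the two borderline documents and leaves the clear-cut ones unlabeled, which is how active learning keeps the number of hand-labeled samples low.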
Keywords/Search Tags: Marine literature categorization, Maximum Entropy, SVM, J48, Semi-supervised learning, Co-training, Active learning