
Research on Instance-Based and Feature-Based Transfer Learning for Text Classification

Posted on: 2016-05-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Wei
Full Text: PDF
GTID: 2308330461468317
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of network and information technology, the volume of information on the Internet has been growing exponentially. How to extract the necessary information from massive amounts of data quickly and efficiently has become an important and demanding problem. Text classification, the process of assigning predefined categories to documents, is a significant tool for addressing this problem. Text classification is usually performed with machine learning methods, which require the training data and the test data to follow the same distribution. In practice this requirement is often not met: as time passes or the application scenario changes, the training data become outdated, the distributions of the training and test data diverge, and the learned classification model is no longer applicable. Transfer learning, as a new learning paradigm, can solve this problem effectively. Taking text classification as the application background, this thesis focuses on instance-based and feature-based transfer learning and proposes two transfer learning methods suitable for text classification.

The instance-based transfer learning method TrAdaBoost retains many source instances that are dissimilar to the target data throughout training. To address this problem, a transfer learning method based on training set optimization and dynamic reconstruction is proposed. The method first clusters the training set, so that instances within a cluster are highly similar while instances in different clusters are dissimilar. Source instances that do not fall into any cluster containing target instances are removed; this step is the training set optimization. TrAdaBoost is then run on the optimized training set with a lowest weight threshold: source instances whose weights drop below this threshold are deleted dynamically, while source data are kept at a certain proportion of the training set. Experimental results show that the proposed method removes source instances that contribute little to training and improves the accuracy of text classification.

The feature-based transfer learning method TPLSA considers only the topics shared by the two domains and ignores the domain-specific topics. To overcome this shortcoming, a transfer learning method based on domain semantic correlation mining is proposed. The method consists of three steps. First, it mines the common topics and the domain-specific topics, computes the similarity between common and specific topics and the correlation between the domain-specific topics, and obtains a mapping matrix. Second, it constructs a new feature space and represents the source-domain texts in this space; each representation has two parts, the distribution of the text over the common topics and its distribution over the source-specific topics. A target-domain text is represented in the same way, except that its distribution over the target-specific topics is mapped onto the source-specific topics. Finally, a classifier is trained on the source-domain data and used to classify the target-domain texts in the new feature space. Experimental results show that the proposed method considers the common and domain-specific topics simultaneously, overcoming the shortcoming of TPLSA, and achieves higher classification accuracy than TPLSA.

The proposed methods are evaluated on the SRAA, 20 Newsgroups, and Reuters-21578 data sets. Compared with non-transfer learning methods, the results verify the effectiveness of transfer learning; compared with the transfer learning methods TrAdaBoost and TPLSA, they demonstrate the superiority and feasibility of the proposed methods.
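To make the instance-based method more concrete, the following is a minimal sketch (in Python, not the thesis implementation) of training set optimization followed by a TrAdaBoost-style loop with dynamic removal of low-weight source instances. It assumes dense TF-IDF feature matrices and binary 0/1 labels; names such as weight_floor and min_source_ratio are illustrative assumptions, and scikit-learn's KMeans and a shallow decision tree stand in for the clustering algorithm and base learner, which the abstract does not specify.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier


def optimize_training_set(X_src, y_src, X_tgt, n_clusters=20, seed=0):
    """Training set optimization: keep only source documents that share a
    cluster with at least one target document."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(np.vstack([X_src, X_tgt]))
    src_labels, tgt_labels = labels[:len(X_src)], labels[len(X_src):]
    keep = np.isin(src_labels, np.unique(tgt_labels))
    return X_src[keep], y_src[keep]


def tradaboost_with_pruning(X_src, y_src, X_tgt, y_tgt, X_test,
                            n_iters=20, weight_floor=1e-4, min_source_ratio=0.2):
    """TrAdaBoost-style boosting that additionally deletes source instances whose
    weight falls below `weight_floor`, while keeping source data at a minimum
    share of the training set (both thresholds are illustrative assumptions)."""
    X = np.vstack([X_src, X_tgt])
    y = np.concatenate([y_src, y_tgt])
    is_src = np.concatenate([np.ones(len(X_src), bool), np.zeros(len(X_tgt), bool)])
    w = np.ones(len(X)) / len(X)
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(is_src.sum()) / n_iters))

    learners, betas = [], []
    for _ in range(n_iters):
        clf = DecisionTreeClassifier(max_depth=3).fit(X, y, sample_weight=w / w.sum())
        err = np.abs(clf.predict(X) - y)          # 0/1 labels assumed
        tgt = ~is_src
        eps = np.clip(np.sum(w[tgt] * err[tgt]) / np.sum(w[tgt]), 1e-10, 0.49)
        beta_t = eps / (1.0 - eps)

        # TrAdaBoost update: shrink misclassified source weights, grow misclassified target weights.
        w[is_src] *= beta_src ** err[is_src]
        w[tgt] *= beta_t ** (-err[tgt])

        # Dynamic reconstruction: discard near-zero-weight source rows if enough source data remain.
        drop = is_src & (w < weight_floor)
        if drop.any() and (is_src.sum() - drop.sum()) >= min_source_ratio * len(X):
            X, y, w, is_src = X[~drop], y[~drop], w[~drop], is_src[~drop]

        learners.append(clf)
        betas.append(beta_t)

    # Final hypothesis: weighted vote over the second half of the iterations, as in TrAdaBoost.
    half = len(learners) // 2
    votes = sum(np.log(1.0 / b) * h.predict(X_test)
                for h, b in zip(learners[half:], betas[half:]))
    return (votes >= 0.5 * sum(np.log(1.0 / b) for b in betas[half:])).astype(int)
```

In use, optimize_training_set would be applied first and its output passed to tradaboost_with_pruning together with the small labeled target set; the pruning step mirrors the dynamic reconstruction idea of deleting source documents once boosting has driven their weights near zero.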
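A similarly hedged sketch of the feature-based method, assuming NMF as a stand-in for the PLSA-style topic model: common topics are learned from the pooled corpus, domain-specific topics from each domain separately, and a row-normalized cosine-similarity matrix maps target-specific topic weights onto the source-specific topics before a classifier trained on the source representation labels the target texts. Parameter names such as n_common and n_specific are illustrative, not from the thesis.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity


def topic_transfer_classify(src_docs, src_labels, tgt_docs,
                            n_common=20, n_specific=10, seed=0):
    """Classify target-domain texts in a feature space built from common and
    domain-specific topics; NMF stands in for the PLSA-style topic model."""
    vec = TfidfVectorizer(max_features=5000, stop_words="english")
    X_all = vec.fit_transform(list(src_docs) + list(tgt_docs))
    n_src = len(src_docs)
    X_src, X_tgt = X_all[:n_src], X_all[n_src:]

    # Common topics: learned on the pooled corpus so both domains share them.
    theta_common = NMF(n_components=n_common, random_state=seed,
                       max_iter=400).fit_transform(X_all)

    # Domain-specific topics: learned on each domain separately.
    nmf_src = NMF(n_components=n_specific, random_state=seed, max_iter=400)
    theta_spec_src = nmf_src.fit_transform(X_src)
    nmf_tgt = NMF(n_components=n_specific, random_state=seed, max_iter=400)
    theta_spec_tgt = nmf_tgt.fit_transform(X_tgt)

    # Mapping matrix: cosine similarity between the word distributions of the
    # target-specific and source-specific topics, row-normalized so each target
    # topic is expressed as a mixture of source-specific topics.
    M = cosine_similarity(nmf_tgt.components_, nmf_src.components_)
    M /= M.sum(axis=1, keepdims=True) + 1e-12

    # New feature space: [common-topic distribution | (mapped) specific-topic distribution].
    Z_src = np.hstack([theta_common[:n_src], theta_spec_src])
    Z_tgt = np.hstack([theta_common[n_src:], theta_spec_tgt @ M])

    # Train on the source domain, classify the target domain in the new space.
    clf = LogisticRegression(max_iter=1000).fit(Z_src, src_labels)
    return clf.predict(Z_tgt)
```

The similarity between common and specific topics, which the abstract mentions in the first step, is omitted from this sketch for brevity.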
Keywords/Search Tags: Text Classification, Transfer Learning, Source Domain, Target Domain