
The Text Classification Improvement Research Of Transductive Transfer Learning Algorithm Based On TrAdaBoost

Posted on: 2017-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: X X Li
GTID: 2348330503972498
Subject: Computer technology
Abstract/Summary:
Text classification (TC) based on transfer learning can train accurate classifiers when labeled data are scarce and the training and test sets follow different distributions. It addresses the shortage of labeled data in the era of big data and reduces labeling costs, which has made it a current research focus.

This thesis studies two aspects: feature selection (FS) methods and TC algorithms based on transfer learning. The goal is to build a transfer learning classifier, denoted TrSN, that is more reliable than TrN (TrAdaBoost with Naïve Bayes) and TrS (TrAdaBoost with Support Vector Machine) when the training and test sets follow different distributions. FS is the core of the pre-processing stage and directly affects TC performance. We propose a novel FS method, categorical document frequency divided by categorical frequency (CDFDC), which is based on categorical document frequency (CDF). TrAdaBoost is a commonly used TC algorithm framework based on document weights. We designed and implemented Naïve Bayes with document weights (dwNB) and SVM with document weights (dwSVM), and we propose a new method that integrates dwNB and dwSVM with TrAdaBoost.

CDFDC was evaluated on the 20 Newsgroups dataset, with training and test sets drawn from the same distribution and NB as the classifier. The results show that CDFDC gives the best classification performance among the 6 FS methods compared: when more than 3,000 features are selected, its classification precision reaches 0.77, with a running time slightly longer than that of CHI, the fastest method.

TrSN was also evaluated on 20 Newsgroups, with training and test distributions that are related but different. TrSN's classification precision is 0.94 on average, and its classification time is between 400 and 1036 seconds. Its precision is the highest among the 7 classification methods in the experiment. Compared with TrS, TrSN runs much faster and is much more precise; its precision is also higher than that of TrN. In summary, TrSN, which integrates TrAdaBoost with two base TC algorithms (SVM and NB), outperforms TrAdaBoost with a single base algorithm (TrS and TrN) when labeled data are scarce and the training and test distributions are related but different.
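The abstract does not spell out how document weights evolve across boosting rounds. As a rough illustration only, the following sketch shows the re-weighting step of standard TrAdaBoost as it is commonly described: misclassified source-domain documents are down-weighted by a fixed factor, and misclassified target-domain documents are up-weighted by a factor derived from the weighted error on the target domain. The function name `tradaboost_update`, its parameters, and the clamping of the error are assumptions made for this sketch; the thesis's TrSN variant may differ.

```python
import math


def tradaboost_update(weights, mistakes, n_source, n_rounds):
    """One re-weighting round of standard TrAdaBoost (illustrative sketch).

    weights  : current document weights, source-domain documents first
    mistakes : 1 if the base classifier misclassified the document, else 0
    n_source : number of source-domain (different-distribution) documents
    n_rounds : total number of boosting rounds
    Returns (new_weights, beta_t).
    """
    if len(weights) != len(mistakes):
        raise ValueError("weights and mistakes must have the same length")
    if not 1 <= n_source < len(weights):
        raise ValueError("need at least one source and one target document")

    target_w = weights[n_source:]
    target_m = mistakes[n_source:]

    # Weighted error is measured on target-domain documents only.
    eps = sum(w * m for w, m in zip(target_w, target_m)) / sum(target_w)
    # Assumption of this sketch: clamp eps so that beta_t stays in (0, 1).
    eps = min(max(eps, 1e-10), 0.499)

    beta_t = eps / (1.0 - eps)  # changes every round
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_source) / n_rounds))  # fixed

    new_weights = []
    for i, (w, m) in enumerate(zip(weights, mistakes)):
        if i < n_source:
            # Source document: shrink its weight if it was misclassified.
            new_weights.append(w * beta ** m)
        else:
            # Target document: grow its weight if it was misclassified.
            new_weights.append(w * beta_t ** (-m))
    return new_weights, beta_t


# Usage: 2 source documents followed by 4 target documents, 10 rounds.
new_w, beta_t = tradaboost_update(
    [1.0] * 6, [1, 0, 0, 0, 0, 1], n_source=2, n_rounds=10
)
```

In a full training loop, the updated weights would be normalized and passed to a base learner that accepts per-document weights, which is the role dwNB and dwSVM play in the abstract.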
Keywords/Search Tags: Transfer Learning, Text Classification, TrAdaBoost, Feature Selection