
Research On Text Classification Model And Algorithm For Small Dataset

Posted on: 2018-08-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Liu
Full Text: PDF
GTID: 1318330512988095
Subject: Computer system architecture
Abstract/Summary:
Machine learning algorithms suffer from difficulties such as low accuracy and heavy computing-resource consumption in text categorization, owing to the explosive growth of text data, which is characterized by high dimensionality and sparsity. On one hand, classification algorithms with high accuracy, such as support vector machines and artificial neural networks, cannot be applied successfully to massive data mining and online problems because of their poor efficiency. On the other hand, algorithms with linear time complexity, such as the centroid-based classifier, naive Bayes, and logistic regression, usually yield low accuracy. We therefore focus on methods for producing "small" (i.e., low-dimensional, few-instance) datasets, which reduce a massive dataset to a manageable size, and on methods for improving the accuracy of linear classification models on such small datasets. Specifically, the contributions of this paper are as follows.

A critical aspect of dimensionality reduction is to properly assess the quality of selected (or produced) feature subsets. We therefore propose a new statistical index, called the LW-index, for evaluating feature subsets and dimensionality-reduction algorithms in general. The proposed method is a "classical statistics" approach that uses the feature subset itself to compute an empirical estimate of its quality. Traditional feature-subset assessment in machine learning splits a given dataset into a training set, used to estimate the parameters of a classification model, and a test set, used to estimate the predictive performance of that model; averaging the results of multiple splits (i.e., cross-validation, CV) is then commonly used to decrease the variance of the estimator. However, the CV scheme is computationally very expensive. Experimental results indicate that the LW-index performs as well as the traditional CV scheme for evaluating dimensionality-reduction
algorithms, and it is more efficient than the traditional methodology.

The wrapper feature-selection method can achieve high classification accuracy. However, the cross-validation scheme used in the wrapper method's evaluation phase is very expensive in terms of computing-resource consumption. We therefore present a new feature-selection method that combines the proposed LW-index with the Sequential Forward Search algorithm (SFS-LW). Extensive experiments show that the proposed method obtains classification accuracy similar to that of the wrapper method with a centroid-based classifier or a support vector machine, at a computational cost close to that of the compared filter methods.

The support vector machine is inefficient or impracticable on large-scale training sets because of its computational difficulties and model complexity. We therefore study the support-vector recognition problem, mainly in the context of reduction methods that reconstruct the training set for a support vector machine. Exploiting the uneven distribution of instances in the vector space, we propose an efficient self-adaptive instance-selection algorithm from the viewpoint of geometry-based methods. Existing instance-selection algorithms based on nearest-neighbor or clustering techniques suffer from serious difficulties, such as memory exhaustion and long processing times, when faced with the millions of records common in their applications. Extensive experimental results show that the proposed algorithm outperforms most competing algorithms in both efficiency and efficacy.

The Centroid-Based Classifier (CBC) is widely used in text categorization for its theoretical simplicity and computational efficiency. However, its classification accuracy depends strongly on the data distribution, so it yields a misfit model and poor classification performance when the distribution is highly skewed. Thus, a new classification model named the
Gravitation Model (GM) is proposed to solve the imbalanced-data classification problem. In the training phase, each class is weighted by a mass factor, learned from the training data, that reflects the data distribution of that class. In the testing phase, a new document is assigned to the class that exerts the maximum gravitational force on it. A new AAC-SLA algorithm, which combines the Arithmetical Average Centroid (AAC) with the Stochastic Learning Mass (SLA) algorithm, is proposed to solve the gravitation model. Experimental results show that AAC-SLA consistently outperforms CBC and its variants; it also achieves classification accuracy competitive with the best centroid-based method while maintaining more stable performance. A new MEB-SLA algorithm, which combines the Minimum Enclosing Ball (MEB) with the SLA algorithm, is also proposed to solve the gravitation model. The MEB algorithm avoids the influence of randomly distributed samples on the arithmetical average centroid. Experiments indicate that MEB-SLA consistently outperforms AAC-SLA on the text datasets; moreover, MEB-SLA and AAC-SLA outperform SVM on the small datasets.

Finally, using the SFS-LW and SE algorithms proposed in this paper, we produce some "small" datasets whose dimensionalities and instance counts are roughly one tenth of their original sizes. Experiments on these datasets show that AAC-SLA and MEB-SLA outperform the competing SVM (Support Vector Machine); moreover, their accuracies decline only slightly in comparison with models trained on the original datasets. The conclusions of this paper are: (1) MEB-SLA is suitable for learning from small- and medium-scale datasets, while (2) AAC-SLA combined with SE is an ideal choice for learning from large-scale datasets.
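As an illustration of the Sequential Forward Search wrapper underlying SFS-LW, the greedy loop can be sketched as follows. This is a minimal sketch: the dissertation's LW-index formula is not reproduced here, so the hypothetical `score_subset` callback stands in for any subset-quality measure (LW-index, cross-validated accuracy, or otherwise), and all names are illustrative.

```python
# Illustrative sketch of Sequential Forward Search (SFS) over feature indices.
# `score_subset` is a hypothetical stand-in for a subset-quality measure such
# as the LW-index described above; it is NOT the dissertation's actual formula.

def sequential_forward_search(n_features, score_subset, k):
    """Greedily grow a feature subset until it contains k features."""
    selected = []
    remaining = set(range(n_features))
    while len(selected) < k and remaining:
        # Add the single feature whose inclusion maximizes the subset score.
        best = max(remaining, key=lambda f: score_subset(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

if __name__ == "__main__":
    # Toy scorer for demonstration only: prefer subsets with large index sums.
    result = sequential_forward_search(5, lambda s: sum(s), 3)
    print(result)  # greedy picks the largest indices first: [4, 3, 2]
```

The wrapper's cost is dominated by the O(n·k) calls to the scorer, which is exactly why replacing an expensive cross-validation scorer with a cheap statistic such as the LW-index pays off.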
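The gravitation model's decision rule can be sketched as below. This is a hedged illustration, not the dissertation's implementation: the force form m_c / d(x, c)^2 follows the physical analogy (class mass over squared distance to the class centroid), while the exact force formula and the SLA mass-learning step are not reproduced here; all names are illustrative.

```python
import math

# Sketch of the gravitation-model decision rule: each class c has a centroid
# and a learned mass m_c, and a document x is assigned to the class exerting
# the largest "force" m_c / d(x, c)^2 on it. The mass values would be learned
# by SLA in the dissertation; here they are given directly for illustration.

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def classify(x, centroids, masses, eps=1e-12):
    """Return the label of the class with the maximal gravitational force."""
    return max(
        centroids,
        key=lambda c: masses[c] / (euclidean(x, centroids[c]) ** 2 + eps),
    )

if __name__ == "__main__":
    centroids = {"spam": (1.0, 0.0), "ham": (0.0, 1.0)}
    masses = {"spam": 1.0, "ham": 3.0}  # the skewed class gets a larger mass
    # The test point is equidistant from both centroids, so the larger mass
    # decides the outcome, illustrating how masses compensate for skew.
    print(classify((0.5, 0.5), centroids, masses))  # prints "ham"
```

With equal masses this rule degenerates to a nearest-centroid classifier, which is why CBC can be viewed as the special case the gravitation model generalizes.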
Keywords/Search Tags: Text categorization, Machine learning, Large/small dataset, Feature selection, Instance selection