Research on Accelerated Learning Algorithms Based on Partition and Condensation

Posted on: 2019-10-12    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y S Song    Full Text: PDF
GTID: 1368330551958767    Subject: Systems Engineering
Abstract/Summary
Many complex systems operating in the real world, such as gene expression analysis, risk assessment, and economic forecasting, can be abstracted as specific prediction problems, and solving these problems efficiently is of great significance to production and society. Machine learning is an important method for solving such complex prediction problems: it improves its own predictive ability by continuously learning from experience. At this stage, with the rapid development of information technology and the outbreak of big data in various fields, the data scale in many practical applications has shown explosive growth, and existing machine learning methods face great challenges when solving prediction problems at this scale. The study of efficient machine learning algorithms therefore has important practical value and theoretical significance. Supervised learning is the most widely applied and most developed branch of machine learning, and how to efficiently build a learner with strong generalization ability from massive data is one of the key problems in machine learning research. In this paper, efficient supervised learning algorithms for massive data are systematically studied on the basis of data partition and data condensation. The main results are as follows:

(1) To address the long training time of support vector machines on massive data, an efficient support vector machine algorithm based on local geometric information is proposed, following the divide-and-conquer idea. Since the decision function of a support vector machine is determined by a small number of support vectors, linear projection is used to locate the decision boundary in the given data; we then analyze in depth how to partition the data, how to train classifiers on the resulting subsets, and how to fuse those classifiers, and finally construct a highly efficient support vector machine algorithm. Compared with three state-of-the-art acceleration algorithms, the experimental results show that the proposed algorithm achieves similar classification accuracy while greatly improving efficiency.

(2) Based on optimization theory, a mechanism analysis of the partition-based k-nearest neighbor classification acceleration algorithm is given. Exploiting the local character of k-nearest neighbor classification, the search for the k neighbors is transformed into an optimization problem, and the difference between the optimal objective values of the original problem and the problem restricted by data partition is estimated. On this basis, clustering is used to reduce this difference, and a k-nearest neighbor classification acceleration algorithm based on k-means clustering is designed (a minimal sketch of this partition-then-search scheme follows below), which provides a research direction for data processing in the massive-data setting. The mechanism analysis consolidates the theoretical basis of accelerated learning algorithms based on data partition.
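The sketch below illustrates the partition-then-search idea behind the k-means-based acceleration in (2). It is not the dissertation's exact algorithm: the cluster count, the single-cluster search, and the majority-vote prediction are illustrative assumptions.

import numpy as np

def kmeans(X, n_clusters, n_iters=20, seed=0):
    # Plain Lloyd iterations: assign points to nearest center, then
    # move each center to the mean of its assigned points.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    # Recompute the final assignment so it matches the returned centers.
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    return centers, labels

def knn_predict(query, X, y, centers, assign, k=5):
    # Restrict the neighbor search to the cluster whose center is
    # closest to the query; this is the partition-induced approximation
    # whose error the mechanism analysis in (2) quantifies.
    c = np.argmin(((centers - query) ** 2).sum(-1))
    Xc, yc = X[assign == c], y[assign == c]
    idx = np.argsort(((Xc - query) ** 2).sum(-1))[:k]
    return np.bincount(yc[idx]).argmax()

# Usage: partition once, then answer many queries cheaply.
X = np.random.rand(10000, 8)
y = (X.sum(axis=1) > 4).astype(int)
centers, assign = kmeans(X, n_clusters=16)
print(knn_predict(np.random.rand(8), X, y, centers, assign))

Searching only the query's nearest cluster cuts the per-query cost from O(n) distance computations to roughly O(n / n_clusters), which is the speedup that the partition-based analysis above trades against approximation error.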
(3) To address the high computational cost of the k-nearest neighbor algorithm at prediction time, a k-nearest neighbor classification algorithm and a k-nearest neighbor regression algorithm based on the idea of sample estimation are proposed. For classification, by exploring the distribution of the training instances in the input space, we present a data stratification mechanism and give an efficient k-nearest neighbor classification algorithm that predicts by combining the characteristics of stratified sampling. For regression, based on the local character of k-nearest neighbor regression, we construct a measure that evaluates the contribution of a single instance to the regression model, give criteria for identifying noise instances and uninformative instances, and construct an effective mechanism to remove them. The experimental results show higher execution efficiency and a lower instance storage rate than five state-of-the-art condensation mechanisms for the k-nearest neighbor algorithm.

(4) To address the inefficiency of computing the gradient when training the logistic regression algorithm on large-scale data, an accelerated logistic regression algorithm based on on-demand sampling is proposed. According to optimization theory, a criterion is given that guarantees the gradient estimated on a sample set is a descent direction of the objective function. On this basis, the multivariate estimation problem that satisfies the criterion is converted into several univariate estimation problems, and an adaptive mechanism is designed that determines the sample size from the information in the sample already drawn (a hedged sketch of such an on-demand sampling loop follows below). In addition, the gradient estimate obtained by this algorithm is theoretically proved to be a descent direction of the current objective function. The proposed algorithm removes the difficulty that random sampling must fix the sample size in advance, and it deepens the theoretical research on accelerated learning algorithms based on random sampling.

In summary, to address the low execution efficiency of traditional machine learning algorithms on massive data, a series of accelerated learning algorithms based on data partition and data condensation is proposed. The experimental results verify the effectiveness and efficiency of the proposed algorithms and provide new technical support for intelligent information processing.
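Below is a hedged sketch of an on-demand sampling loop in the spirit of (4). The variance-based acceptance test ("norm test") and the doubling growth rule are common choices from the adaptive-sampling literature and are assumptions here, not necessarily the dissertation's exact criterion; the name sampled_gradient is hypothetical.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_example_grads(w, X, y):
    # Gradient of the logistic loss for each sampled example, shape (m, d).
    return (sigmoid(X @ w) - y)[:, None] * X

def sampled_gradient(w, X, y, m0=64, theta=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    n, m = len(X), m0
    while True:
        idx = rng.choice(n, size=m, replace=False)
        G = per_example_grads(w, X[idx], y[idx])
        g = G.mean(axis=0)
        # Accept when the sample variance of the gradient estimate is
        # small relative to ||g||^2, so that -g is a descent direction
        # with high probability; otherwise enlarge the sample.
        var = G.var(axis=0).sum() / m
        if var <= (theta * np.linalg.norm(g)) ** 2 or m >= n:
            return g, m
        m = min(2 * m, n)

# One gradient-descent step with an adaptively sized sample.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 10)); w_true = rng.normal(size=10)
y = (X @ w_true > 0).astype(float)
w = np.zeros(10)
g, m = sampled_gradient(w, X, y, rng=rng)
w -= 0.1 * g
print(f"sample size used: {m}")

The loop enlarges the sample only when the current estimate is too noisy to be trusted as a descent direction, which is the practical payoff of not fixing the sample size in advance.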
Keywords/Search Tags: Large-scale data, Data partition, Data condensation, Local information, Support vector machine, Random sampling, Nearest neighbor, Logistic regression