Research And Application Of Sample Selection In Machine Learning

Posted on: 2018-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: M Wang
GTID: 2348330512989076
Subject: Computer system architecture
Abstract/Summary:
With the development of science and technology, the volume of information is growing rapidly. People produce a great deal of data in daily life, and these data contain much valuable information. Handling data at this scale is a major challenge for existing analytical methods and tools, and data mining techniques for such problems have become a major research topic. Data mining has many branches, and one of the most popular is machine learning. As machine learning theory and technology have developed, machine learning algorithms have been applied in specific areas such as license plate recognition, network attack prevention, handwritten character recognition, face recognition, information retrieval, social networks and disease diagnosis.

However, machine learning methods often require large data sets for training: the model is built by discovering regularities in the training data and is then used for prediction. Despite many breakthroughs in optimizing training algorithms, machine learning methods are still burdened by large training sets, and as a result model training takes a long time. These data sets also contain redundant data and outliers. Redundant data are points that are not critical to training; they consume a large share of computing resources, make the training process very time-consuming, and can even reduce the accuracy of the model. To address these data quality problems, this thesis studies existing sample reduction and outlier detection algorithms and proposes a new sample reduction strategy and an improved outlier detection method.

To address the problem of large data scale, this thesis presents a sample reduction method, the Shell Data Selection Algorithm. Since the distribution of a data set is not perfectly uniform, the algorithm deletes the data points near the center of the data set over successive iterations. This preserves the non-redundant data points distributed in the shell region of the sample set, so the training set is reduced in size without compromising the accuracy of the trained model.

Then, based on the shell data selection algorithm, an improved outlier detection strategy is proposed. Traditional detection methods are too complex to be applied directly to large-scale data sets. Moreover, most points in a data set are not outliers, so a traditional outlier detection algorithm spends most of its time traversing non-outliers. To reduce this traversal, the improved algorithm first uses the shell data selection algorithm to remove most of the non-outliers. A bisection algorithm then divides the reduced data set into multiple sub-regions, and the sub-regions are sorted. Finally, the kNN algorithm is used for outlier analysis. This not only retains the effectiveness of the original outlier detection method, but also greatly improves the efficiency of outlier detection.
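The abstract does not give the details of either algorithm, so the following is only a minimal Python sketch of the general idea, not the thesis's implementation. It assumes that "center" means the centroid of the points still retained, that a fixed fraction of the nearest points is dropped in each iteration, and that the kNN outlier score of a point is its distance to its k-th nearest neighbour in the full data set. The function names and the parameters `keep_ratio`, `drop_per_iter` and `k` are hypothetical, and the bisection and sorting of sub-regions is omitted.

```python
import numpy as np

def shell_select(X, keep_ratio=0.3, drop_per_iter=0.1):
    """Return indices of the points that survive iterative center removal.

    Assumption: the "center" is the centroid of the points still retained,
    and each iteration drops the fraction `drop_per_iter` closest to it.
    """
    X = np.asarray(X, dtype=float)
    idx = np.arange(len(X))
    target = max(1, int(np.ceil(keep_ratio * len(X))))
    while len(idx) > target:
        center = X[idx].mean(axis=0)
        dist = np.linalg.norm(X[idx] - center, axis=1)
        # Never drop below the target size; always drop at least one point.
        n_drop = min(len(idx) - target, max(1, int(drop_per_iter * len(idx))))
        order = np.argsort(dist)            # nearest to the center first
        idx = idx[np.sort(order[n_drop:])]  # keep the outer points
    return idx

def knn_outlier_scores(X, candidates, k=5):
    """Score each candidate by the distance to its k-th nearest neighbour.

    Assumption: neighbours are searched in the full data set X; only the
    candidate list is restricted to the shell points.
    """
    X = np.asarray(X, dtype=float)
    scores = np.empty(len(candidates))
    for i, c in enumerate(candidates):
        d = np.linalg.norm(X - X[c], axis=1)
        d[c] = np.inf                       # exclude the point itself
        scores[i] = np.partition(d, k - 1)[k - 1]
    return scores

# Usage: one Gaussian cluster plus a single far-away point.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(size=(200, 2)), [[10.0, 10.0]]])
shell = shell_select(data, keep_ratio=0.2)
scores = knn_outlier_scores(data, shell, k=5)
most_outlying = shell[np.argmax(scores)]    # index into `data`
```

In this sketch only the shell points are scored, which is where the saving described in the abstract would come from: interior points, which are unlikely to be outliers, are never traversed by the kNN step.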
Keywords/Search Tags: Sample selection, Outliers, kNN, Shell data