
The Study Of Data Classification Based On Spatial Geometry

Posted on: 2018-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: L Wang
Full Text: PDF
GTID: 2348330518959432
Subject: Operational Research and Cybernetics
Abstract/Summary:
Data classification is a basic and central task in data mining. With the advent of big data, both the volume and the dimensionality of data keep growing, which severely limits traditional classification techniques. Classifying data accurately and quickly therefore requires several data processing techniques to work together. This thesis divides the classification task into four parts: visualization, feature extraction, data classification, and comparison validation.

The first part analyzes the benefits of reducing the dimensionality of high-dimensional data, such as avoiding the curse of dimensionality, suppressing noise, and making the data visualizable. After comparing the main ideas and the ranges of application of several common dimensionality reduction techniques, the thesis selects principal component analysis as the main dimensionality reduction method for the wheat seeds data. The sphericity test value is 0.788, P is 0, and the two factors explain 88.982% of the variance.

The second part observes that most traditional classification techniques ignore the geometry of the data set in space. Using the concave-convex shape of the data set in space, the thesis analyzes the characteristics and conditions of the critical point set that establishes two kinds of geometry. After dividing the spatial relations of the geometry into overlapping and non-overlapping regions, it carries out the critical point test on the basis of Bayesian probability. In the empirical part, the training samples after dimensionality reduction are used in the critical point test, which yields 8 points in the overlapping region and 13 points in the non-overlapping region.

The third part reviews the basic theory of data classification in statistics and proposes using the support vector machine to minimize structural risk, that is, to minimize the empirical risk and the confidence range together and so improve the generalization ability of the learning machine. The thesis also derives the support vector machine classifier by analyzing maximum-margin classification. Because the classifier depends on several parameters, such as those of the kernel function, the empirical part searches for the best parameter combination by analyzing the dynamic spatial relations among the Gaussian kernel parameter g2, the support vector trade-off parameter C, and the classification accuracy P on the test samples. The best combination found is g2 = 20.30303 and C = 29.3939.

The last part starts from the fact that the support vector machine classifier itself performs a form of feature extraction. To confirm that extracting critical feature points is still necessary, the thesis compares results on the test data in terms of classification accuracy and algorithm running time. The data after feature extraction serve as the experimental group, and the data without extraction serve as the control group. In the experimental group, P is 95% and the running time is 3.3890 s; in the control group, P is 85% and the running time is 4.130 s. Extracting feature data through spatial geometry therefore not only obtains data information quickly and maintains high classification accuracy, but also plays a key role in data classification.
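The figure quoted for the first part (two factors explaining 88.982% of the variance) is a cumulative explained-variance ratio from principal component analysis. As a minimal sketch of how such a ratio can be computed, the standard-library Python example below estimates the leading eigenvalues of the sample covariance matrix by power iteration and divides them by the total variance. The function names, the toy data, and the choice of power iteration are assumptions made for illustration; the abstract does not say which software or algorithm the thesis used, and the wheat seeds data are not reproduced here.

```python
import math
import random


def covariance_matrix(rows):
    """Sample covariance matrix of the data (each row is one sample)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpairs(matrix, k, iterations=500):
    """Leading k eigenpairs of a symmetric positive semidefinite matrix.

    Uses power iteration with deflation (an illustrative choice, not
    necessarily the method used in the thesis).
    """
    d = len(matrix)
    work = [row[:] for row in matrix]
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
        eigenvalue = sum(v[i] * sum(work[i][j] * v[j] for j in range(d))
                         for i in range(d))
        pairs.append((eigenvalue, v))
        # Deflation: remove the component just found before the next pass.
        for i in range(d):
            for j in range(d):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return pairs


def explained_variance_ratio(rows, k=2):
    """Share of total variance carried by each of the first k components."""
    cov = covariance_matrix(rows)
    total = sum(cov[i][i] for i in range(len(cov)))
    return [value / total for value, _ in top_eigenpairs(cov, k)]


if __name__ == "__main__":
    # Hypothetical toy data: a 2 x 1 grid of points, not the wheat seeds data.
    toy = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
    ratios = explained_variance_ratio(toy, k=2)
    print([round(r, 3) for r in ratios], round(sum(ratios), 3))
```

For the toy grid the two variances are 4/3 and 1/3, so the first component carries roughly 80% of the variance and the second roughly 20%. Applied to a seven-feature data set such as wheat seeds, the sum of the first two ratios would play the role of the 88.982% figure reported above.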
Keywords/Search Tags: feature extraction, Bayesian probability, spatial geometry, support vector machine, structural risk