Research On KNN Algorithm Optimization Issues In Text Classification

Posted on: 2019-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: F F Wang
GTID: 2348330566464296
Subject: Engineering
Abstract/Summary:
The rapid development of communication technology has driven explosive growth in the volume of text messages. Text classification brings order to this unstructured text and is valuable in information retrieval and related tasks, where it saves search time and improves search efficiency. Among the many text classification techniques, the k-nearest neighbors (KNN) algorithm is a mature and stable method. However, KNN has several shortcomings. First, KNN must keep the entire training set available at classification time, so a large training set occupies a great deal of storage space. To classify a text, KNN computes the distance between that text and every known sample and then selects the K samples with the shortest distances. When the text vectors are high-dimensional and the sample set is large, the time and space complexity of the algorithm grow accordingly, and much of the computation is wasted. Second, KNN measures the distance between samples with the Euclidean distance, which gives every dimension the same weight. Equal weighting makes the distance calculation imprecise, which in turn reduces classification accuracy to some extent. Finally, the choice of K is critical. If K is too large, the neighborhood includes samples unrelated to the text being classified, which increases the amount of computation and degrades the classification result. If K is too small, the neighborhood misses samples that are related to the text being classified and that may matter for deciding its class, which leads to classification errors.

To address these problems, this thesis studies text preprocessing and the optimization of K. The first contribution targets three weaknesses of KNN that appear when the sample set in the vector space model (VSM) is large: the large storage footprint, the large amount of wasted computation, and the loss of accuracy caused by equal weights in every dimension. The thesis combines a weighted-PCA SOM neural network with KNN to propose an improved PCA-SOM-KNN algorithm that is more accurate and faster than KNN. A SOM neural network is used to reduce the dimensionality of the data, and principal component analysis is used to correct the distortion that arises when the SOM network maps high-dimensional data to low dimensions. The variance contribution rate of each principal component is introduced into the Euclidean distance function as a weight, which addresses the accuracy loss caused by equal weights. Experiments show that the proposed PCA-SOM-KNN algorithm effectively reduces the vector dimensionality and improves classification accuracy compared with traditional KNN.

The second contribution targets the large amount of computation in KNN text classification, its high resource usage, and the difficulty of choosing K. The thesis optimizes KNN with the K-means algorithm and a genetic algorithm. K-means clusters the training samples by Euclidean distance into m clusters. For each sample to be classified, the distances to the center vectors of the m clusters are computed, and the samples in the n closest clusters are selected. KNN classification is then applied to those samples only. The number of clusters m, the value of K in KNN, and the number of close clusters n are obtained through the iterative computation of the genetic algorithm.

Overall, the thesis improves the traditional KNN text classification algorithm in three respects: reducing the dimensionality of text vectors, reducing computation cost, and optimizing the value of K, with the aim of increasing classification accuracy. Experiments on data sets show that the proposed methods achieve better classification results.
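
The cluster-pruned KNN step described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names (`euclidean`, `closest_clusters`, `kmeans`, `pruned_knn`) are hypothetical, the toy 2-D points stand in for text vectors, the K-means initialization is a naive deterministic one, and m, n, and K are fixed by hand here rather than searched by a genetic algorithm.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Plain (unweighted) Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_clusters(point, centroids, n):
    """Indices of the n centroids nearest to point, closest first."""
    order = sorted(range(len(centroids)),
                   key=lambda c: euclidean(point, centroids[c]))
    return order[:n]


def kmeans(points, m, iterations=20):
    """Lloyd's k-means with a naive deterministic start (assumes 1 <= m <= len(points))."""
    step = len(points) // m
    centroids = [list(p) for p in points[::step][:m]]  # evenly spaced samples as seeds
    for _ in range(iterations):
        assignment = [closest_clusters(p, centroids, 1)[0] for p in points]
        for c in range(m):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    assignment = [closest_clusters(p, centroids, 1)[0] for p in points]
    return centroids, assignment


def pruned_knn(query, points, labels, centroids, assignment, n, k):
    """Vote among the k nearest samples drawn only from the n closest clusters."""
    close = set(closest_clusters(query, centroids, n))
    candidates = [(euclidean(query, p), y)
                  for p, y, a in zip(points, labels, assignment) if a in close]
    candidates.sort(key=lambda pair: pair[0])
    votes = Counter(y for _, y in candidates[:k])
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated groups.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
labels = ["sports", "sports", "sports", "finance", "finance", "finance"]

centroids, assignment = kmeans(points, m=2)
print(pruned_knn([0.1, 0.1], points, labels, centroids, assignment, n=1, k=3))  # prints "sports"
```

With n smaller than m, each query is compared against m centroids plus the members of n clusters instead of against every training sample, which is where the computational saving described above comes from. Setting n equal to m recovers ordinary KNN over the full training set.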
Keywords/Search Tags:text classification, KNN, genetic algorithm, SOM neural network, K-means