Research On Text Classification Based On Firefly Algorithm And Improved KNN

Posted on: 2021-03-10
Degree: Master
Type: Thesis
Country: China
Candidate: C Zhao
Full Text: PDF
GTID: 2428330614958341
Subject: Electronic and communication engineering
Abstract/Summary:
In recent years, with the rapid development of information technology, network users have become not only consumers of information but also producers of it. The network is flooded with unorganized information in text form, and in the face of such massive data it is difficult for users to find the information that is valuable to them. Text classification is a key technology for solving this problem, since it can effectively organize and manage the text data on the network. However, current text classification techniques still have problems such as low accuracy of the selected feature subset, high feature dimensionality, and low classification efficiency. To address these problems, this thesis makes improvements in the following two aspects:

1. To address the low accuracy of the feature subsets obtained by traditional feature selection methods, a text feature selection model based on information gain and the firefly algorithm is proposed. First, the information gain method is used to select, from all feature words, a pre-selected feature set with large information gain values; the firefly algorithm is then used to search this set for a better feature subset. To overcome the firefly algorithm's slow convergence and its tendency to fall into local optima, a dynamically updated step factor is introduced. In the early stage of the search the step size is relatively large, which supports a good global search; in the later stage the step size gradually decreases as the number of iterations increases, which preserves the algorithm's local search performance and lets it reach the global optimum quickly. The experimental results show that the feature subset selected by the improved algorithm combined with information gain is more accurate than those selected by the original algorithm and by information gain. The feature selection model can effectively improve the accuracy of text classification.

2. To address the low classification efficiency of the k-nearest neighbor algorithm when it faces a large number of training samples, a fast k-nearest neighbor classification algorithm based on clustering and center vectors is proposed. First, the training texts of each category are clustered. The texts of each category are then divided into an inner region and a boundary region, and the center vector is calculated. When a test text is classified, the decision can be made quickly according to its distance from the center vector and the average within-class distance. If no decision can be made in this way, the distance between the test text and the center of each cluster is calculated, and a training sample subset is formed from all the texts in the clusters that are relatively close to it. Finally, the k-nearest neighbor algorithm makes the classification decision on this subset. The experimental results show that the classification performance of the improved algorithm is similar to that of the traditional k-nearest neighbor algorithm, while the classification time is significantly reduced, which effectively improves the efficiency of text classification.
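
The dynamic step factor described in the first contribution can be illustrated with a minimal Python sketch. The abstract does not give the update formula, so the linear decay in `step_factor`, the names `alpha_start` and `alpha_end`, and all default values are assumptions made for illustration only; `move_firefly` follows the standard firefly position update rather than the thesis's own encoding of feature subsets.

```python
import math
import random


def step_factor(t, max_iter, alpha_start=0.5, alpha_end=0.01):
    """Hypothetical schedule: the step shrinks linearly from alpha_start
    (iteration 0, wide global search) to alpha_end (last iteration,
    fine local search). The thesis may use a different decay law."""
    frac = t / max_iter
    return alpha_start + (alpha_end - alpha_start) * frac


def move_firefly(x_i, x_j, alpha, beta0=1.0, gamma=1.0, rng=random):
    """Move firefly i toward a brighter firefly j (standard update).

    Attractiveness decays with squared distance; alpha scales the
    random perturbation, so a smaller alpha means a more local move.
    """
    r2 = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    beta = beta0 * math.exp(-gamma * r2)
    return [
        a + beta * (b - a) + alpha * (rng.random() - 0.5)
        for a, b in zip(x_i, x_j)
    ]


if __name__ == "__main__":
    max_iter = 5
    for t in range(max_iter + 1):
        print(t, round(step_factor(t, max_iter), 3))
```

With a schedule of this kind, early iterations perturb each firefly strongly and later iterations only refine positions near the best candidates found so far, which is the behavior the abstract attributes to the improved algorithm.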
Keywords/Search Tags:text classification, firefly algorithm, feature selection, clustering, central vector
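
The second contribution in the abstract (fast k-nearest neighbor classification using clustering and center vectors) can be sketched roughly as follows. This is an illustrative outline, not the thesis's implementation: the clusters are assumed to be precomputed by any clustering method, the quick-decision rule (the test vector lies within the average within-class distance of exactly one class center) is an assumed stand-in for the inner/boundary region test, and `build_model`, `classify`, `n_clusters`, and the toy data are hypothetical.

```python
import math
from collections import Counter


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean (center vector) of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_model(clusters_by_class):
    """clusters_by_class: {label: [cluster, ...]}, each cluster a list of
    vectors. Stores each class center, its average within-class distance,
    and every cluster together with its own center."""
    model = {}
    for label, clusters in clusters_by_class.items():
        members = [v for c in clusters for v in c]
        center = mean(members)
        avg_dist = sum(dist(v, center) for v in members) / len(members)
        model[label] = {
            "center": center,
            "avg_dist": avg_dist,
            "clusters": [(mean(c), c) for c in clusters],
        }
    return model


def classify(x, model, k=3, n_clusters=2):
    # Quick decision (assumed rule): x falls inside the average
    # within-class radius of exactly one class center.
    inside = [
        label for label, m in model.items()
        if dist(x, m["center"]) <= m["avg_dist"]
    ]
    if len(inside) == 1:
        return inside[0]
    # Fallback: rank all clusters by the distance from x to their centers
    # and keep the members of the n_clusters nearest ones.
    ranked = sorted(
        (
            (dist(x, c_center), label, members)
            for label, m in model.items()
            for c_center, members in m["clusters"]
        ),
        key=lambda item: item[0],
    )
    subset = [(v, label) for _, label, members in ranked[:n_clusters]
              for v in members]
    # Ordinary k-nearest neighbor majority vote on the reduced subset.
    neighbors = sorted(subset, key=lambda p: dist(x, p[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


if __name__ == "__main__":
    toy = {
        "sports": [[[0, 0], [0, 1]], [[1, 0], [1, 1]]],
        "tech": [[[5, 5], [5, 6]], [[6, 5], [6, 6]]],
    }
    model = build_model(toy)
    print(classify([0.4, 0.6], model))
    print(classify([2, 2], model))
```

The saving comes from the fallback path computing distances only to cluster centers and to the members of a few nearby clusters, instead of to every training text as in plain k-nearest neighbor classification.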