
Improved K-nearest Neighbor Algorithm And Its Application In Text Analysis

Posted on: 2021-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: A Sun
Full Text: PDF
GTID: 2428330614963650
Subject: Applied statistics
Abstract/Summary:
With the development of the Internet, the real economy and the Internet have become ever more closely integrated, and consumption is shifting from offline to online. A large amount of review information about real-world consumption has accumulated on the Internet. These reviews are an important reference for customers making online purchases, and equally important for businesses making operating decisions. It is therefore valuable to mine review data and extract the concerns and sentiment expressed in customer consumption reviews.

The k-nearest neighbor algorithm is widely used in machine learning and data mining because its theory is simple and it is easy to implement. However, the traditional k-nearest neighbor algorithm cannot cope with the high feature dimensionality of text analysis, nor with the loss of semantic information after text is digitized. The focus of this thesis is to improve and optimize the traditional k-nearest neighbor algorithm so that it is better suited to text analysis. The main contributions are as follows.

First, the word vectors obtained after text segmentation ignore contextual semantic relationships, which lowers the algorithm's accuracy. To address this, a combination feature is introduced: connected entity words and sentiment modifiers are joined into a single feature, so that the word vectors retain basic semantic relations and the accuracy of the algorithm improves.

Second, a comprehensive feature-selection index (TF-GINI) is constructed from TF-IDF and Gini impurity. It compensates for TF-IDF's neglect of the class labels in supervised learning samples, reduces the feature dimensionality, and improves the efficiency of the algorithm. The TF-GINI values are then used as weights in a weighted k-nearest neighbor algorithm to improve its goodness of fit. Simulation experiments show that the weighted k-nearest neighbor algorithm after feature selection achieves high accuracy and fast fitting.

Lastly, the k-nearest neighbor algorithm merely stores the training samples during the training phase, so its storage cost is high when the data set is large. In the prediction phase, the entire sample is searched for neighbors and the discrete structure of the data set itself is ignored, so the neighbor search is slow and the neighbor samples are of low quality. To solve this problem, a k-nearest neighbor algorithm based on k-means clustering is proposed: the k-means algorithm partitions the data set into multiple dense subsets, and the k-nearest neighbor algorithm is fitted within a subset, which improves both the quality of the neighbor samples and the speed of the neighbor search, and thus the overall performance of the algorithm. Training prediction models independently on multiple subsets also lends itself to distributed storage and computing. Experiments show that the k-nearest neighbor algorithm based on k-means clustering achieves the best fitting effect.
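As a sketch of the combination-feature idea above: the function below joins an entity word with the sentiment modifier that immediately follows it into a single token, so a bag-of-words representation keeps their basic semantic relation. The part-of-speech tags ("n" for entity nouns, "a" for sentiment modifiers) and the adjacency pairing rule are illustrative assumptions; the abstract does not specify the exact pairing rule.

```python
# Hypothetical sketch of combination features: pair each entity word with
# the sentiment modifier that follows it, so "service excellent" becomes
# one feature "service_excellent" instead of two unrelated tokens.

def combine_features(tokens, pos_tags):
    """Merge adjacent (entity word, sentiment modifier) pairs into one token."""
    combined = []
    i = 0
    while i < len(tokens):
        # Assumed rule: a noun ("n") immediately followed by an
        # adjective ("a") forms a combination feature.
        if (i + 1 < len(tokens)
                and pos_tags[i] == "n" and pos_tags[i + 1] == "a"):
            combined.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            combined.append(tokens[i])
            i += 1
    return combined

tokens = ["service", "excellent", "price", "high"]
tags = ["n", "a", "n", "a"]
print(combine_features(tokens, tags))  # ['service_excellent', 'price_high']
```

With this pairing, "price high" and "quality high" remain distinguishable features after vectorization, which is the semantic relation the plain bag-of-words model loses.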
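The TF-GINI index could be sketched as follows. The abstract does not give the exact combination formula, so the version below, which scales a term's TF-IDF score by its class purity (one minus the Gini impurity of the labels of the documents containing the term), is only one plausible reading; the resulting values then serve as per-feature weights in a weighted distance for k-nearest neighbor search.

```python
# Sketch of a TF-GINI feature score (assumed form: TF-IDF x class purity)
# and a TF-GINI-weighted Euclidean distance for weighted k-NN.
import math
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def tf_gini_weights(docs, labels, vocab):
    """One reading of TF-GINI: TF-IDF scaled by class purity (1 - Gini).

    docs: list of token lists; labels: the class of each document.
    A term concentrated in one class gets purity near 1 and keeps its
    TF-IDF weight; a term spread evenly across classes is damped.
    """
    n = len(docs)
    total_tokens = sum(len(d) for d in docs)
    weights = {}
    for term in vocab:
        containing = [lab for doc, lab in zip(docs, labels) if term in doc]
        df = len(containing)
        if df == 0:
            weights[term] = 0.0
            continue
        tf = sum(doc.count(term) for doc in docs) / total_tokens
        idf = math.log(n / df)
        purity = 1.0 - gini_impurity(containing)
        weights[term] = tf * idf * purity
    return weights

def weighted_distance(x, y, w):
    """Euclidean distance with per-feature (e.g. TF-GINI) weights."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))
```

Keeping only the top-scoring terms reduces the feature dimension, and feeding the same scores into `weighted_distance` gives the weighted k-nearest neighbor variant the abstract describes.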
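The k-means-based k-nearest neighbor scheme described above can be sketched as: cluster the training set with k-means, then restrict the neighbor search for a query to the cluster whose centroid is nearest the query. The plain k-means implementation, the parameter names, and the tie-breaking details are illustrative assumptions, not the thesis's exact procedure.

```python
# Sketch: k-NN restricted to the nearest k-means cluster, so the neighbor
# search scans one dense subset instead of the whole training set.
import math
import random
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns centroids and each point's cluster index."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: euclidean(p, centroids[c]))
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:  # keep the old centroid if the cluster emptied
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def cluster_knn_predict(query, X, y, centroids, assign, k_neighbors=3):
    """k-NN vote restricted to the cluster nearest the query."""
    c = min(range(len(centroids)), key=lambda i: euclidean(query, centroids[i]))
    subset = [i for i in range(len(X)) if assign[i] == c]
    if not subset:  # fall back to the full set if the cluster is empty
        subset = list(range(len(X)))
    subset.sort(key=lambda i: euclidean(query, X[i]))
    votes = Counter(y[i] for i in subset[:k_neighbors])
    return votes.most_common(1)[0][0]
```

Because each subset holds its own training points, the subsets can be stored and queried on separate machines, which is the distributed storage and computing benefit the abstract notes.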
Keywords/Search Tags: Data Mining, K-Means Clustering, K-Nearest Neighbor Algorithm, Feature Selection, Gini Impurity