
Improved K-nearest Neighbor Algorithm And Its Application In Text Analysis

Posted on: 2021-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: A Sun
Full Text: PDF
GTID: 2428330614963650
Subject: Applied statistics
Abstract/Summary:
With the development of the Internet, the real economy and the Internet have become ever more closely integrated, and consumption is shifting from offline to online. A large amount of review information about real-world consumption has accumulated on the Internet. These reviews are an important reference for customers making online purchases, and equally important for businesses making operating decisions. It is therefore valuable to mine review data and extract the concerns and sentiment expressed in customer consumption reviews.

The k-nearest neighbor algorithm is widely used in machine learning and data mining because its theory is simple and it is easy to implement. However, the traditional k-nearest neighbor algorithm cannot cope with the high feature dimensionality of text analysis, nor with the loss of semantic information after text is digitized. The focus of this thesis is to improve and optimize the traditional k-nearest neighbor algorithm so that it is better suited to text analysis. The main contributions are as follows.

First, the word vectors obtained after text segmentation ignore contextual semantic relationships, which lowers the algorithm's accuracy. To address this, a combination feature is introduced: connected entity words and sentiment modifiers are joined into a single feature, so that the word vectors retain basic semantic relations and the accuracy of the algorithm improves.

Second, a comprehensive feature-selection index (TF-GINI) is constructed from TF-IDF and Gini impurity. It compensates for TF-IDF's neglect of the class labels in supervised learning samples, reduces the feature dimensionality, and improves the efficiency of the algorithm. The TF-GINI values are then used as weights in a weighted k-nearest neighbor algorithm to improve its goodness of fit. Simulation experiments show that the weighted k-nearest neighbor algorithm after feature selection achieves high accuracy and fast fitting.

Lastly, the k-nearest neighbor algorithm merely stores the training samples during the training phase, so its storage cost is high when the data set is large. In the prediction phase, the entire sample is searched for neighbors and the discrete structure of the data set itself is ignored, so the neighbor search is slow and the neighbor samples are of low quality. To solve this problem, a k-nearest neighbor algorithm based on k-means clustering is proposed: the k-means algorithm partitions the data set into multiple dense subsets, and the k-nearest neighbor algorithm is fitted within a subset, which improves both the quality of the neighbor samples and the speed of the neighbor search, and thus the overall performance of the algorithm. Training prediction models independently on multiple subsets also lends itself to distributed storage and computing. Experiments show that the k-nearest neighbor algorithm based on k-means clustering achieves the best fitting effect.
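As a sketch of the combination-feature idea above: the function below joins an entity word with the sentiment modifier that immediately follows it into a single token, so a bag-of-words representation keeps their basic semantic relation. The part-of-speech tags ("n" for entity nouns, "a" for sentiment modifiers) and the adjacency pairing rule are illustrative assumptions; the abstract does not specify the exact pairing rule.

```python
# Hypothetical sketch of combination features: pair each entity word with
# the sentiment modifier that follows it, so "service excellent" becomes
# one feature "service_excellent" instead of two unrelated tokens.

def combine_features(tokens, pos_tags):
    """Merge adjacent (entity word, sentiment modifier) pairs into one token."""
    combined = []
    i = 0
    while i < len(tokens):
        # Assumed rule: a noun ("n") immediately followed by an
        # adjective ("a") forms a combination feature.
        if (i + 1 < len(tokens)
                and pos_tags[i] == "n" and pos_tags[i + 1] == "a"):
            combined.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            combined.append(tokens[i])
            i += 1
    return combined

tokens = ["service", "excellent", "price", "high"]
tags = ["n", "a", "n", "a"]
print(combine_features(tokens, tags))  # ['service_excellent', 'price_high']
```

With this pairing, "price high" and "quality high" remain distinguishable features after vectorization, which is the semantic relation the plain bag-of-words model loses.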
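The TF-GINI index could be sketched as follows. The abstract does not give the exact combination formula, so the version below, which scales a term's TF-IDF score by its class purity (one minus the Gini impurity of the labels of the documents containing the term), is only one plausible reading; the resulting values then serve as per-feature weights in a weighted distance for k-nearest neighbor search.

```python
# Sketch of a TF-GINI feature score (assumed form: TF-IDF x class purity)
# and a TF-GINI-weighted Euclidean distance for weighted k-NN.
import math
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def tf_gini_weights(docs, labels, vocab):
    """One reading of TF-GINI: TF-IDF scaled by class purity (1 - Gini).

    docs: list of token lists; labels: the class of each document.
    A term concentrated in one class gets purity near 1 and keeps its
    TF-IDF weight; a term spread evenly across classes is damped.
    """
    n = len(docs)
    total_tokens = sum(len(d) for d in docs)
    weights = {}
    for term in vocab:
        containing = [lab for doc, lab in zip(docs, labels) if term in doc]
        df = len(containing)
        if df == 0:
            weights[term] = 0.0
            continue
        tf = sum(doc.count(term) for doc in docs) / total_tokens
        idf = math.log(n / df)
        purity = 1.0 - gini_impurity(containing)
        weights[term] = tf * idf * purity
    return weights

def weighted_distance(x, y, w):
    """Euclidean distance with per-feature (e.g. TF-GINI) weights."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))
```

Keeping only the top-scoring terms reduces the feature dimension, and feeding the same scores into `weighted_distance` gives the weighted k-nearest neighbor variant the abstract describes.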
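The k-means-based k-nearest neighbor scheme described above can be sketched as: cluster the training set with k-means, then restrict the neighbor search for a query to the cluster whose centroid is nearest the query. The plain k-means implementation, the parameter names, and the tie-breaking details are illustrative assumptions, not the thesis's exact procedure.

```python
# Sketch: k-NN restricted to the nearest k-means cluster, so the neighbor
# search scans one dense subset instead of the whole training set.
import math
import random
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns centroids and each point's cluster index."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: euclidean(p, centroids[c]))
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:  # keep the old centroid if the cluster emptied
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def cluster_knn_predict(query, X, y, centroids, assign, k_neighbors=3):
    """k-NN vote restricted to the cluster nearest the query."""
    c = min(range(len(centroids)), key=lambda i: euclidean(query, centroids[i]))
    subset = [i for i in range(len(X)) if assign[i] == c]
    if not subset:  # fall back to the full set if the cluster is empty
        subset = list(range(len(X)))
    subset.sort(key=lambda i: euclidean(query, X[i]))
    votes = Counter(y[i] for i in subset[:k_neighbors])
    return votes.most_common(1)[0][0]
```

Because each subset holds its own training points, the subsets can be stored and queried on separate machines, which is the distributed storage and computing benefit the abstract notes.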
Keywords/Search Tags: Data Mining, K-Means Clustering, K-Nearest Neighbor Algorithm, Feature Selection, Gini Impurity