
Improved Word Embedding And K-nearest Neighbor Algorithm For Chinese Text Classification

Posted on: 2022-03-30
Degree: Master
Type: Thesis
Country: China
Candidate: C J Ma
Full Text: PDF
GTID: 2518306551498324
Subject: Applied Mathematics
Abstract/Summary:
Driven by the development of Internet technology and the growth of mobile social networks in China, the amount of Chinese text information is increasing rapidly and holds great potential value. Classifying Chinese text quickly and accurately is therefore an important research problem. On this basis, this thesis improves two stages of text classification, word embedding and classifier construction, in order to raise the accuracy and computational efficiency of Chinese information processing (CLP). The two proposed methods are then applied to classify tourists, which verifies their effectiveness and practicality.

Firstly, this thesis improves the word embedding method. Unlike traditional methods designed for alphabetic languages, it draws on the three-dimensional structure of Chinese characters (sound, shape and meaning) to improve the continuous bag-of-words model, and proposes a two-channel model that includes internal features of Chinese characters (pronunciation and glyph form) and external features (semantics derived from context information). Experimental comparison with related algorithms shows that the two-channel model based on continuous bag-of-words is more effective for Chinese word embedding.

Then, this thesis improves the classifier. To address the large amount of redundant computation in the kNN algorithm, a clustering algorithm divides the sample data into multiple clusters, and a double objective function is used to obtain the clusters and cluster centers closer to the point to be classified. This screens the samples and increases computing speed. Experimental comparison with related algorithms shows that the resulting kNN algorithm based on double filtering (TS-kNN) improves classification speed without affecting accuracy.

Finally, to demonstrate the practicality of the algorithms, this thesis collects tourist review data with web crawlers, vectorizes the text with the two-channel model based on continuous bag-of-words, and classifies the tourists with TS-kNN. Based on the classification results, it explains the main characteristics of each tourist type and offers suggestions that tourist attractions can use to improve tourist satisfaction. Together, the improved two-channel model for word embedding and the TS-kNN algorithm for classification greatly improve the accuracy and computational efficiency of Chinese text processing, and the worked example shows that the algorithms have strong practical value.
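
The sample-screening idea behind a cluster-filtered kNN can be illustrated with a minimal sketch. This is not the thesis's TS-kNN: the abstract does not specify the double objective function, so the sketch assumes a simpler rule (keep only the clusters whose centers are nearest to the query) and uses a plain k-means as the clustering step. All names (`kmeans`, `filtered_knn`, `n_probe`) and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_center(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda c: dist(p, centers[c]))


def kmeans(points, n_clusters, n_iter=20, seed=0):
    """Plain k-means (an assumed stand-in for the thesis's clustering step)."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_clusters)
    for _ in range(n_iter):
        assign = [nearest_center(p, centers) for p in points]
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # leave the center unchanged if its cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    assign = [nearest_center(p, centers) for p in points]
    return centers, assign


def filtered_knn(query, points, labels, centers, assign, k=3, n_probe=1):
    """kNN restricted to the n_probe clusters nearest to the query."""
    # Stage 1: rank clusters by center distance and keep the closest ones.
    ranked = sorted(range(len(centers)), key=lambda c: dist(query, centers[c]))
    keep = set(ranked[:n_probe])
    # Stage 2: ordinary kNN vote over the surviving samples only.
    candidates = [
        (dist(query, p), y)
        for p, y, a in zip(points, labels, assign)
        if a in keep
    ]
    candidates.sort(key=lambda t: t[0])
    votes = Counter(y for _, y in candidates[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D "embeddings" with made-up labels, for illustration only.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = ["negative"] * 3 + ["positive"] * 3

centers, assign = kmeans(points, n_clusters=2)
pred = filtered_knn((4.9, 5.2), points, labels, centers, assign, k=3)
print(pred)
```

In this sketch the query is compared against the cluster centers first and then only against the members of the retained clusters, rather than against every training sample, which is where the saving in distance computations comes from. Raising `n_probe` trades some of that saving for a lower risk of missing true neighbors that sit just across a cluster boundary.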
Keywords/Search Tags:Chinese text classification, Improved continuous bag-of-words model, Improved k-nearest neighbor algorithm, Tourist classification