In recent years, driven by the explosive growth of the Internet, data such as text, audio, and images on the network have increased almost exponentially. Enabling computers to process, identify, and analyze these structured and unstructured massive data efficiently and accurately poses new challenges to both industry and academia. For a computer to process text, a text representation step is generally required: the text is encoded as a data type the computer can operate on (such as a numeric vector) for subsequent feature engineering. Common traditional text representation models include the Boolean model, the bag-of-words model, the LDA model, and the word embedding (also known as word vector) model. In machine learning, word embedding models are now the most widely used; common examples include Word2vec and GloVe.

In the widely used Word2vec algorithm, neither the CBOW variant (suited to small corpora) nor the skip-gram variant (suited to large corpora) applies any smoothing when modeling context semantics. Specifically, the CBOW model forms its projection layer by a simple unweighted average over the word vectors of the surrounding context words, and skip-gram samples (center word, context word) pairs within the sliding window with equal probability. Both rest on an inappropriate default assumption: that within the sliding window, every context word influences the center word equally, regardless of its distance.

To address this problem, this paper proposes a Word2vec algorithm based on context distance (Context Distance Based Word2vec, CDB-Word2vec), which optimizes both models of the original Word2vec. Building on the original CBOW algorithm, we propose CDB-CBOW (Context Distance Based CBOW), which assigns each context word a weight that increases as its distance to the center word decreases. Building on the original skip-gram algorithm, we propose CDB-skip-gram (Context Distance Based skip-gram), which assigns each context word in the sliding window a sampling probability that likewise increases as its distance to the center word decreases. Since the quality of the word vectors a model produces is hard to evaluate directly, a common approach is to use them in a downstream task (such as text sentiment classification) and infer their quality from the final classification accuracy. Accordingly, this paper trains CDB-Word2vec word vectors on a Wikipedia corpus and evaluates them in text sentiment classification experiments. The experiments show that the word vectors generated by CDB-Word2vec are of higher quality than those of the original Word2vec algorithm.
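The two modifications can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear weighting function is an assumption (the text only specifies that weights and sampling probabilities grow as the distance to the center word shrinks), and all function names are hypothetical.

```python
import random

def distance_weights(window_size):
    """Illustrative linear scheme: context positions nearer to the center
    word get larger weights; weights are normalized to sum to 1."""
    # Offsets -window..-1 and 1..window; raw weight = window - |offset| + 1.
    raw = {off: window_size - abs(off) + 1
           for off in range(-window_size, window_size + 1) if off != 0}
    total = sum(raw.values())
    return {off: w / total for off, w in raw.items()}

def cdb_cbow_projection(context_vectors, window_size):
    """CDB-CBOW projection layer: a distance-weighted average of the
    context word vectors, instead of CBOW's plain mean.
    context_vectors maps an offset from the center word to a vector."""
    weights = distance_weights(window_size)
    dim = len(next(iter(context_vectors.values())))
    proj = [0.0] * dim
    for off, vec in context_vectors.items():
        for i, v in enumerate(vec):
            proj[i] += weights[off] * v
    return proj

def cdb_skipgram_sample(offsets, window_size, k, rng=random):
    """CDB-skip-gram sampling: draw k context offsets with probability
    proportional to the distance weight, instead of uniformly."""
    weights = distance_weights(window_size)
    population = list(offsets)
    return rng.choices(population, weights=[weights[o] for o in population], k=k)
```

With a window of size 2, for example, the offsets ±1 receive twice the weight of the offsets ±2, so immediate neighbors dominate both the CBOW-style average and the skip-gram-style sampling.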