Research And Application Of Chinese Short Text Clustering Algorithm Based On Word2Vec

Posted on: 2019-10-21
Degree: Master
Type: Thesis
Country: China
Candidate: C Ma
GTID: 2428330566970851
Subject: Computer application technology
Abstract/Summary:
The development of the Internet has spawned many social networking products. In addition to the well-known WeChat, Weibo, forums, and e-mail, Q&A services and small-circle communities have gradually come into view. Without exception, the most important value of these products lies in their vast amount of data, most of which takes the form of short texts. Short texts have become the medium through which ordinary people share information and spread knowledge, and they also shape people's daily lives and communication habits. Building mathematical models of massive short text data is of great guiding significance for analyzing user behavior and habits, improving search engine quality, and helping enterprises place advertisements.

Text clustering is a prerequisite for text analysis and forecasting, because it provides an overall understanding of the textual information. However, short texts have few features, contain many dialect expressions and idioms, show obvious regional characteristics, make frequent use of homophones, and include many new Internet words. These properties lead to high dimensionality, poor performance, and drifting clustering results, so traditional clustering algorithms do not work well on short texts.

With the continuous development of deep learning, researchers have begun to apply deep learning algorithms to Natural Language Processing problems. Word2Vec is a text processing tool based on deep learning and published by Google; it represents text as seemingly meaningless vectors. It is this seemingly irregular vector representation that effectively solves the high-dimensionality problem of the traditional vector space model while preserving the rich co-occurrence information between words. To address the feature sparsity of short texts and improve the quality of short text clustering, this paper proposes a short text clustering model based on Word2Vec. The main work is as follows:

Building on previous work, this paper expounds the importance of short text clustering in the field of text mining, as well as the difficulties in establishing a clustering model and the strategies for coping with them. It analyzes word segmentation and stop-word removal in short text preprocessing, and the influence of sentiment factors on clustering quality during feature selection. The clustering algorithms, distance functions, and performance evaluation measures commonly used in such models are briefly introduced.

The principle of the algorithm underlying Word2Vec is introduced in detail. Word vectors are trained with Word2Vec on a large-scale corpus, and comparison experiments with the traditional VSM are set up to verify the effectiveness of Word2Vec in preserving text semantics and in handling the feature sparsity of short texts.

Because of the special nature of short texts, directly applying traditional text clustering algorithms causes two important problems. First, the contribution of synonyms to the text as a whole cannot be recognized. Second, some emoticons and degree adverbs are removed in the preprocessing phase, so part of the semantic information is lost. This paper therefore introduces part-of-speech analysis and sentiment analysis into short text clustering. The word vector model trained by Word2Vec is combined with a feature weight selection algorithm to improve the text similarity model used in the clustering algorithm. On the basis of fusing part-of-speech and sentiment information to address the poor focus of short texts in the clustering model, the paper proposes applying RWMD (Relaxed Word Mover's Distance) to the similarity model and using this distance as the basis for clustering. Then, to address the selection of the K value in the K-Means clustering algorithm, a model that combines the LDA algorithm with K-Means is proposed.

Finally, the above model was applied to the laboratory's "Supai Smart Logistics Service Platform" project, and experimental verification was performed on the large-scale short text data provided by the service platform. The results show that the method is a clear improvement over the traditional clustering algorithm.
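
The RWMD-based similarity mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `rwmd`, the toy two-dimensional embeddings, and the uniform weights are all assumptions made for illustration. In practice the vectors would come from a trained Word2Vec model, and the weights would come from whatever feature weighting is used (the abstract does not specify its part-of-speech and sentiment weighting scheme). The sketch follows the standard definition of the relaxed distance: each word sends all of its weight to the nearest word in the other text, and the larger of the two one-sided bounds is returned.

```python
import math


def rwmd(vecs_a, weights_a, vecs_b, weights_b):
    """Relaxed Word Mover's Distance between two weighted bags of word vectors.

    vecs_a, vecs_b: lists of equal-length numeric vectors (one per word).
    weights_a, weights_b: non-negative word weights; normalized here to sum to 1.
    """
    total_a = sum(weights_a)
    total_b = sum(weights_b)
    wa = [w / total_a for w in weights_a]
    wb = [w / total_b for w in weights_b]

    # Pairwise Euclidean distances: cost[i][j] = distance(word i of A, word j of B).
    cost = [[math.dist(u, v) for v in vecs_b] for u in vecs_a]

    # One-sided relaxation A -> B: every word in A moves to its nearest word in B.
    bound_ab = sum(w * min(row) for w, row in zip(wa, cost))
    # One-sided relaxation B -> A: every word in B moves to its nearest word in A.
    bound_ba = sum(w * min(col) for w, col in zip(wb, zip(*cost)))

    # The tighter (larger) of the two lower bounds is the RWMD.
    return max(bound_ab, bound_ba)


# Hypothetical toy embeddings; real vectors would come from a Word2Vec model.
toy_embeddings = {
    "truck": [1.0, 0.0],
    "lorry": [0.9, 0.1],
    "freight": [0.0, 1.0],
    "cargo": [0.1, 0.9],
}

text_a = ["truck", "freight"]
text_b = ["lorry", "cargo"]

distance = rwmd(
    [toy_embeddings[w] for w in text_a], [1.0, 1.0],
    [toy_embeddings[w] for w in text_b], [1.0, 1.0],
)
print(distance)
```

The two toy texts share no words, yet their distance is small because each word has a close neighbor in the other text. This is the property that lets a word-vector distance credit synonyms, which a bag-of-words VSM cannot do.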
Keywords/Search Tags: Feature weights, Word vectors, Sentiment analysis, RWMD distance