
Research And Realization Of Chinese Short Text Classification Based On Machine Learning

Posted on: 2017-03-10
Degree: Master
Type: Thesis
Country: China
Candidate: X Huang
Full Text: PDF
GTID: 2278330485495692
Subject: Computer technology
Abstract/Summary:
With the rapid development of Web 2.0 technologies and social media, the number of Internet users continues to grow. In recent years, Chinese short texts such as personal microblog posts and online customer reviews have entered a period of explosive growth and have become an important channel for the fast and efficient dissemination of information in society. However, this abundance of information also brings inconvenience to daily life: users waste a great deal of time searching through cluttered, complex information. Text categorization helps to solve this problem by organizing the cluttered information according to users' needs so that they can quickly locate what they want. This thesis implements the automatic classification of Internet short texts. A short text classification system sorts mixed raw material quickly, laying a foundation for follow-up research on Internet short texts and on opinion mining. Based on the characteristics of Chinese short texts, the thesis explores how different text representations and classification methods within a machine learning framework affect short text classification. The main research content consists of the following two parts:

(1) Chinese short text classification based on the traditional bag-of-words model. Within a support vector machine framework, texts are represented with the bag-of-words model, using classical feature selection and term weighting methods. The thesis explores how different feature selection methods affect classification, namely document frequency, information gain, the chi-square statistic, and mutual information, and compares the results with a classification model based on LDA. Experimental results show that the traditional bag-of-words model performs only moderately, with chi-square feature selection giving the most prominent results among the methods tested, while the LDA-based short text classification model improves the results considerably.

(2) Chinese short text classification based on word embedding. To address the feature sparsity of short texts, the thesis further explores representing text with word embeddings. On this basis, it compares the effect of three sentence vector fusion methods on classification results: sentence embedding fusion based on pooling of word embeddings, sentence embedding fusion based on the PV-DM model, and sentence embedding fusion based on concatenation of word embeddings. Experimental results show that short text classification based on word embedding representations outperforms the traditional bag-of-words short text classification model. Among the three fusion methods, the one based on the PV-DM model is the most prominent and achieves good results.
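As an illustration of the chi-square feature selection named in part (1), the following is a minimal sketch and not the thesis's implementation. It assumes the texts have already been segmented into word tokens (Chinese text would need a word segmenter first), uses the standard 2x2 contingency form of the chi-square statistic, and scores each term by its maximum over classes. The function names `chi_square` and `select_features` and the toy corpus are hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for one (term, class) 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k terms with the highest max-over-classes chi-square score."""
    n_docs = len(docs)
    class_sizes = Counter(labels)
    doc_freq = Counter()        # term -> number of documents containing it
    class_doc_freq = Counter()  # (term, class) -> number of documents
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            class_doc_freq[(term, label)] += 1

    scores = {}
    for term, df in doc_freq.items():
        best = 0.0
        for cls, size in class_sizes.items():
            a = class_doc_freq[(term, cls)]
            b = df - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    # Highest score first; ties are broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Toy corpus (hypothetical); ASCII tokens stand in for segmented Chinese words.
docs = [
    ["phone", "screen", "good"],
    ["phone", "battery", "bad"],
    ["match", "goal", "good"],
    ["match", "team", "bad"],
]
labels = ["tech", "tech", "sports", "sports"]
top_terms = select_features(docs, labels, 2)
```

In a pipeline like the one the abstract describes, the selected terms would then define the bag-of-words vocabulary whose weighted vectors are passed to the support vector machine.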
Keywords/Search Tags: Short text classification, Machine learning, Bag of words model, Latent Dirichlet Allocation, Word embedding
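The pooling-based sentence vector fusion mentioned in part (2) of the abstract can be sketched as follows. The abstract does not say which pooling operator is used, so mean and max pooling are both shown as assumptions. The embedding table, the function names, and the zero-vector fallback for sentences with no known words are hypothetical choices made for illustration.

```python
def mean_pool(vectors):
    """Average the word vectors dimension by dimension."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def max_pool(vectors):
    """Take the maximum of the word vectors dimension by dimension."""
    return [max(column) for column in zip(*vectors)]


def sentence_vector(tokens, embeddings, pool=mean_pool):
    """Fuse the embeddings of the known tokens into one fixed-length vector.

    Tokens missing from the embedding table are skipped. If no token is
    known, a zero vector of the embedding dimension is returned (an
    assumption of this sketch).
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        dim = len(next(iter(embeddings.values())))
        return [0.0] * dim
    return pool(vectors)


# Toy 2-dimensional embedding table (hypothetical values).
embeddings = {
    "phone": [1.0, 0.0],
    "good": [0.0, 1.0],
    "screen": [1.0, 1.0],
}
mean_vec = sentence_vector(["phone", "good", "unknown"], embeddings)
max_vec = sentence_vector(["phone", "good", "unknown"], embeddings, pool=max_pool)
```

Either pooled vector has the same length regardless of sentence length, which is what allows it to be fed to a standard classifier.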