
Research On Short Text Classification Based On Ensemble Learning

Posted on: 2017-07-24 | Degree: Master | Type: Thesis
Country: China | Candidate: W S Liu
GTID: 2428330569998792 | Subject: Control Science and Engineering
Abstract/Summary:
The widespread use of social networking services such as Twitter, BBS, SNS, and Weibo, and of instant messaging applications such as MSN, Skype, QQ, and WeChat, has produced a huge volume of short texts. Short texts cover a wide range of topics and contain abundant information that is valuable to governments and commercial organizations. How to extract meaningful information from these short texts has therefore become a research hotspot. Short text classification plays an important role in short text data mining. However, owing to the inherent characteristics of short texts, such as their brevity and feature sparseness, traditional long-text categorization algorithms do not work well on the short text classification problem. This thesis therefore studies short text feature extension methods to tackle the short text classification problem, and uses popular ensemble learning methods to improve short text classification performance. The main work of this thesis includes the following aspects:

1. To address the feature sparseness of short texts, we propose a short text feature extension method based on Wikipedia and Word2Vec. The method first obtains a set of related concepts from the Wikipedia page structure and link information, then uses the Word2Vec tool to measure the correlation between each concept and the topic concept, and finally extends the short text with the resulting set of semantically related concepts. Compared with traditional methods, which measure semantic relevance statistically, this method measures semantic relevance more accurately. Experimental results show that feature extension improves short text classification performance, and that our method performs considerably better than the traditional methods.

2. Drawing on the semantic information in the short texts themselves, we propose a short text feature extension method based on the LDA topic model. The method first obtains the high-frequency word set of each category and uses these word sets as the feature set for training the LDA topic model. It then infers the topic distribution of each short text from the LDA model and takes as extension features the words belonging to the topics with higher probability. This method makes full use of the semantic information in short texts and raises the co-occurrence rate of features. Experimental results show that the LDA-based feature extension method alleviates the feature sparseness of short texts and improves short text classification performance.

3. We propose a random forest model based on multi-source heterogeneous extended features, named MEF-RF. The Wikipedia-based method uses an external knowledge base to enrich short text features, but it also increases the feature dimension and introduces redundant words. The LDA-based method relieves feature sparseness using internal topic information, but the performance of the topic model depends on the training set, which affects the contribution of the extension words and makes classification performance unstable. Moreover, the classification performance of the two feature extension methods differs across categories and corpora. MEF-RF is therefore proposed to exploit the advantages of both the Wikipedia-based and the LDA-based feature extension methods, combined with the ability of ensemble learning to handle high-dimensional, redundant data and instability. Experimental results show that MEF-RF effectively improves the performance and generalization ability of the short text classification model.
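
The expansion step of the LDA-based method in item 2 can be illustrated with a minimal sketch. It assumes a topic model has already been trained and supplies, for each short text, a topic distribution, and for each topic, a ranked word list. The function name expand_short_text, the parameters min_prob and words_per_topic, and all toy values are illustrative assumptions; they are not taken from the thesis.

```python
from typing import Dict, List


def expand_short_text(
    tokens: List[str],
    doc_topics: Dict[int, float],
    topic_words: Dict[int, List[str]],
    min_prob: float = 0.3,
    words_per_topic: int = 2,
) -> List[str]:
    """Append words from the high-probability topics to a short text.

    doc_topics maps topic id -> probability for this text;
    topic_words maps topic id -> words ranked by weight in that topic.
    Both are assumed to come from an already trained topic model.
    """
    expanded = list(tokens)
    seen = set(tokens)
    # Visit topics from most to least probable.
    for topic, prob in sorted(doc_topics.items(), key=lambda kv: -kv[1]):
        if prob < min_prob:
            break  # remaining topics are even less probable
        added = 0
        for word in topic_words.get(topic, []):
            if added == words_per_topic:
                break
            if word not in seen:  # skip words the text already contains
                expanded.append(word)
                seen.add(word)
                added += 1
    return expanded


# Hypothetical toy values standing in for the output of a trained model.
topic_words = {
    0: ["match", "team", "goal", "league"],
    1: ["stock", "market", "shares", "profit"],
}
doc_topics = {0: 0.85, 1: 0.15}
tokens = ["team", "wins", "final"]

expanded = expand_short_text(tokens, doc_topics, topic_words)
print(expanded)
```

In this sketch only topic 0 passes the probability threshold, so the text gains two of its top-ranked words that it does not already contain. A classifier would then be trained on the expanded token lists rather than on the original sparse ones.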
Keywords/Search Tags: Short Text Classification, Feature Extension, LDA Topic Model, Selective Ensemble Learning