With the widespread use of computers and handheld mobile devices, the volume of short text is growing rapidly, and how to classify short text automatically, quickly, and effectively has become an urgent problem in the information field. A short text is no more than 160 characters long and is characterized by real-time generation, massive volume, feature sparsity, irregular forms of expression, and uneven sample distribution, all of which prevent traditional text classification algorithms from performing well on it. Targeting the problem of short text classification, this paper studies feature extension and classification algorithms.

Firstly, to address the unbalanced sample distribution of short text, which makes traditional text classifiers hard to apply, we propose two algorithms based on the ensemble idea: Bagging_NB and Bagging_BSJ. Following the main idea of the Bagging algorithm, we train ensembles that use, respectively, the weak classifier NB and a classifier combining NB, SVM, and J48 as the base classifier. This improvement not only raises the generalization ability of a single classifier and avoids over-fitting, but also turns a weak text classifier into a strong one. Experimental results show that the proposed Bagging_BSJ algorithm improves accuracy by 12%, recall by 28%, and F-measure by 20%.

Secondly, to capture the semantic relations between lexical entries, this paper proposes WLA, a semantic similarity calculation method based on Wikipedia text and link information. Inspired by semantic computation methods built on Wikipedia explicit semantic analysis and link information, WLA extracts both the text information and the link information (links-in and links-out) of a Wikipedia topic page, computes a semantic similarity from each, and then combines the two with different weights into the final quantitative semantic model. The proposed WLA algorithm provides a theoretical basis for extending short text features.

Finally, to address the feature sparsity of short text, this paper proposes two short text feature expansion models, both of which use Wikipedia as the external semantic knowledge base. The first is based on preprocessing Wikipedia text information: the feature term vectors derived from Wikipedia text serve as the expansion word vectors of the short text. The second is a semantic extension model based on WLA: it computes the semantic similarity between each topic feature and each element of the Wiki extension vector, and selects the lexical entries with high semantic similarity as the extension table of that topic feature. Experimental results show that, compared with the original unextended short text data, both proposed extension models significantly improve accuracy and recall, with the semantic extension model performing best.

The work in this paper not only compensates, to a certain extent, for the shortcomings of short text such as feature sparsity and semantic deficiency, but also provides a reference method for public opinion analysis, social instant messaging, and other fields, so the proposed methods are valuable in both academic and applied settings.
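The Bagging_NB scheme described above, bootstrap resampling plus majority voting over Naive Bayes base classifiers, can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the tiny multinomial NB, the toy term-count data, and all parameter names are assumptions made for the example.

```python
import math
import random
from collections import Counter

class MultinomialNB:
    """Minimal multinomial Naive Bayes over term-count vectors (illustrative)."""
    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_terms = len(X[0])
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            docs = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(docs) / len(X))
            # Laplace-smoothed per-term counts for class c
            counts = [sum(d[j] for d in docs) + 1 for j in range(n_terms)]
            total = sum(counts)
            self.log_like[c] = [math.log(cnt / total) for cnt in counts]
        return self

    def predict(self, x):
        def score(c):
            return self.log_prior[c] + sum(f * w for f, w in zip(x, self.log_like[c]))
        return max(self.classes, key=score)

def bagging_predict(X_train, y_train, x, n_estimators=15, seed=0):
    """Bagging: train each base NB on a bootstrap sample, majority-vote the labels."""
    rng = random.Random(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        clf = MultinomialNB().fit([X_train[i] for i in idx], [y_train[i] for i in idx])
        votes.append(clf.predict(x))
    return Counter(votes).most_common(1)[0][0]

# Toy term-count data: feature 0 dominates "sport" texts, feature 1 dominates "tech"
X = [[3, 0, 1], [2, 1, 0], [4, 0, 0], [0, 3, 1], [1, 2, 0], [0, 4, 1]]
y = ["sport"] * 3 + ["tech"] * 3
label = bagging_predict(X, y, [3, 0, 0])
```

A Bagging_BSJ variant would follow the same skeleton, with each base learner replaced by a combination of NB, SVM, and J48 instead of NB alone.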
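The WLA idea of combining a text-based similarity and a link-based similarity with different weights can be illustrated with a small sketch. The abstract does not give WLA's exact measures or weights, so the choices below are assumptions: cosine similarity over term-count vectors for the text part, Jaccard similarity over the union of links-in and links-out for the link part, and a linear combination weighted by a hypothetical parameter `alpha`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def jaccard(s, t):
    """Jaccard similarity between two link sets."""
    return len(s & t) / len(s | t) if s | t else 0.0

def wla_similarity(page_a, page_b, alpha=0.6):
    """Weighted combination of text-based and link-based similarity.
    A page is (term_vector, links_in, links_out); alpha weights the text part."""
    text_sim = cosine(page_a[0], page_b[0])
    link_sim = jaccard(page_a[1] | page_a[2], page_b[1] | page_b[2])
    return alpha * text_sim + (1 - alpha) * link_sim

# Hypothetical Wikipedia topic pages: (term vector, links-in, links-out)
page_a = ([1, 2, 0], {"History"}, {"Culture"})
page_b = ([2, 4, 0], {"History"}, {"Geography"})
sim = wla_similarity(page_a, page_b)
```

Tuning `alpha` trades off how much the text content versus the link structure contributes to the final semantic score.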
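The WLA-based extension step, choosing the lexical entries with the highest semantic similarity to each topic feature as its extension table, could look roughly like the sketch below. The helper names, the `top_k`/`threshold` parameters, and the toy character-overlap similarity standing in for WLA are all illustrative assumptions.

```python
def build_extension_table(topic_features, wiki_entries, similarity, top_k=3, threshold=0.2):
    """For each topic feature, keep the wiki entries whose similarity clears the
    threshold and retain the top_k best as that feature's extension words."""
    table = {}
    for feat in topic_features:
        scored = [(entry, similarity(feat, entry)) for entry in wiki_entries]
        scored = [(e, s) for e, s in scored if s >= threshold]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        table[feat] = [e for e, _ in scored[:top_k]]
    return table

def toy_sim(a, b):
    """Character-overlap Jaccard, a cheap stand-in for a real WLA similarity."""
    return len(set(a) & set(b)) / len(set(a) | set(b))

table = build_extension_table(["mobile"], ["mobility", "banana", "mole"], toy_sim)
```

In the thesis's setting, `similarity` would be the WLA score between a topic feature and a Wikipedia lexical entry, and the resulting table supplies the expansion words appended to the sparse short text vector.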