Short Text Classification Based On SVM And Semi-supervised Learning

Posted on: 2018-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: J Xiang
GTID: 2348330518998082
Subject: Computer Science and Technology
Abstract/Summary:
With the arrival of the Internet era, the volume of text data online has grown explosively, and short text has gradually become the dominant form. Given this volume, extracting useful information efficiently has become a new focus in data mining, which calls for an effective short text classification algorithm. However, earlier text classification algorithms such as KNN, SVM, and NB were designed for long texts. Because short texts are real-time, sparse, and irregular, those algorithms cannot be applied to short text classification directly, and an algorithm suited to short text is needed. The main work of this paper is as follows.

First, we propose a semi-supervised preprocessing method based on self-training. Before short texts are classified, the collected data must be preprocessed to remove noise. A classifier is trained on the training set and used to label the unlabeled samples, and this is repeated until all samples are labeled. This method effectively addresses the problem that preprocessing performs poorly when there are too few noise samples.

Second, we propose an automatically selected short text feature extension method based on semi-supervised learning and search engines. Because automatically selected feature extension methods ignore the irregularity of short text, we introduce the iterative training of semi-supervised learning and the large knowledge base of a search engine.

Finally, we propose a short text classification algorithm based on SVM and semi-supervised learning. A standard SVM is not well suited to short text classification, but the proposed algorithm addresses the sparsity and irregularity of short text features and expands the number of labeled samples in the data dictionary.
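The self-training idea mentioned in the abstract (train on the labeled set, label the unlabeled samples, repeat) can be illustrated with a minimal sketch. This is not the thesis's method: a simple bag-of-words centroid classifier stands in for the SVM so that the example needs only the standard library, and the function names, the confidence margin, the `threshold` value, and the sample texts are all assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def fit_centroids(labeled):
    """Sum the term counts per class (a stand-in for training an SVM)."""
    centroids = {}
    for text, label in labeled:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids


def predict(centroids, text):
    """Return (label, confidence), where confidence is the gap between
    the best and second-best class similarity."""
    vec = vectorize(text)
    scores = sorted(
        ((cosine(vec, c), label) for label, c in centroids.items()),
        reverse=True,
    )
    best_score, best_label = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0.0
    return best_label, best_score - runner_up


def self_train(labeled, unlabeled, threshold=0.1, max_rounds=10):
    """Repeatedly retrain and move confidently predicted samples
    from the unlabeled pool into the labeled set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        if not pool:
            break
        model = fit_centroids(labeled)
        scored = [(predict(model, text), text) for text in pool]
        confident = [(text, lab) for (lab, conf), text in scored
                     if conf >= threshold]
        if not confident:
            break  # nothing is confident enough; stop early
        labeled.extend(confident)
        accepted = {text for text, _ in confident}
        pool = [text for text in pool if text not in accepted]
    return fit_centroids(labeled), labeled, pool
```

In this sketch the loop stops either when the pool is empty or when no remaining sample clears the threshold, so some samples may stay unlabeled. A sample that shares no words with the initial labeled set can still be picked up in a later round once newly labeled texts have extended the class vocabulary.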
Then, building on SLAS, the SVM and semi-supervised learning algorithm described above, we propose a further short text classification algorithm, SLAS-C. It incorporates the classification and regression tree (CART), improves the classification model using the Gini index, and addresses the classification efficiency problem of SLAS. The results show that the F1 score of the proposed algorithms improves by 4%-10% and that their efficiency also improves.
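For readers unfamiliar with the Gini index used by classification and regression trees, the following sketch shows how it scores a candidate split. It illustrates the general CART criterion only, not the SLAS-C algorithm itself; the function names and the term-presence feature representation are assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted Gini impurity of a binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


def best_binary_split(rows, labels):
    """Pick the term whose presence/absence split has the lowest
    weighted Gini impurity.

    rows: list of dicts mapping a term to True when it occurs in the text.
    Returns (term, weighted_gini), or None if no term separates the rows.
    """
    terms = sorted({term for row in rows for term in row})
    best = None
    for term in terms:
        left = [y for row, y in zip(rows, labels) if row.get(term)]
        right = [y for row, y in zip(rows, labels) if not row.get(term)]
        if not left or not right:
            continue  # the term does not split the data
        score = split_gini(left, right)
        if best is None or score < best[1]:
            best = (term, score)
    return best
```

A weighted impurity of 0 means both sides of the split are pure, while a term that appears equally often in every class leaves the impurity unchanged and is a poor split candidate.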
Keywords/Search Tags: short text classification, semi-supervised learning, SVM, data mining