Research On Short Text Categorization Based On Semi-Supervised Learning

Posted on: 2016-11-17
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yuan
Full Text: PDF
GTID: 2308330470976873
Subject: Computer technology
Abstract/Summary:
With the rapid development of instant messaging and Internet technology, the amount of information on the Internet grows every day. Short texts are now common in daily life: online news, microblog posts and comments, chat records, mobile phone short messages, abstracts of technical literature, search engine results, and posts and replies in community forums. Such texts are usually limited to roughly 160 words. They come in many styles and are often colloquial or irregular, like everyday language. Their main characteristics are few feature words and weak correlation, yet they may contain valuable information, so effective classification of short texts is necessary. Most traditional text classification methods were designed for and studied on long texts; applying them directly to short texts degrades the classification results. In addition, the labeled samples needed to build a traditional text classifier usually have to be collected and tagged manually, which is time-consuming and laborious and easily becomes a labeling bottleneck. Unlabeled samples, by contrast, are plentiful and relatively easy to acquire and collect. Most traditional classification methods based on supervised learning use only the labeled part of the collected data and ignore the value of the unlabeled samples, so useful hidden information may go unmined.
Semi-supervised learning, however, can combine a small number of labeled samples with a large number of unlabeled samples to train a classifier, so that the unlabeled samples are fully used to improve text classification performance. Such methods have therefore attracted growing attention. In view of the distinctive linguistic characteristics of short texts, this dissertation studies the key technologies of short text classification in depth. Its main contributions are summarized as follows. First, based on a literature review, we summarize the significance of short text classification and the related research in China and abroad. We then describe semi-supervised learning, the characteristics of short texts, the research field of short text classification, several graph-based semi-supervised learning algorithms, and the evaluation metrics for text classification. Second, we propose a limited-constraint selection algorithm. In this algorithm, the similarity matrix is used to compute the probability transition matrix, and the limited scope of the constraints is then computed from the probability transition matrix. Fixing the limited scope of the vertex labels avoids the time and efficiency cost of exhaustive search. Third, we propose a semi-supervised learning algorithm that applies limited constraints to label propagation, so that labels are passed on effectively in combination with label propagation. The algorithm avoids the inaccurate classification caused by a single propagation path and classifies large-scale data better. Finally, the algorithm is shown to be applicable to short texts. We first apply a feature selection approach based on fuzzy entropy to extract features from the short texts.
We then classify the texts using semi-supervised learning with limited-constraint label propagation. The experimental results show that the algorithm achieves better and more robust classification performance and has practical value.
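
The abstract builds on graph-based label propagation, in which a similarity matrix is converted into a probability transition matrix and labels spread from labeled to unlabeled vertices. The following is a minimal sketch of the standard label propagation procedure only, not the limited-constraint variant proposed in the thesis, whose details the abstract does not give. The function names `row_normalize` and `propagate_labels`, the fixed iteration count, and the toy similarity matrix are all assumptions made for illustration.

```python
def row_normalize(similarity):
    """Turn a similarity matrix into a row-stochastic transition matrix."""
    transition = []
    for row in similarity:
        total = sum(row)
        # A vertex with no neighbors keeps an all-zero row.
        transition.append([v / total if total > 0 else 0.0 for v in row])
    return transition


def propagate_labels(similarity, labels, num_classes, iterations=100):
    """Standard label propagation (illustrative, not the thesis algorithm).

    similarity: n x n list of non-negative edge weights.
    labels: length-n list; a class index for labeled vertices, None otherwise.
    Returns the predicted class index for every vertex.
    """
    n = len(similarity)
    transition = row_normalize(similarity)

    def clamp(dist):
        # Labeled vertices are reset to their known one-hot label.
        for i, y in enumerate(labels):
            if y is not None:
                dist[i] = [1.0 if c == y else 0.0 for c in range(num_classes)]

    # Unlabeled vertices start from a uniform distribution over classes.
    dist = [[1.0 / num_classes] * num_classes for _ in range(n)]
    clamp(dist)

    for _ in range(iterations):
        # Each vertex takes the transition-weighted average of its neighbors.
        new_dist = [
            [sum(transition[i][j] * dist[j][c] for j in range(n))
             for c in range(num_classes)]
            for i in range(n)
        ]
        clamp(new_dist)
        dist = new_dist

    return [max(range(num_classes), key=lambda c: row[c]) for row in dist]


# Toy graph (hypothetical data): vertices 0-1 and 2-3 are strongly similar,
# while vertices 1-2 are only weakly similar.
example_similarity = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
example_labels = [0, None, None, 1]
print(propagate_labels(example_similarity, example_labels, num_classes=2))
```

In this toy graph each unlabeled vertex takes the label of the labeled vertex it is most strongly connected to. A limited-constraint variant would presumably restrict which vertices take part in each propagation step, but that restriction is not specified in the abstract and is not modeled here.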
Keywords/Search Tags: Short Text Categorization, Semi-Supervised Learning, Label Propagation, Limited Constraints, Robust