
Semi-supervised Learning on Text Data

Posted on: 2013-07-19 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Y Zhu | Full Text: PDF
GTID: 1228330395967918 | Subject: Computer software and theory
Abstract/Summary:
With the rapid development of computing and storage technology, the amount of unorganized text data keeps growing. To obtain useful information from such data, text classification technology is needed to organize it efficiently. Traditional approaches include supervised classification and unsupervised clustering. Supervised classification learns from a large number of labeled examples, but labeling text data at a large scale is time-consuming; meanwhile, the performance of unsupervised clustering suffers from the lack of labeled data. In this setting, semi-supervised learning, which learns from very few labeled examples together with a large number of unlabeled ones, has emerged and attracted wide attention. This dissertation studies text labeling, text representation, and the design of semi-supervised learning models. Our main contributions are as follows:

(1) Since labeling text data is time-consuming, this dissertation discusses how to select text data for labeling and how to label the selected data reasonably. To keep the distribution of the labeled data consistent with the distribution of the original data, a sampling method is proposed that avoids selecting the K nearest neighbors of already labeled data as new data to label; with this method, data located in different regions of the space have more opportunities to be labeled. When labeling the selected data, we exploit the category information carried by the words of the documents: by marking some keywords of a document at the same time as we assign it a label, we can easily collect keywords for every category. Unlabeled documents that match these category keywords can then be given labels, which serve as additional supervised information.

(2) Our investigation finds that most noise words are distributed nearly uniformly across classes. This dissertation therefore proposes a new term weighting method, tf.sdf, which emphasizes terms that are unevenly distributed among the classes and weakens terms that are uniformly distributed; in other words, it reduces the harmful effect of noise words. Furthermore, in order to represent text data when only very few labeled examples are available, this dissertation combines tf.sdf with the base classifier and proposes a new semi-supervised learning framework in which text representation and classification are carried out simultaneously: a reasonable text representation improves the classification, and the classification results in turn make the text representation more appropriate.
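The abstract does not give the exact definition of tf.sdf, so the following Python sketch is only an illustration of the behaviour it describes, not the dissertation's formula. It assumes the "sdf" factor measures how unevenly a term's document frequency is spread across classes (here, the standard deviation of its per-class relative document frequency), so that uniformly distributed noise words receive weights near zero. All function names and the toy corpus are hypothetical.

```python
# Hypothetical tf.sdf-style weighting: term frequency scaled by an "unevenness"
# factor, assumed here to be the standard deviation of the term's relative
# document frequency across classes (not the dissertation's exact formula).
from collections import Counter, defaultdict
import math

def class_doc_freq(docs, labels):
    """Count, for each term, how many documents of each class contain it."""
    cdf = defaultdict(Counter)        # term -> {class: number of docs containing it}
    class_sizes = Counter(labels)     # class -> number of documents
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            cdf[term][label] += 1
    return cdf, class_sizes

def sdf(term, cdf, class_sizes):
    """Std. dev. of the term's relative document frequency over all classes."""
    ratios = [cdf[term][c] / n for c, n in class_sizes.items()]
    mean = sum(ratios) / len(ratios)
    return math.sqrt(sum((r - mean) ** 2 for r in ratios) / len(ratios))

def tf_sdf_vector(tokens, cdf, class_sizes):
    """Weight each term of one document by term frequency times its sdf factor."""
    tf = Counter(tokens)
    return {t: tf[t] * sdf(t, cdf, class_sizes) for t in tf}

# Toy usage on an assumed tokenised, labelled corpus:
docs = [["nmf", "matrix", "factorization"], ["matrix", "sparse"], ["soccer", "match"]]
labels = ["math", "math", "sport"]
cdf, sizes = class_doc_freq(docs, labels)
print(tf_sdf_vector(["matrix", "matrix", "soccer"], cdf, sizes))
```

A term that appears with the same relative frequency in every class gets an sdf factor of zero, which is the "weakening" behaviour the abstract attributes to tf.sdf for noise words.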
(3) Considering that different kinds of pairwise constraints play different roles in nonnegative matrix factorization (NMF), this dissertation proposes a constrained NMF method in which must-link constraints control the distance between data points in the compressed representation, while cannot-link constraints act on the encoding factors. Experimental results on real-world text data sets show the good performance of the proposed method.

(4) To broaden the applicability of NMF, this dissertation proposes a novel NMF method based on the similarity matrix. The method uses prior knowledge in the form of pairwise constraints to guide the decomposition, and theoretical analysis proves its convergence. Since similarity matrix factorization has wider applicability than traditional nonnegative matrix factorization, we evaluate the proposed method on general UCI data sets, text data sets, and social network data sets. Experimental results indicate that the proposed method is effective.
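The dissertation's exact objective functions and update rules are not stated in this abstract, so the sketch below is only a generic stand-in for the kind of pairwise-constrained NMF described in points (3) and (4): a Frobenius reconstruction term is augmented with a penalty that pulls the encodings of must-link pairs together and pushes the encodings of cannot-link pairs toward orthogonality, optimized by projected gradient descent rather than the multiplicative updates a published method would more likely use. All names, parameters, and the toy data are assumptions for illustration.

```python
# Minimal sketch of pairwise-constrained NMF (assumed formulation, not the
# dissertation's): minimize ||X - W H||_F^2
#   + alpha * sum_{(i,j) in must_link}  ||h_i - h_j||^2
#   + beta  * sum_{(i,j) in cannot_link} h_i . h_j
# subject to W, H >= 0, where h_i is the encoding (column i of H).
import numpy as np

def constrained_nmf(X, k, must_link, cannot_link,
                    alpha=1.0, beta=1.0, lr=1e-3, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))              # basis vectors
    H = rng.random((k, n))              # column h_i encodes document i

    for _ in range(iters):
        R = W @ H - X                   # reconstruction residual
        grad_W = R @ H.T
        grad_H = W.T @ R
        for i, j in must_link:          # pull must-link encodings together
            d = H[:, i] - H[:, j]
            grad_H[:, i] += alpha * d
            grad_H[:, j] -= alpha * d
        for i, j in cannot_link:        # discourage overlap of cannot-link encodings
            grad_H[:, i] += beta * H[:, j]
            grad_H[:, j] += beta * H[:, i]
        W = np.maximum(W - lr * grad_W, 0.0)   # projected gradient step keeps W >= 0
        H = np.maximum(H - lr * grad_H, 0.0)   # and H >= 0
    return W, H

# Toy usage: 6 documents over 5 terms; documents 0 and 1 should share a cluster,
# documents 0 and 5 should not.
X = np.random.default_rng(1).random((5, 6))
W, H = constrained_nmf(X, k=2, must_link=[(0, 1)], cannot_link=[(0, 5)])
print(np.round(H, 2))
```

In a full implementation the fixed step size would be tuned or replaced by multiplicative update rules, which preserve nonnegativity without explicit projection.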
Keywords/Search Tags: Semi-supervised learning, Text labeling, K nearest neighbors, Vector space model, Term weighting, Nonnegative matrix factorization, Pairwise constraints, Multi-type penalties, Semi-supervised clustering