With the rapid development of computer, sensor, and communication technology, the Internet and its global cyber resources have become increasingly important and now pervade everyone's work and life. As the main carrier of information, text data have grown explosively. Text mining, which aims to obtain useful knowledge from huge collections of documents, has therefore become a hot research topic. Text classification and summarization are two important branches of text mining and have been successfully applied in many fields, including network monitoring, spam filtering, and information retrieval.

Existing text classification algorithms require fully labeled data as input. In many real-life applications, however, labeled text is rare or expensive, while unlabeled and positive examples are easily acquired. It is therefore necessary to study learning algorithms that use only positive and unlabeled text examples. This thesis focuses on classifying text data with only positive and unlabeled samples. In addition, we study the problem of multi-document summarization. The main contributions of the dissertation are summarized as follows (illustrative sketches of the core ideas are given after the list):

(1) We studied two-step Positive and Unlabeled (PU) learning algorithms for text classification. To address the problem that, in the first step, existing algorithms cannot extract enough reliable negative examples from the unlabeled set, we proposed a probability density estimation based extraction method and a new PU learning algorithm for text classification. Based on the idea that positive and negative examples should share as few features as possible, our density estimation based method extracts as many reliable positive and negative examples as possible. The extracted negative examples, together with the positive examples in the original dataset, are then used to train classifiers. Experimental results on real-life datasets showed that, compared with state-of-the-art PU learning algorithms, our algorithm improves classification effectiveness, especially when the distribution of the input examples is imbalanced.

(2) We studied a Bayesian learning model based PU learning algorithm for text classification. Existing Bayesian PU learning algorithms require the prior probability of the positive class, which is either given by users or estimated under the assumption that labeled examples are selected completely at random. In this thesis, we proposed an Expectation Maximization (EM) based Bayesian learning algorithm for classifying positive and unlabeled text. The proposed algorithm models the generation of examples as a stochastic process and obtains maximum likelihood estimates of the corresponding parameters with the EM algorithm. The estimated parameters are used to construct classifiers, which then classify unseen text. Our algorithm requires no prior probability of the positive class. Experimental results on real-world datasets showed that the performance of the proposed algorithm is comparable to that of state-of-the-art PU learning algorithms in most cases.
(3) We studied PU learning algorithms for classifying networked data. Handling networked data requires considering not only the inner features of each text but also the linking features between texts. Existing PU learning algorithms for text classification perform well only on datasets with sufficient labeled examples. To solve this problem, we proposed puNet, a method for classifying networked data where only a small number of positive examples and a large number of unlabeled examples are available. Our algorithm extends the non-negative matrix factorization (NMF) approach, a purely unsupervised framework, to the networked setting. By simultaneously factorizing the instance-feature matrix of the nodes and the topological network structure, and encoding the small amount of supervised label information into the learning objective function via a consensus principle, we derived an iterative algorithm for positive and unlabeled learning in the networked setting. Experimental results on benchmark networked datasets validated our algorithm and demonstrated performance gains over state-of-the-art algorithms.

(4) We studied multi-document summarization algorithms. Multi-document summarization aims to produce a concise summary that contains the salient information of a set of source documents. In this field, sentence ranking has hitherto been the issue of most concern. Most existing summarization algorithms use only the inner features of each single sentence and ignore the linking features between sentences. In this thesis, we proposed a novel graph-based ranking approach for multi-document summarization. The approach uses sentence-term and term-term relationships in addition to sentence-sentence ones. Experimental results on the DUC and TAC datasets demonstrated the effectiveness of our approach.
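To make the two-step strategy of contribution (1) concrete, here is a minimal sketch of the generic pattern: extract reliable negatives from the unlabeled pool, then train an ordinary classifier on the positives versus those negatives. It deliberately substitutes a simple centroid-similarity heuristic for the thesis's probability density estimation step; the function name, the `neg_fraction` parameter, and the scikit-learn components are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

def two_step_pu(pos_docs, unlabeled_docs, neg_fraction=0.3):
    """Train a text classifier from positive and unlabeled documents (sketch)."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(pos_docs + unlabeled_docs)
    X_pos, X_unl = X[:len(pos_docs)], X[len(pos_docs):]

    # Step 1: score unlabeled documents by similarity to the positive centroid;
    # the least similar ones are taken as reliable negatives (the thesis uses a
    # probability density estimation based criterion here instead).
    centroid = np.asarray(X_pos.mean(axis=0))
    sims = cosine_similarity(X_unl, centroid).ravel()
    n_neg = max(1, int(neg_fraction * X_unl.shape[0]))
    reliable_neg = X_unl[np.argsort(sims)[:n_neg]]

    # Step 2: train an ordinary classifier on positives vs. reliable negatives.
    X_train = vstack([X_pos, reliable_neg])
    y_train = np.r_[np.ones(X_pos.shape[0]), np.zeros(n_neg)]
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return vec, clf
```

In the two-step strategy the extraction step is the critical part; any standard classifier can be plugged into the second step.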
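Contribution (2) fits a generative Bayesian model by EM instead of relying on a known class prior. The sketch below shows the generic EM-with-multinomial-naive-Bayes pattern for positive and unlabeled text, in which soft posteriors over the unlabeled pool re-enter the M-step as fractional counts; it is not the thesis's exact generative model, and the initialization, fixed iteration count, and names are assumptions.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def em_nb_pu(pos_docs, unlabeled_docs, n_iter=10):
    """EM-style naive Bayes trained from positive and unlabeled documents (sketch)."""
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(pos_docs + unlabeled_docs)
    X_pos, X_unl = X[:len(pos_docs)], X[len(pos_docs):]

    # Initialization: labeled positives are class 1; the unlabeled pool starts
    # out treated as class 0 (posterior probability of being positive = 0).
    posteriors = np.zeros(X_unl.shape[0])
    for _ in range(n_iter):
        # M-step: refit naive Bayes with each unlabeled document counted
        # fractionally in both classes, weighted by its current posterior.
        X_train = vstack([X_pos, X_unl, X_unl])
        y_train = np.r_[np.ones(X_pos.shape[0]),
                        np.ones(X_unl.shape[0]),     # "positive" copies
                        np.zeros(X_unl.shape[0])]    # "negative" copies
        w_train = np.r_[np.ones(X_pos.shape[0]), posteriors, 1.0 - posteriors]
        nb = MultinomialNB().fit(X_train, y_train, sample_weight=w_train)
        # E-step: recompute the posterior P(class = 1 | document).
        posteriors = nb.predict_proba(X_unl)[:, 1]
    return vec, nb
```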
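For contribution (3), the sketch below illustrates only the general shape of the idea: one shared non-negative node representation is fit jointly to the node-feature matrix and the adjacency matrix, while the few known positives are softly anchored to a designated dimension. It uses plain projected gradient descent on an assumed objective; it is not puNet's actual objective, consensus formulation, or iterative update rules, and every parameter name is hypothetical.

```python
import numpy as np

def networked_pu_nmf(X, A, pos_idx, k=2, alpha=1.0, beta=1.0,
                     lr=1e-3, n_iter=500, seed=0):
    """Score nodes (features X, adjacency A) for being positive (sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = rng.random((n, k))          # shared non-negative node factors
    V = rng.random((d, k))          # feature factors
    y = np.zeros((n, k))
    y[pos_idx, 0] = 1.0             # known positives target dimension 0
    mask = np.zeros((n, 1))
    mask[pos_idx] = 1.0             # label-fitting term applies to positives only

    for _ in range(n_iter):
        # Gradients of ||X - U V^T||^2 + alpha*||A - U U^T||^2
        #            + beta*||mask * (U - y)||^2 with respect to U and V.
        R_x = U @ V.T - X
        R_a = U @ U.T - A
        grad_U = 2 * R_x @ V + 4 * alpha * R_a @ U + 2 * beta * mask * (U - y)
        grad_V = 2 * R_x.T @ U
        U = np.maximum(U - lr * grad_U, 0.0)   # project back onto non-negativity
        V = np.maximum(V - lr * grad_V, 0.0)
    return U[:, 0]                  # each node's weight on the "positive" dimension
```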
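Contribution (4) ranks sentences on a graph that uses sentence-term and term-term relationships in addition to sentence-sentence ones. The sketch below is one hedged way to realize such a joint graph: assemble a single affinity matrix over sentence and term nodes and run a PageRank-style power iteration, then read off the sentence scores. The TF-IDF/cosine weighting, damping factor, and node ordering are assumptions, not the thesis's exact ranking model.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(sentences, damping=0.85, n_iter=100):
    """Rank sentences on a joint sentence-term graph; best sentences first (sketch)."""
    tfidf = TfidfVectorizer(stop_words="english")
    S = tfidf.fit_transform(sentences)            # sentence-term weight matrix
    n_sent = S.shape[0]

    sent_sent = cosine_similarity(S)              # sentence-sentence edges
    sent_term = S.toarray()                       # sentence-term edges
    term_term = cosine_similarity(S.T)            # term-term edges

    # One symmetric affinity matrix over sentence nodes followed by term nodes.
    W = np.block([[sent_sent, sent_term],
                  [sent_term.T, term_term]])
    np.fill_diagonal(W, 0.0)
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)   # row-stochastic

    # PageRank-style power iteration over the joint graph.
    n_nodes = W.shape[0]
    r = np.full(n_nodes, 1.0 / n_nodes)
    for _ in range(n_iter):
        r = (1.0 - damping) / n_nodes + damping * (P.T @ r)
    return np.argsort(r[:n_sent])[::-1]           # sentence indices, best first

# Usage: take the top-ranked sentences of a document cluster as its summary,
# e.g. summary = [sentences[i] for i in rank_sentences(sentences)[:3]]
```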