
Research And Development Of PU Text Classification Based On Semantic Feature Selection

Posted on: 2008-09-24
Degree: Master
Type: Thesis
Country: China
Candidate: X Tan
Full Text: PDF
GTID: 2178360212497009
Subject: Computer software and theory

Abstract/Summary:
With the rapid development of Internet information technology and the explosion of available text resources, the management and classification of large volumes of information have become pressing problems. Text classification is the process of assigning each document in a collection to a category according to a predefined classifier, so that users can browse documents conveniently and narrow their search to a limited scope. Text classification is the foundation of effective retrieval and a core data-organization technology for large document databases; accurate classification can greatly improve both the speed and the precision of retrieval. Automatic text classification saves considerable manpower and resources and avoids the drawbacks of manual classification, which is slow, costly, and inefficient. Research on automatic text classification is therefore significant, and many researchers have worked in this field.

Since the 1990s, many classification algorithms based on machine learning have been proposed, such as k-NN (k-nearest neighbors), SVM (support vector machines), and neural networks, and their validity has been demonstrated by a number of researchers. Comparative experiments have also been conducted on fourteen classification algorithms, including k-NN, decision trees, Naive Bayes, and neural networks; the results indicate that the precision of some algorithms, such as k-NN, is satisfactory. However, these algorithms share a common problem: constructing the classifier requires a large number of manually labeled training examples. Positive examples are easier to obtain than negative ones, since the collected texts are the documents users are interested in, while the negative set must cover all non-positive categories. Moreover, the negative examples we find are biased by our subjective choices, which degrades the performance of the classifier.
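To make the supervised setting concrete, the following is a minimal toy sketch of k-NN text classification with bag-of-words vectors and cosine similarity; the data and helper names are illustrative, not from the thesis.

```python
from collections import Counter
import math

def bow(text):
    # Bag-of-words vector as a term -> count mapping.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(doc, labeled_docs, k=3):
    # labeled_docs: list of (text, label); vote among the k nearest.
    scored = sorted(labeled_docs,
                    key=lambda dl: cosine(bow(doc), bow(dl[0])),
                    reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

train = [("stock market shares trading", "finance"),
         ("bank loan interest rates", "finance"),
         ("football match goal score", "sport"),
         ("tennis player wins tournament", "sport")]
print(knn_classify("interest rates and the stock market", train))  # finance
```

Note that every item in `train` had to be labeled by hand; at realistic scale this labeling cost is exactly the problem that motivates the PU setting below.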
Researchers therefore proposed building a classifier from a few positive examples and many unlabeled examples, which is known as the PU (positive and unlabeled) learning problem. A considerable body of mature PU classification technology now exists. Bing Liu and his team carried out extensive experiments comparing algorithms based on the two-step framework, showing that PU classifiers can achieve good results. Because the unlabeled examples carry no class labels, standard feature selection methods such as IG (information gain) and MI (mutual information) cannot be used for PU classification; existing PU classifiers almost all rely on document frequency for feature selection.

In summary, studying PU classification is worthwhile, and improving feature selection is an important way to improve the performance of existing PU classifiers. The contribution of this thesis is a new feature selection method for the PU problem, intended to improve the effectiveness of PU classification. The proposed method draws on techniques from natural language processing and ontologies.

With the development of natural language processing, text mining based on semantic information has become a focus of text classification, text retrieval, and related fields. Many scholars have proposed concept-based methods for text processing. Unlike traditional methods based on word-frequency statistics, these methods represent document features by the meanings of words. The key issue for such content-based methods is the construction and representation of a concept structure; common structures include conceptual taxonomies, formal or domain ontologies, semantic linguistic networks of concepts, and thesauri. Their shortcoming, however, is that they are complicated to build. The structure used in this thesis is WordNet, an online semantic lexicon and a linguistic ontology in which words and phrases are organized into synonym sets.
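The document frequency baseline mentioned above needs no class labels, which is why it fits the PU setting. A minimal sketch of DF-based feature selection (the function name and tie-breaking rule are my own choices, not from the thesis):

```python
from collections import Counter

def select_by_df(docs, top_n):
    # Document frequency: the number of documents containing each term.
    # No class labels are needed, unlike IG or MI.
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    # Rank by document frequency, breaking ties alphabetically.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return ranked[:top_n]

docs = ["the cat sat", "the dog barked", "a cat and a dog"]
print(select_by_df(docs, 3))  # ['cat', 'dog', 'the']
```

IG and MI, by contrast, score a term by how its occurrence correlates with a class label, so they are unavailable when most of the corpus is unlabeled.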
Every synonym set has a unique ID that can be regarded as a concept, and many sets carry a short gloss that clarifies their meaning. WordNet has already been applied to text classification, focused crawling, and related fields.

Using WordNet, this thesis proposes a PU classification method based on semantic feature selection to improve the performance of existing PU classifiers. The second and third chapters present the theoretical background and the basic technology needed to realize the algorithm. First, we introduce the fundamentals of text classification, including text representation, feature selection, and classifier evaluation. The experiments in this thesis use the vector space model (VSM) to represent documents, adopt ontology-based semantic feature selection, and evaluate with the F1 measure, which combines precision and recall. Second, we discuss the two-step framework of PU classification and the two algorithms used in our experiments, One-Class SVM and PEBL; our feature selection method is applied to both and compared with the original versions.

The fourth chapter presents our method: semantic feature selection for PU classification. The algorithm scans the documents twice. In the first pass, we obtain the semantic meanings of the documents with WordNet, i.e., we find the synsets that recur across the documents. In the second pass, we filter out terms that do not belong to these synsets, then reduce the dimensionality and build the text vectors. In PEBL, we compute the vectors of the documents in the positive set with semantic feature selection and build a candidate positive feature set ranked by TF-IDF; this set, together with statistical information, helps identify strong positive features for filtering out reliable negative examples.
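The two-pass selection described above can be sketched as follows. This is a simplified illustration, not the thesis implementation: a small hand-made dictionary stands in for WordNet's word-to-synset lookup, and the recurrence threshold is an assumed parameter.

```python
from collections import Counter

# Toy stand-in for WordNet (assumption): maps a word to the unique ID
# of its synonym set, as the thesis does with WordNet synsets.
SYNSETS = {"car": "s1", "automobile": "s1", "auto": "s1",
           "engine": "s2", "motor": "s2",
           "sky": "s3"}

def semantic_feature_selection(docs, min_synset_count=2):
    # Pass 1: count how often each synset occurs across the corpus.
    synset_counts = Counter()
    for doc in docs:
        for term in doc.lower().split():
            if term in SYNSETS:
                synset_counts[SYNSETS[term]] += 1
    repeated = {s for s, c in synset_counts.items()
                if c >= min_synset_count}
    # Pass 2: keep only terms whose synset recurs; filter out the rest.
    return [[t for t in doc.lower().split()
             if SYNSETS.get(t) in repeated]
            for doc in docs]

docs = ["car engine repair", "automobile motor oil", "sky photo"]
print(semantic_feature_selection(docs))
# [['car', 'engine'], ['automobile', 'motor'], []]
```

Note how "car" and "automobile" survive even though neither word is frequent on its own: their shared synset is what recurs, which is the point of selecting features at the concept level rather than the word level.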
These strong positive features are terms that appear in the positive feature set and have high frequency in the positive set.

Finally, we implemented two PU classifiers based on semantic feature selection following the algorithm of the fourth chapter. Compared with the document frequency method, our algorithm increases the F1 of the One-Class SVM classifier by 10.183% in the case with fewer positive examples and by 1.941% in the case with more positive examples, and increases the F1 of the PEBL classifier by 0.7389%.

In conclusion, this thesis studied the PU problem with semantic feature selection. Our experiments support three conclusions:
1. When positive examples are few, semantic feature selection can improve classifier performance greatly.
2. In the PU problem, semantic feature selection can identify representative positive features; these features have strong discriminative power and improve the final classifier to a certain extent.
3. Semantic feature selection reduces the efficiency of the classifier. In future work, we will improve the efficiency and performance of our algorithm on large training sets.
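The F1 gains reported above combine precision and recall into a single score; a minimal computation of the standard F1 measure from a confusion-matrix count (the example numbers are illustrative):

```python
def f1_score(tp, fp, fn):
    # Precision: fraction of predicted positives that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: fraction of actual positives that are found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=80, fp=20, fn=20))  # precision 0.8, recall 0.8 -> F1 0.8
```

Because F1 is a harmonic mean, it penalizes classifiers that trade one of the two measures away, which is why it suits PU classifiers that may over- or under-predict the positive class.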
Keywords/Search Tags: Classification