
A Classification Method For PU Problem Based On Data Distribution And Text Similarity

Posted on: 2015-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: H J Hu
Full Text: PDF
GTID: 2268330431958834
Subject: Computer software and theory
Abstract/Summary:
PU learning closely resembles the human learning process. In the real world, unlabeled data far outnumbers labeled data, and not everything can be predicted in advance. As a result, data collected from the real world readily contains unlabeled instances. Under the traditional classification framework, unlabeled data whose category is unknown is falsely assigned to one of the known classes. Semi-supervised learning is an effective way to handle PU learning: with the help of a small amount of labeled data and a large amount of unlabeled data, it can detect unknown categories that do not belong to any predefined class.

However, the existing semi-supervised learning framework ignores the quality and distribution of the unlabeled data. The emergence of big data increases the chance of data duplication, and this redundancy can bias the classifier. Meanwhile, for lack of deep analysis of the unlabeled data, existing methods cannot make good use of the information it contains. In addition, no single method achieves good performance across all data distributions. These issues have a significant effect on the final result.

To solve these problems, we propose a novel framework that develops a general gram filter for redundancy detection and a general method of distribution estimation for unlabeled data. The framework significantly improves performance. With knowledge of the data distribution, different methods can be integrated so that the worst case is avoided. Experiments verify the effectiveness of our approach.

Main contributions of this paper are as follows:
● A novel PU learning framework. Unlike the existing framework, our method introduces quality control and distribution estimation, which significantly enhance the existing framework.
● General gram filter. To implement efficient quality control, this paper develops a general gram filter to remove redundancy. The general gram filter unifies existing filters and achieves better performance.
● Proportion estimate. This paper proposes, for the first time, a general method for distribution estimation. Once the proportion knowledge is obtained, existing methods can be improved, and different PU learning approaches can be integrated.
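
The abstract does not specify how the general gram filter works, so the following is only a generic illustration of gram-based redundancy removal, not the thesis's method. It assumes character n-grams, Jaccard similarity, and a greedy pass over the documents; the function names, the gram length `n=3`, and the `threshold=0.8` cutoff are all hypothetical choices made for this sketch.

```python
from typing import List, Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Text shorter than n: treat the whole text as a single gram (assumption).
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two gram sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_near_duplicates(docs: List[str], n: int = 3, threshold: float = 0.8) -> List[str]:
    """Greedily keep a document only if it is below the similarity threshold
    against every document kept so far (quadratic in the worst case)."""
    kept: List[str] = []
    kept_grams: List[Set[str]] = []
    for doc in docs:
        grams = char_ngrams(doc, n)
        if all(jaccard(grams, g) < threshold for g in kept_grams):
            kept.append(doc)
            kept_grams.append(grams)
    return kept
```

Calling `filter_near_duplicates` on a list in which two entries differ only in letter case and spacing keeps the first of the pair and drops the second, while unrelated entries pass through. The pairwise comparison is acceptable for a small illustration; a large corpus would need an index or hashing scheme, which this sketch does not attempt.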
Keywords/Search Tags: PU Problem, Semi-supervised Learning, Similarity Search, Distribution Estimate