A Classification Method For PU Problem Based On Data Distribution And Text Similarity

Posted on:2015-03-31

Degree:Master

Type:Thesis

Country:China

Candidate:H J Hu

Full Text:PDF

GTID:2268330431958834

Subject:Computer software and theory

Abstract/Summary:

PU learning closely resembles the human learning process. In the real world, unlabeled data far outnumber labeled data, and not everything can be predicted in advance. As a result, data collected from the real world can easily contain unlabeled examples. Under the traditional classification framework, unlabeled data whose category is unknown are wrongly assigned to known classes. Semi-supervised learning is an effective way to handle PU learning: with a small amount of labeled data and a large amount of unlabeled data, it can detect unknown categories that do not belong to any predefined class. However, the existing semi-supervised learning framework ignores the quality and distribution of the unlabeled data. The emergence of big data increases the chance of duplicated data, and this redundancy can bias the classifier. Meanwhile, for lack of deep analysis of the unlabeled data, existing methods cannot make good use of the information it contains. In addition, no single method always performs well across different data distributions. These issues have a significant effect on the final result. To solve these problems, we propose a novel framework that develops a general gram filter for redundancy detection and a general method of distribution estimation for unlabeled data. The novel framework can significantly improve performance. With the help of the data distribution, we can integrate different methods so as to avoid the worst case. The experiments verify the effectiveness of our approach.

Main contributions of this paper are as follows:

● A novel PU learning framework. Unlike existing frameworks, our method introduces quality control and distribution estimation, which significantly enhance the existing framework.
● General gram filter. To implement efficient quality control, this paper develops a general gram filter to remove redundancy. The general gram filter unifies existing filters and achieves better performance.
● Proportion estimate. This paper is the first to propose a general method for distribution estimation. Once the proportion knowledge is obtained, existing methods can be improved, and different PU learning approaches can be integrated.
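
The abstract does not describe how the general gram filter works internally. Purely as an illustration of what gram-based redundancy removal can look like, the sketch below drops near-duplicate texts by comparing character n-gram sets with Jaccard similarity. The function names, the trigram size, and the 0.8 threshold are assumptions chosen for this example and are not taken from the thesis.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased,
    whitespace-normalized string (hypothetical helper)."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Strings shorter than n are treated as a single gram.
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_redundant(docs, n=3, threshold=0.8):
    """Keep a document only if its n-gram set is less than `threshold`
    similar to every document kept so far (assumed threshold, O(k^2) scan)."""
    kept, kept_grams = [], []
    for doc in docs:
        grams = char_ngrams(doc, n)
        if all(jaccard(grams, g) < threshold for g in kept_grams):
            kept.append(doc)
            kept_grams.append(grams)
    return kept


if __name__ == "__main__":
    docs = [
        "PU learning uses positive and unlabeled data.",
        "PU learning uses positive and unlabeled data!",
        "Graph hashing for image retrieval.",
    ]
    for doc in filter_redundant(docs):
        print(doc)
```

The pairwise scan is quadratic in the number of kept documents, so it only suits small collections; a filter meant for large data would need an index over the grams, which this sketch does not attempt.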

Keywords/Search Tags:

PU Problem, Semi-supervised Learning, Similarity Search, Distribution Estimate

Related items

1	Robust Semi-supervised Classification Method Search For Noisy Labels Based On Self-paced Learning
2	Semi-supervised Metric Learning Based Anchor Graph Hashing For Large Scale Image Retrieval
3	Research Of Reliable Semi-supervised Classification
4	Semi-Supervised Learning for Scalable and Robust Visual Search
5	Research On Semi-supervised Clustering And Classification Algorithm
6	Structure Semi-Supervised Learning And Its Application
7	Research On The Application Of Geometric Information In The Semi-supervised Learning
8	Research On Adaptive Selection Of Distance Metric Functions In Semi-Supervised Classification
9	Graph-based Semi-supervised Learning With Adaptive Similarity Estimation
10	Research On The Application Of Semi-supervised Learning In Natural Language Processing