Text classification on imbalanced data: Application to systematic reviews automation

Posted on: 2008-01-03

Degree: M.C.S.

Type: Thesis

University: University of Ottawa (Canada)

Candidate: Ma, Yimin

GTID: 2448390005962749

Subject: Computer Science

Abstract/Summary:

Systematic review is the basic process of evidence-based medicine, and consequently there is an urgent need for tools that assist with, and eventually automate, a large part of this process. In a traditional systematic review system, reviewers or domain experts manually classify articles as relevant or irrelevant through a series of systematic review levels. In our work with TrialStat, we apply text classification techniques to a systematic review system in order to minimize the human effort required to identify relevant articles. In most cases, the relevant articles are a small portion of the Medline corpus. The first essential requirement for this task is achieving high recall on those relevant articles. We also face two technical challenges: handling imbalanced data and reducing the size of the labeled training set.

To address these issues, we first studied the feature selection and sample selection bias caused by skewed data. We then experimented with different feature selection, sample selection, and classification methods to find the ones that handle these problems properly. To minimize the size of the labeled training set, we also experimented with active learning techniques. Active learning selects the most informative instances to be labeled, so that fewer training examples are required while performance is maintained. By using an active learning technique, we saved 86% of the effort required to label the training examples. The best test result was obtained by combining the Modified BNS feature selection method, clustering-based sample selection, and active learning, with Naive Bayes as the classifier. We achieved 100% recall for the minority class with an overall accuracy of 58.43%. With a work saved over sampling (WSS) of 53.4%, we saved about half of the reviewers' workload.
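
The abstract reports recall, accuracy, and WSS but does not state how WSS is computed. The sketch below assumes the definition commonly used in the citation-screening literature, WSS = (TN + FN) / N - (1 - recall), and shows how the three reported figures relate to one another. The function name `screening_metrics` and the confusion-matrix counts are hypothetical; the counts were chosen only so that the percentages match those in the abstract and are not the thesis's actual data.

```python
def screening_metrics(tp, fp, tn, fn):
    """Recall, accuracy, and work saved over sampling (WSS) for a
    relevant-vs-irrelevant screening classifier.

    Assumes WSS = (TN + FN) / N - (1 - recall); the abstract itself
    does not give the formula.
    """
    n = tp + fp + tn + fn
    if n == 0 or tp + fn == 0:
        raise ValueError("need at least one relevant (positive) article")
    recall = tp / (tp + fn)            # share of relevant articles found
    accuracy = (tp + tn) / n           # share of all articles labeled correctly
    # Articles the reviewers can skip, minus the penalty for missed relevant ones.
    wss = (tn + fn) / n - (1.0 - recall)
    return {"recall": recall, "accuracy": accuracy, "wss": wss}


# Hypothetical counts for 10,000 articles (NOT taken from the thesis).
metrics = screening_metrics(tp=503, fp=4157, tn=5340, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Under this assumed definition, 100% recall means no relevant article is missed, so WSS reduces to the fraction of articles correctly discarded as irrelevant. With a small relevant class, many false positives can therefore coexist with a modest overall accuracy and still leave the reviewers about half of the original workload.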

Keywords/Search Tags:

Review, Class, Data, Active learning, Sample selection

Related items

1	Study On Key Technologies Of Active Learning In Division Classification Model
2	Research On Active Learning Method Based On Rough Set Theory
3	Active Sample Selection Algorithm And Its Application In Face Detection
4	Big Data Active Learning Based On Open Source Frameworks
5	The Study And Improvements Of Uncertainty-based Sample Selection
6	Research And Application Of Network Intrusion Detection Technology Based On Active Learning Support Vector Machine
7	Two-class Imbalanced Big Data Classification Based On Data Reduction And Ensemble Learning
8	Research On Online Active Learning For Class-imbalanced Data Stream
9	Research On Soft Sensor Modeling Based On Active Learning
10	Research And Application Of Sample Selection In Machine Learning