
Study On The Key Learning Technology In Computer-aided Diagnosis For Medical Image

Posted on: 2015-01-28 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Y Shen | Full Text: PDF
GTID: 1264330428459339 | Subject: Biomedical engineering
Abstract/Summary:
Computer-aided diagnosis (CAD), which uses computer technologies to assist radiologists in their decision-making, can play a key role in the early detection of breast cancer and help to reduce the death rate from female breast cancer. However, it is hard to collect enough clinically labeled cases from radiologists, and the number of positive cases is always much smaller than the number of negative cases, so imbalanced and small-sample learning problems arise naturally in CAD. These two problems concern the performance of learning algorithms in the presence of severe class-distribution skew and of underrepresented data, respectively. Learning from imbalanced and underrepresented data has great significance in the real world. Although machine learning and data mining techniques have shown great success in many applications, imbalanced and small-sample learning remain major challenges for researchers.

In this dissertation, the main causes of the degradation of learning performance when the training dataset is small and highly imbalanced are explained first, and then popular and state-of-the-art solutions to this special learning task are systematically reviewed. Recognizing that common under-sampling methods suffer from a loss of class information, we focus on how to treat the majority class reasonably so as to solve the imbalanced learning problem effectively. Two novel under-sampling methods are proposed to avoid the loss of class information by selecting the most representative samples. In addition, a novel class-labeling algorithm is proposed to address the small-sample learning problem: it expands the training dataset by labeling unlabeled samples automatically while effectively reducing class-labeling mistakes.

In summary, this dissertation studies learning from imbalanced and underrepresented data, focusing on novel resampling schemes for the imbalanced learning problem and a novel class-labeling scheme for expanding the training dataset. The following paragraphs summarize the contributions.

(1) To deal with the learning problem caused by the underrepresented labeled training set in CAD, the proposed scheme enlarges the labeled training set by adding pseudo-labeled samples drawn from the abundant unlabeled samples. Mistakes are common in conventional class-labeling algorithms, however, and falsely labeled samples degrade learning performance in the same way as noise. To avoid such mistakes, a novel hybrid class labeling (HCL) algorithm is proposed. HCL combines three different class-labeling schemes based on geometric similarity, probabilistic distribution, and semantic concept, respectively; because the schemes rest on different principles, their labeling behaviors differ markedly. Only those unlabeled samples that receive unanimous labels from all three schemes are added to the training set. To further reduce the harm caused by any remaining labeling mistakes, the memberships of the pseudo-labeled samples are introduced into the SVM, so that the contribution of each pseudo-labeled sample to the learning task is determined by its membership.
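The abstract does not specify how the consensus step is implemented, but its logic can be illustrated with a minimal sketch. Assuming scikit-learn, the three base learners below (KNeighborsClassifier, GaussianNB, LogisticRegression) are only stand-ins for the dissertation's geometric, probabilistic, and semantic labeling schemes, and the membership weight is a simple average of predicted probabilities rather than the exact membership definition used in the dissertation.

# Minimal sketch of the unanimous pseudo-labeling step (assumes scikit-learn).
# The three base learners are illustrative stand-ins for the geometric,
# probabilistic and semantic labeling schemes described above.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def hybrid_pseudo_label(X_lab, y_lab, X_unlab):
    """Return unanimously pseudo-labeled samples plus a membership weight."""
    labelers = [
        KNeighborsClassifier(n_neighbors=5),   # geometric-similarity stand-in
        GaussianNB(),                          # probabilistic stand-in
        LogisticRegression(max_iter=1000),     # semantic-concept stand-in
    ]
    preds = []
    for clf in labelers:
        clf.fit(X_lab, y_lab)
        preds.append(clf.predict(X_unlab))
    preds = np.vstack(preds)

    # Keep only the unlabeled samples on which all three labelers agree.
    unanimous = np.all(preds == preds[0], axis=0)
    X_sel, pseudo_y = X_unlab[unanimous], preds[0, unanimous]

    # Membership weight: average predicted probability of the agreed class
    # (a simple stand-in for the dissertation's membership definition).
    weights = np.zeros(len(X_sel))
    for clf in labelers:
        proba = clf.predict_proba(X_sel)
        cols = np.searchsorted(clf.classes_, pseudo_y)
        weights += proba[np.arange(len(X_sel)), cols]
    weights /= len(labelers)
    return X_sel, pseudo_y, weights

The pseudo-labeled samples and their weights could then be appended to the labeled set and, as a rough analogue of introducing memberships into the SVM, passed to scikit-learn's SVC through its sample_weight argument.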
Classification experiments on the Breast-cancer dataset from the UCI repository show that the proposed algorithm handles small-sample learning problems effectively, making fewer labeling mistakes and achieving better classification performance than algorithms that adopt a single labeling scheme.

(2) To deal with the loss of class information caused by common under-sampling methods, a novel under-sampling scheme based on the convex hull (CH) is proposed. The convex hull of a dataset is the smallest convex set that contains all of its data points; every point lies inside the convex polygon or polyhedron formed by the hull's vertices. Inspired by this geometric characteristic, we sample the convex hull of the majority class and select its vertices to form a reduced training set that balances the classes (the basic vertex-selection idea is sketched after item (3) below). In real-world applications the data points of the two classes often overlap, so their convex hulls overlap as well; in this situation, representing the training set by its convex hull challenges the learning task and can lead to overfitting and degraded generalization ability. Considering that both the Reduced Convex Hull (RCH) and the Scaled Convex Hull (SCH) still lose class information, a novel structure, the Hierarchy Reduced Convex Hull (HRCH), is proposed. Inspired by the clear diversity and complementarity between RCH and SCH, we mix the two to build the HRCH. Compared with other reduced convex hulls, the HRCH carries more diverse and complementary class information and effectively alleviates the loss of class information during the reduction process. By choosing different reduction and scaling factors, several diverse HRCHs are obtained from the majority class; each HRCH together with the minority class forms a training set, and the learners trained on these sets are integrated into the final classifier. Classification experiments show that the proposed algorithm achieves better and more robust classification performance than the four traditional algorithms it is compared against.

(3) An improved under-sampling algorithm based on reverse k nearest neighbors (RkNN) is further proposed to overcome the loss of class information caused by common under-sampling. In contrast to k nearest neighbors (kNN), RkNN examines the neighborhood globally: the reverse nearest neighbors of a data point depend not only on its surrounding points but also on the other points in the dataset, and a change in the data distribution changes the reverse nearest neighbors of every point. This neighborhood relationship therefore propagates through the dataset, overcoming the shortcoming that kNN considers only the local distribution. The algorithm finds more representative and reliable samples in the majority class by using RkNN to remove noisy and redundant majority samples, thereby balancing the training set while avoiding the loss of majority-class information. Classification experiments on the Breast-cancer dataset from the UCI repository show that the proposed algorithm handles class-imbalanced problems effectively and achieves better classification performance than the kNN-based scheme.
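The full HRCH construction (mixing reduced and scaled convex hulls under several reduction and scaling factors and then integrating the resulting learners) is not detailed in the abstract; the sketch below only illustrates the underlying vertex-selection idea from item (2), assuming SciPy's Qhull wrapper and a low-dimensional feature space. The helper name hull_vertex_undersample is hypothetical.

# Minimal sketch of hull-vertex under-sampling of the majority class
# (assumes scipy.spatial.ConvexHull, practical only in low dimensions;
# it does not implement the RCH/SCH mixing that defines the HRCH).
import numpy as np
from scipy.spatial import ConvexHull

def hull_vertex_undersample(X, y, majority_label):
    """Replace the majority class by the vertices of its convex hull."""
    maj = X[y == majority_label]
    minority_mask = y != majority_label

    hull = ConvexHull(maj)              # Qhull computes the hull facets
    maj_vertices = maj[hull.vertices]   # boundary points defining the hull

    X_bal = np.vstack([maj_vertices, X[minority_mask]])
    y_bal = np.concatenate([
        np.full(len(maj_vertices), majority_label),
        y[minority_mask],
    ])
    return X_bal, y_bal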
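Likewise, the reverse-kNN idea in item (3) can be sketched as follows, assuming scikit-learn's NearestNeighbors. The selection rule used here (keep the majority samples with the most reverse neighbors until the two classes are balanced) is an illustrative simplification, not the dissertation's exact criterion for removing noisy and redundant majority samples.

# Minimal sketch of reverse-kNN-based under-sampling (assumes scikit-learn).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rknn_undersample(X, y, majority_label, k=5):
    """Keep the most representative majority samples so the classes balance."""
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]

    # kNN is computed over the whole dataset; the reverse kNN of a point p
    # is the set of points that have p among their own k nearest neighbors.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, neigh = nn.kneighbors(X)
    neigh = neigh[:, 1:]                # drop each point itself (column 0)

    rknn_count = np.zeros(len(X), dtype=int)
    for row in neigh:
        rknn_count[row] += 1            # one reverse neighbor per appearance

    # Majority samples with many reverse neighbors lie in dense, representative
    # regions; isolated (noisy) samples have few or none.
    order = maj_idx[np.argsort(-rknn_count[maj_idx])]
    keep_maj = order[:len(min_idx)]     # match the minority-class size

    keep = np.concatenate([keep_maj, min_idx])
    return X[keep], y[keep]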
Keywords/Search Tags: computer-aided diagnosis, imbalanced learning, small sample learning, reverse k nearest neighbors, under-sampling, convex hull