Research And Application Of Class Imbalanced Data Classification Problem

Posted on:2024-02-06

Degree:Master

Type:Thesis

Country:China

Candidate:J T Wang

GTID:2568307121497944

Subject:Materials and Chemical Engineering (Professional Degree)

Abstract/Summary:

Research on compound molecules plays a crucial role in optimizing processes, advancing the development and application of new materials, and improving production efficiency and product quality. In practical production, however, compound molecule data are often imbalanced: some categories contain few molecules while others contain many. This imbalance poses challenges for data analysis and classification tasks. Machine learning algorithms are widely used to address the issue and have achieved some results, but pressing problems remain. At the feature level, uneven sample distribution can impair the transmission of feature information and a classification algorithm's ability to recognize minority class samples. At the algorithm level, traditional methods tend to favor the more common categories, which degrades classification performance. At the data level, traditional balancing methods struggle to fully capture the features of the data. This thesis studies these three key problems in class-imbalanced data classification through the following tasks:

1. To address the feature problem in class-imbalanced data, we propose TWS-LGBM, a feature selection algorithm based on the LightGBM feature importance measure. The algorithm uses an improved bidirectional search strategy that considers both the correlation between features and the correlation between features and classes, aiming to obtain the optimal feature subset. Specifically, it builds on the LightGBM feature importance measure and combines a forward search strategy, which uses mutual information to measure the correlation between features, with a backward search strategy, which uses Pearson correlation to measure the correlation between features and classes. In this way, feature information is transmitted effectively, improving the classification algorithm's ability to recognize minority class samples.

2. To address the challenges of class-imbalanced data classification at the data and algorithm levels, we propose CTGAN_LightGBM, a classification algorithm based on oversampling and ensemble learning. At the data level, it uses CTGAN, a generative adversarial network designed for tabular data, to generate more diverse synthetic samples and thereby balance the data. At the algorithm level, it combines the LightGBM ensemble learning algorithm with an improved grid search algorithm for parameter optimization. This combination mitigates the bias of traditional methods towards common classes, thereby enhancing classification performance.

3. To address all three key problems together, we propose a CTGAN_LightGBM classification algorithm based on TWS-LGBM feature engineering. The algorithm is applied to the dihydrofolate reductase (DHFR) inhibitor dataset to validate the effectiveness of the proposed approach.
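
The bidirectional search outlined in task 1 can be illustrated with a minimal Python sketch. It is not the thesis's TWS-LGBM implementation: the function name `two_way_search`, the thresholds `mi_max` and `corr_min`, and the toy data are all hypothetical. The importance scores are assumed to come from an already trained LightGBM model and are passed in as a plain dictionary. Features are assumed to be discrete, so mutual information can be computed from counts.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def two_way_search(columns, labels, importance, mi_max=0.3, corr_min=0.1):
    """Hypothetical two-pass selection over importance-ranked features.

    Forward pass: visit features from most to least important and keep one
    only if its mutual information with every kept feature is below mi_max.
    Backward pass: drop kept features whose absolute Pearson correlation
    with the class labels is below corr_min.
    """
    ranked = sorted(columns, key=lambda name: importance[name], reverse=True)

    selected = []
    for name in ranked:
        redundant = any(
            mutual_information(columns[name], columns[kept]) >= mi_max
            for kept in selected
        )
        if not redundant:
            selected.append(name)

    for name in reversed(list(selected)):
        if abs(pearson(columns[name], labels)) < corr_min:
            selected.remove(name)
    return selected


# Toy imbalanced data: six majority-class rows and two minority-class rows.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
columns = {
    "a":      [0, 0, 0, 0, 0, 1, 1, 1],
    "a_copy": [0, 0, 0, 0, 0, 1, 1, 1],  # duplicate of "a"
    "b":      [0, 0, 0, 1, 1, 0, 1, 1],
    "noise":  [0, 1, 0, 1, 0, 1, 0, 1],  # uncorrelated with the labels
}
# Stand-in for importance scores exported from a trained LightGBM model.
importance = {"a": 40, "a_copy": 35, "b": 20, "noise": 5}

selected = two_way_search(columns, labels, importance)
print(selected)  # ['a', 'b']
```

On this toy data the forward pass rejects `a_copy` as redundant with `a`, and the backward pass removes `noise` because it carries no linear signal about the class. Continuous features would need binning before the mutual information step, and the thresholds would need tuning on real data.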

Keywords/Search Tags:

Feature engineering algorithm, Oversampling algorithm, CTGAN, TWS-LGBM feature selection, LightGBM

Related items

1	Research On Online Consumers’ Purchasing Intention Based On LightGBM Algorithm
2	Research On Personal Credit Evaluation Model Based On LightGBM Algorithm
3	Improved Methods Of Oversampling And Feature Selection Based On Imbalanced Data
4	Research On New Feature Selection Algorithm
5	A Study On Feature Selection Algorithms Based On Support Vector Machine And Its Application
6	Research On Automated Feature Engineering Algorithm And System For Structured Data
7	Research On E-commerce Purchase Behavior Prediction Based On Feature Selection And Stacking Integrated Algorithm
8	Research On Feature Engineering And Model Generalization Of Credit Default Prediction
9	Automatic Multi Table Expansion Algorithm Based On Directed Graph
10	The Applied Research Of The Filter And Wrapper Feature Selection