Research on compound molecules plays a crucial role in optimizing processes, advancing the development and application of new materials, and improving production efficiency and product quality. However, in practical production, compound molecule data are often imbalanced: some categories contain few molecules while others contain many. This imbalance poses challenges for data analysis and classification tasks. To address this issue, researchers widely employ machine learning algorithms, which have achieved some results, but pressing problems remain. At the feature level, uneven sample distribution can impair the transmission of feature information and weaken a classification algorithm's ability to recognize minority class samples. At the algorithm level, traditional methods tend to favor the more common categories, which degrades classification performance. At the data level, traditional balancing methods struggle to fully capture the features of the data. This paper studies these three key problems in class-imbalanced data classification and comprises the following tasks:

1. To address the feature problem in class-imbalanced data, we propose a feature selection algorithm called TWS-LGBM, based on the LightGBM feature importance measure. The algorithm uses an improved bidirectional search strategy that considers both the correlation between features and the correlation between features and classes, aiming to obtain the optimal feature subset. Specifically, it builds upon the LightGBM feature importance measure and combines a forward search strategy, which uses mutual information to measure the correlation between features, with a backward search strategy, which uses Pearson correlation to measure the correlation between features and classes. In this way, feature information is transmitted effectively, enhancing the classification algorithm's ability to recognize minority class samples.

2. To address the challenges of class-imbalanced data classification at the data and algorithm levels, we propose a classification algorithm called CTGAN_LightGBM, based on oversampling and ensemble learning. At the data level, the algorithm uses CTGAN, a generative adversarial network designed specifically for tabular data, to generate more diverse synthetic samples and thereby balance the data. At the algorithm level, it combines the LightGBM ensemble learning algorithm with an improved grid search algorithm for parameter optimization. This combination mitigates the bias of traditional methods towards common classes, thereby enhancing classification performance.

3. To address all three key problems of class-imbalanced data classification together, we propose a CTGAN_LightGBM classification algorithm based on TWS-LGBM feature engineering. The algorithm is applied to the dihydrofolate reductase (DHFR) inhibitor dataset to validate the effectiveness of the proposed approach.
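As an illustration of the bidirectional search described in task 1, the following is a minimal sketch and not the TWS-LGBM implementation itself. It assumes the importance scores have already been obtained (in the paper they come from LightGBM; here they are a hand-written dictionary), that the features are discrete, and that the two thresholds `mi_max` and `corr_min` are hypothetical parameters chosen only for this example. The function name `two_way_search` is likewise an illustrative name.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant sequence carries no linear signal
    return cov / math.sqrt(vx * vy)


def mutual_information(x, y):
    """Mutual information (in nats) of two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def two_way_search(columns, labels, importance, mi_max=0.5, corr_min=0.1):
    """Forward pass on feature-feature MI, backward pass on feature-class Pearson.

    mi_max and corr_min are assumed thresholds, not values from the paper.
    """
    # Forward: visit features by descending importance and keep a feature
    # only if it is not redundant with any feature already selected.
    ranked = sorted(columns, key=lambda name: importance[name], reverse=True)
    selected = []
    for name in ranked:
        if all(
            mutual_information(columns[name], columns[kept]) < mi_max
            for kept in selected
        ):
            selected.append(name)
    # Backward: drop features that are barely correlated with the class label.
    return [
        name
        for name in selected
        if abs(pearson(columns[name], labels)) >= corr_min
    ]


# Toy imbalanced data: six majority samples (0) and two minority samples (1).
labels = [0, 0, 0, 0, 0, 0, 1, 1]
columns = {
    "f1": [0, 0, 0, 0, 0, 1, 1, 1],  # informative
    "f2": [0, 0, 0, 0, 0, 1, 1, 1],  # exact copy of f1 (redundant)
    "f3": [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to the label
}
importance = {"f1": 0.6, "f2": 0.3, "f3": 0.1}  # stand-in for LightGBM scores

selected = two_way_search(columns, labels, importance)
print(selected)  # prints ['f1']
```

In this toy run the forward pass rejects `f2` because it duplicates `f1`, and the backward pass removes `f3` because its Pearson correlation with the label is zero, leaving only `f1`.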