| With the rapid development of computer technology and the completion of the Human Genome Project, the rapid diagnosis and treatment of diseases and the development of effective drugs have accelerated. Computer-aided drug design, high-throughput screening, and biochips have gradually been introduced into the new drug development process. Virtual screening based on molecular docking is a typical representative of computer-aided drug design and has been adopted by most companies because of its broad applicability, but it has limitations. In the most widely used docking-based virtual screening, the scoring functions are not accurate enough, so the quality of the lead compounds obtained is low. This is a major obstacle to the development of virtual screening, and the lack of relevant theory makes it difficult to find a highly accurate scoring function. In addition, the data sets generated in virtual screening experiments contain active and inactive compounds in numbers that differ by orders of magnitude, which gives rise to a data imbalance problem. An imbalanced data set biases predictions toward the inactive (majority) class and reduces screening accuracy. Against this background, this paper proposes a virtual screening method based on imbalanced data mining, which introduces classification methods for imbalanced data sets into the processing of virtual screening data and improves the traditional virtual screening technique accordingly.
First, to address the low accuracy of scoring functions in traditional docking-based virtual screening and the resulting mis-scoring and misjudgment of docked conformations, we propose using the TIFP protein-ligand interaction fingerprint to encode the binding mode and interactions between the target protein and the ligand in place of scoring-function evaluation, with the aim of improving accuracy and applicability relative to individual scoring functions. The interaction fingerprint encodes multidimensional data as one-dimensional data, which facilitates subsequent data preprocessing and classification and is expected to greatly improve the accuracy with which docked conformations are assessed. Second, this paper proposes a heuristic oversampling method based on K-Means and SMOTE to process the imbalanced data generated during virtual screening. The processed data have a lower imbalance ratio while largely preserving the information carried by the far less numerous positive samples. Third, at the classification level, a particle swarm optimization (PSO) algorithm iteratively optimizes the penalty parameter and the Gaussian kernel radius of the support vector machine (SVM) classifier in search of the global optimum, and ensemble learning is introduced into the classification. The resulting classifier combines PSO-tuned support vector machines with adaptive boosting (AdaBoost) to screen molecular docking conformations and improve prediction accuracy.
Finally, for experimental validation and analysis, the data set was constructed by selecting crystal structures of representative target proteins from the PDB database and obtaining small-molecule ligands from PubChem and other databases. The experimental results show that the proposed series of processing methods can effectively improve the accuracy of virtual screening and offer practical guidance for new drug development. This study treats virtual screening as an imbalanced data classification problem, which offers clear guidance and provides a reference for the problems faced by virtual screening technology. |
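
The abstract does not spell out the K-Means and SMOTE oversampling heuristic, so the following is only a minimal sketch of the general idea, not the paper's actual procedure. It assumes the minority (active) samples are first clustered with K-Means and that synthetic samples are then created SMOTE-style by interpolating between a sample and one of its nearest neighbours inside the same cluster. The function names `kmeans` and `cluster_smote` and the parameters `k` and `n_neighbors` are hypothetical and chosen for illustration.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return labels


def cluster_smote(minority, n_new, k=2, n_neighbors=3, seed=0):
    """Generate n_new synthetic minority samples (assumed heuristic).

    Interpolation happens only between samples of the same cluster,
    so synthetic points stay inside regions already occupied by
    minority samples.
    """
    rng = random.Random(seed)
    labels = kmeans(minority, k, seed=seed)
    clusters = [[p for p, lab in zip(minority, labels) if lab == c]
                for c in range(k)]
    # A cluster needs at least two samples to interpolate between.
    clusters = [c for c in clusters if len(c) >= 2]
    if not clusters:
        raise ValueError("no cluster has two or more minority samples")
    synthetic = []
    for _ in range(n_new):
        cluster = rng.choice(clusters)
        a = rng.choice(cluster)
        # Candidate partners: the nearest other samples in the cluster.
        others = sorted((p for p in cluster if p is not a),
                        key=lambda p: dist2(a, p))
        b = rng.choice(others[:n_neighbors])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic
```

In a full pipeline the synthetic samples would be appended to the active class before training the classifier. Libraries such as imbalanced-learn provide maintained implementations of related methods and would normally be preferable to a hand-written version.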