| With the rapid development of data collection and storage technologies, researchers today have access to vast amounts of complex, high-dimensional data in which the relationships among data points have become increasingly intricate. Against this backdrop, feature engineering has grown in importance. By selecting, transforming, and constructing features from raw data, feature engineering improves the predictive performance and generalization ability of machine learning models and allows the data to be used efficiently. Feature construction algorithms play an important role in feature engineering. They extract and construct features from raw data so that the features carry richer information and greater discriminative power, thereby strengthening the predictive capabilities of machine learning models. Feature construction is often combined with feature selection to ensure that the chosen feature subset achieves optimal model performance. Nevertheless, with high-dimensional data, indiscriminate feature construction can lead to the curse of dimensionality. Consequently, integrating domain-specific expertise becomes indispensable. Prior knowledge can guide the direction of feature construction and confine the scope of feature generation. This approach not only taps into the underlying structure of the data but also enhances the model’s robustness and reliability. In addition, changes in feature correlations across different categories deserve attention. Feature construction algorithms can quantitatively measure these differences in feature correlations, furnishing valuable insights for downstream predictive tasks in machine learning.

The rapid development of both computer science and biology has made bioinformatics a multidisciplinary, cutting-edge research domain. Bioinformatics 
involves the extensive analysis of biological data, such as genomes, transcriptomes, and proteomes. Moreover, molecules within living organisms are intricately interconnected, and physiological or pathological changes often result from the synergistic actions of multiple molecules. Omics data are therefore high-dimensional and carry complex associations. In this context, feature construction algorithms can help translate complex biological data into more interpretable features. These features can aid in identifying potential biomarkers, which are invaluable for disease diagnosis and for predicting treatment response. Therefore, this paper takes a perspective that integrates knowledge from bioinformatics. It regards genes as features and, by quantifying the differences among gene associations, extracts essential information from the data to assist downstream predictive tasks. Additionally, it generalizes the idea of capturing distinct inter-group feature associations into feature construction algorithms. The research work of the paper can be summarized as follows:

1. Feature Construction Algorithm Based on Regulatory Differences

In current differential regulation analysis, the synergistic correlation information of multiple genes is inadequately utilized during the prediction stage, and non-differentially expressed genes receive insufficient attention. To overcome these issues, this paper proposes a model-based quantitative transcription regulation description model (mqTrans). First, a regression model is fitted on reference samples to model the regulatory relationships between transcription factors and target genes. Then, for samples of other phenotypes, the discordance distance of each sample’s regulatory relationship is quantified and defined as the feature constructed by mqTrans. In the prediction stage, mqTrans features are used for differential regulation analysis of whether lymph nodes undergo distant metastasis. Experimental results demonstrate that the algorithm 
detects dark biomarkers with statistically significant regulatory relationships, even when their original expression values are not differentially expressed. Furthermore, gender-specific models are built for a colorectal cancer dataset, and survival analysis results indicate that the mqTrans features are associated with survival and exhibit gender specificity. Finally, by treating core features as analogous to regulatory factors and minor features as analogous to regulated target genes, a feature construction framework is designed for high-dimensional, imbalanced, small-sample data. This algorithm is compared with 6 algorithms on 15 datasets, and the experimental results show that the constructed features can improve model predictive performance in terms of AUC and G-mean.

2. Feature Construction Algorithms Based on Class-Specific Subspace Specificity

Correlations often exist between features, and existing feature selection or extraction methods primarily consider pairwise feature correlations using distance metrics or information entropy. However, while reducing feature redundancy, these methods tend to overlook the information provided by the correlations among multiple features. Therefore, this paper proposes a feature construction algorithm based on class-specific subspace specificity, FCS3. First, a regularization-based self-representation method is employed to uncover the relationships between features. Features whose representations differ between classes are chosen as seed features, and the remaining features are grouped according to their correlation strength with the seeds, with each group representing a subspace. Principal Component Analysis (PCA) is then employed to obtain an orthogonal transformation matrix for each class’s subspace. Finally, the original features are orthogonally transformed within each class’s subspace, mapping the raw data into feature spaces that better capture class characteristics. This process is complemented by Fisher’s feature selection method to select an optimal subset of features with superior 
classification performance. Experimental results on 8 publicly available datasets demonstrate that the proposed algorithm outperforms 6 benchmark algorithms in terms of classification performance.

3. Algorithm for Cancer Stage Biomarker Detection Based on Healthy Controls

Building on the evolutionary relationship between healthy and cancer states, in which molecular dysregulation leads to pathological or physiological changes in organisms, we extend the mqTrans model. In this experiment, we first simulate the processing differences of data from the same sample on different platforms. Through data augmentation, we obtain 929 samples of healthy blood tissue on the same platform. Next, we use GRU networks to learn the transcriptional regulatory relationships in these samples, which serve as the feature representation of the healthy state. Finally, we quantify the dysregulated relationships in early-stage and late-stage cancer phenotypes relative to the healthy state to construct features, obtaining biomarkers that represent regulatory changes. The experiment validates the effectiveness of data augmentation in improving the performance of the regression model. Additionally, comparative experiments were conducted on colon and gastric cancer datasets from TCGA. The results demonstrate that the constructed features can enhance predictive performance in terms of both AUC and accuracy. Moreover, several biomarkers were found to contribute significantly to the improvement in predictive performance, offering guidance for future wet-lab experiments.

4. Multi-Task Regulation Differential Representation Algorithm Applied to Survival Prediction

When disease prediction models are built on gene correlation networks, existing methods often train separate models for different tasks, which leads to overfitting. Furthermore, no pre-trained model is available for downstream predictive tasks. This algorithm introduces a health pre-training model and a 
multi-task survival prediction model based on dysregulation quantitative description (DQSurv) to address these challenges. To begin, the algorithm employs a graph convolutional model to train a regulatory model between transcription factors and target genes using healthy samples from the GTEx database. Subsequently, the attention weights of the self-attention network layer in the health model are transferred to the predictive task on cancer samples. This transfer captures differences in hidden-layer features during network training and facilitates the learning of gene correlations. Finally, the prediction of target gene expression in cancer samples is employed as an auxiliary task to assist the main survival prediction task. The experiment demonstrates the supportive role of long non-coding RNAs as regulatory factors and underscores that constructing features from the differences between two data domains is effective when an evolutionary relationship holds between the source and target domains. Compared against 7 survival prediction algorithms and 6 gene expression prediction algorithms across 10 datasets, the proposed algorithm shows superior performance in both tasks.

In summary, this paper starts from the changes in the correlation between features, introduces background knowledge from bioinformatics, and proposes a series of algorithms that quantify differential regulatory relationships under different phenotypes to enhance the detection of dark biomarkers and cancer prediction performance. The effectiveness of these algorithms is validated through experiments. Furthermore, this approach is extended to feature construction algorithms for structured data, and the advantages of the algorithms are demonstrated on public datasets. |
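
As a rough illustration of the regulation-difference idea behind mqTrans described in item 1, the following Python sketch fits a regression from transcription factor expression to target gene expression on reference samples, then uses each sample's deviation from the predicted target expression as a constructed feature. The ordinary least-squares model, the signed residual as the discordance measure, the function names, and the synthetic data are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def fit_regulation_model(tf_ref, target_ref):
    """Fit a least-squares linear map (with intercept) from transcription
    factor expression to target gene expression on reference samples.
    Hypothetical helper; the regression family is an assumption."""
    design = np.column_stack([tf_ref, np.ones(len(tf_ref))])
    coef, *_ = np.linalg.lstsq(design, target_ref, rcond=None)
    return coef


def regulation_features(coef, tf, target):
    """Return observed minus predicted target expression per sample and gene.
    This residual stands in for the 'discordance distance' (an assumption)."""
    design = np.column_stack([tf, np.ones(len(tf))])
    return target - design @ coef


# Synthetic data: 3 transcription factors regulating 2 target genes.
rng = np.random.default_rng(0)
n_tf, n_target = 3, 2
true_w = rng.normal(size=(n_tf, n_target))

tf_ref = rng.normal(size=(50, n_tf))
target_ref = tf_ref @ true_w + 0.01 * rng.normal(size=(50, n_target))

tf_case = rng.normal(size=(20, n_tf))
target_case = tf_case @ true_w + 0.01 * rng.normal(size=(20, n_target))
target_case[:, 0] += 2.0  # simulate dysregulation of the first target gene

coef = fit_regulation_model(tf_ref, target_ref)
ref_feats = regulation_features(coef, tf_ref, target_ref)
case_feats = regulation_features(coef, tf_case, target_case)

# The dysregulated gene shows a large shift; the unaffected gene stays near 0.
print(case_feats.mean(axis=0))
```

In this toy setting the constructed feature for the first target gene separates the two groups because its regulatory relationship has shifted, which is the kind of signal the regulation-based features are meant to expose.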