| Genomic prediction (GP) uses genome-wide DNA markers to predict phenotypes accurately and is a promising, data-driven route to accelerating plant breeding. GP has been used to predict phenotypes in both animals and plants; it has not only improved our understanding of the genetic basis of complex phenotypes but also helped breeders select potentially superior varieties at an early stage and shorten the breeding cycle. Research on applying GP to rice, maize, wheat and other crops has gradually been carried out both domestically and abroad, and GP has been used for tasks such as hybrid prediction, inbred line prediction and parental selection. Existing GP models include statistical genetic methods such as the least absolute shrinkage and selection operator (LASSO), genomic best linear unbiased prediction (GBLUP) and Bayesian regression, and machine learning methods such as random forest (RF), support vector machines (SVM) and deep learning (DL). Most GP models suffer from two common problems. First, their ability to capture higher-order information about interactions between loci is inadequate: genetic information is linked to traits not through DNA markers alone but through multiple intermediate processes, including transcription, translation and metabolism. This complexity obscures the direct connection between genetic information and traits and may make it difficult for GP models to learn interactions between loci. Second, the biological interpretability of these models, that is, which loci contribute most to predicting a phenotype, has not been analyzed thoroughly. Although GP models achieve good predictive performance, the biological principles behind them remain unclear. It remains to be shown exactly how GP models use the information available from each omics layer and which intermediate omics pathways lead from DNA markers to phenotypes. To address the first problem, this study proposed a
new directed learning architecture (DLA), building on the successful integration of genomic and transcriptomic information with the multilayered least absolute shrinkage and selection operator (MLLASSO) and on advances made by integrating sub-traits into GP. To predict the maize yield trait ear weight (EW), we integrated genomic, transcriptomic and sub-trait information from the maize CUBIC population into a single GP model with a three-layer LASSO structure (SNP to transcript to sub-trait to EW). The results showed that yield prediction in the CUBIC population improved twice, once with the addition of the transcriptomic layer and again with the addition of the sub-trait layer, in both the training dataset and the independent test dataset. The Pearson correlation coefficient (PCC) increased from 0.4090 to 0.5230 in the training set and from 0.3079 to 0.3409 in the independent test set. This implies, first, that the more omics levels are integrated, the greater the gain in yield prediction performance, and second, that the transcriptome and sub-traits both contribute useful information for yield prediction. Notably, using more transcripts does not necessarily yield a better DLA model: if poorly predictable transcripts are included in model construction, they may introduce excessive noise. We therefore designate transcripts with good predictability as “genetically predictable genes” (GPGs). The predictive power of the model is optimized only when an appropriate subset of GPGs is selected for model construction, which demonstrates the importance of GPGs in phenotype prediction. To address the second problem, we used a reverse backtracking strategy based on DLA to identify 1,595 yield-related genes, named “EW-related GPG genes”, and explored their biological interpretability. First, functional analysis showed that these genes were enriched in transcription factors (TFs) and transposable elements (TEs), suggesting potential transcriptional regulatory functions. Among
them, TF enrichment analysis showed that three TF families, bHLH, WRKY and CO-like, were enriched, suggesting that some genes may regulate yield indirectly by regulating sub-traits such as plant development, stress resistance and flowering time. Second, we discussed the potential of DLA for detecting yield-related minor-effect genes, which can complement genome-wide association studies (GWAS) by recovering trait-associated genes that are filtered out by Bonferroni correction because of their small effects. Finally, by comparison with the “omnigenic model”, we explained the regulatory networks among some GPGs, sub-traits and EW. The analysis indicated that DLA can be used to search for peripheral genes that play a major role in the overall gene model, helping to account for “missing heritability”. This also makes it possible to extend DLA to GP research on more complex crop traits, broadening its range of application. In summary, by integrating multi-omics data through the DLA strategy, we alleviated to some extent the limited ability of GP models to capture information on interactions between loci and improved prediction accuracy; at the same time, the functions of GPGs were uncovered through the backtracking strategy, which helps remedy the limited biological interpretability of GP models. DLA is expected to provide GP research with a model of stronger predictive power, to predict target economic traits accurately from genotypes, to accelerate genetic improvement in crop breeding, and to identify important loci influencing target traits and thereby guide research into their molecular mechanisms. However, DLA has two limitations: first, the cost and accuracy of measuring multiple omics data may limit its application; second, the predictive performance of the model still has room for improvement. In the future, the backpropagation technique from deep learning may be applied to the training process to reduce training error and
further improve the predictability of the model. |
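
The layered structure described above (SNP to transcript to sub-trait to EW) can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, all function names are hypothetical, the LASSO is a generic coordinate-descent solver, and the penalty `lam=0.05` and the PCC cutoff of 0.5 used to keep "genetically predictable" transcripts are assumed values. The abstract does not state how GPGs are selected; training-set PCC is used here only to keep the example short.

```python
import random

def soft_threshold(z, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_fit(X, y, lam, n_iter=50):
    """Cyclic coordinate descent for (1/2n)*||y - b0 - Xb||^2 + lam*||b||_1.
    X is a list of rows; returns (coefficients, intercept)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]  # centered columns
    col_sq = [sum(r[j] ** 2 for r in Xc) / n for j in range(p)]
    beta = [0.0] * p
    resid = [yi - y_mean for yi in y]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:  # constant column carries no information
                continue
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                beta[j] = new
    intercept = y_mean - sum(b * m for b, m in zip(beta, means))
    return beta, intercept

def predict(X, model):
    beta, b0 = model
    return [b0 + sum(b * x for b, x in zip(beta, row)) for row in X]

def predict_layer(X, models):
    """Predict one column per model and return the result as rows."""
    return [list(r) for r in zip(*[predict(X, m) for m in models])]

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0

# Synthetic data (assumption): 60 lines, 8 SNPs coded 0/1/2.
random.seed(0)
n, p = 60, 8
snps = [[random.choice([0, 1, 2]) for _ in range(p)] for _ in range(n)]
expr_cols = [  # transcripts 0-2 depend on SNPs; transcript 3 is pure noise
    [r[0] - r[1] + random.gauss(0, 0.3) for r in snps],
    [r[2] + 0.5 * r[3] + random.gauss(0, 0.3) for r in snps],
    [r[4] - r[5] + random.gauss(0, 0.3) for r in snps],
    [random.gauss(0, 1) for _ in range(n)],
]
sub_cols = [
    [expr_cols[0][i] + expr_cols[1][i] + random.gauss(0, 0.3) for i in range(n)],
    [expr_cols[2][i] + random.gauss(0, 0.3) for i in range(n)],
]
ew = [sub_cols[0][i] + sub_cols[1][i] + random.gauss(0, 0.3) for i in range(n)]

# Layer 1: SNP -> transcript, one LASSO per transcript.
layer1 = [lasso_fit(snps, col, lam=0.05) for col in expr_cols]
pccs = [pearson(predict(snps, m), col) for m, col in zip(layer1, expr_cols)]
# Keep transcripts above the assumed cutoff; fall back to all if none pass.
keep = [j for j, r in enumerate(pccs) if r > 0.5] or list(range(len(pccs)))
expr_hat = predict_layer(snps, [layer1[j] for j in keep])

# Layer 2: predicted transcripts -> sub-traits.
layer2 = [lasso_fit(expr_hat, col, lam=0.05) for col in sub_cols]
sub_hat = predict_layer(expr_hat, layer2)

# Layer 3: predicted sub-traits -> EW.
layer3 = lasso_fit(sub_hat, ew, lam=0.05)
ew_hat = predict(sub_hat, layer3)
print("training PCC for EW:", pearson(ew_hat, ew))
```

Each layer is trained on the predictions of the layer before it, so at prediction time only genotypes are needed as input. A real analysis would judge transcript predictability and final accuracy on held-out data, as the independent test set in the abstract does, and not on the training set used in this sketch.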