Pumpkin seeds are rich in amino acids, proteins and microelements, and have high nutritional and research value, which gives them strong market prospects. Rising demand makes increasing their yield important, and information on their diversity is significant for identifying seed properties. However, traditional manual inspection and machine-vision detection are time-consuming and laborious. Spectral imaging can characterize the external morphology and the internal quality of a sample simultaneously, allowing the biodiversity of pumpkin germplasm resources to be assessed rapidly and efficiently. In this paper, spectral imaging and gene sequencing are used to study the phenotypic characteristics and genomic data of many pumpkin seed varieties on the basis of their atlas information. Several algorithmic models are built, ranging from varietal purity prediction, through regression analysis of chemical composition, to combined analysis with gene sequence data. The specific content is as follows:

(1) Building identification models for 75 varieties of pumpkin seeds based on atlas fingerprints. Atlas information for all samples was extracted by hyperspectral imaging, terahertz spectral imaging and liquid chromatography-mass spectrometry. Four identification models (LDA, SVM, ELM and CNN) were constructed on four kinds of phenotypic information: external physical characteristics, visible reflectance spectra, near-infrared reflectance spectra and terahertz transmittance spectra. Comparing the four models showed the following. Of the three spectral fingerprints, the terahertz fingerprint gave an identification accuracy of up to 99% and was therefore the most effective for identifying sample varieties. Of the four discrimination models, CNN combined with the four phenotypes performed best for identifying the 75 varieties, with the error stable at 0.2, which
confirmed the power of deep learning. In addition, the spectral characteristic curves of the three wavelength ranges were analysed, and the full-spectrum model results showed that the NIR spectral curves contain highly redundant fingerprint information. Several feature extraction methods were then introduced to extract characteristic spectral variables, and identification models based on the full spectrum were compared with models based on combinations of characteristic variables. The comparison showed that characteristic spectra greatly reduce computation time, and it verified the advantages of the CNN and SPA feature-screening algorithms in feature extraction and the strong generalization ability of the ELM neural network.

(2) Establishing regression models between spectral characteristics and component content based on the seeds' phenotypic data. Data on 20 chemical components of the 75 pumpkin seed varieties were obtained for statistical analysis, and 6 important, independently distributed quality components were selected: starch, soluble sugar, protein, methionine, fat and glycine. PLS regression models relating the 3 spectral fingerprints to the 6 components were built and compared. The R² between the terahertz transmittance fingerprint and chemical composition reached a maximum of 0.96, higher than for near-infrared reflectance (R² = 0.9) and far higher than for visible reflectance (R² = 0.8), which shows that terahertz spectral information faithfully reflects grain quality. The regression results for the 6 components were relatively satisfactory: R² was highest for fat and glycine, intermediate for protein and methionine, and lowest for starch. This indicates a strong correspondence between spectral fingerprint information and internal chemical composition, confirming the feasibility and advantage of using spectral features of phenotypic information to detect seed
quality.

(3) Genetic diversity study of 75 pumpkin cultivars based on genome-wide data. Genome-wide data from high-throughput sequencing were aligned to the reference genome for variant detection and variant quality control; the more informative loci were strictly screened and filtered, leaving 47,796 SNP loci. Phylogenetic tree and population structure analyses were used to explore the genetic similarity of the 75 varieties. The phylogenetic tree and the stacked plot of genetic composition obtained from clustering showed that dividing the whole population into 1, 3 or 5 subgroups was reasonable and acceptable. Principal component analysis, kinship analysis and linkage disequilibrium (LD) analysis showed that LD decayed very rapidly, consistent with the strong fertility and cross-pollinating habit of pumpkin. This indicates large differences in genetic background among the pumpkin samples and a low level of selection, verifying the genetic diversity of the population of 75 varieties.

(4) Constructing a genome-wide association analysis model for pumpkin seed size. GWAS was performed using the genotype data and a significant quantitative phenotypic trait, seed area (seed size). Four GWAS models were constructed: GLM, GLM(Q), MLM(K) and MLM(QK). The QQ plots indicated that GLM was more reasonable than MLM. From the Manhattan plot of the GLM, three significant SNP loci were screened out, with 29 candidate genes associated with these markers. GO and KEGG enrichment analyses were then performed. Candidate genes annotated with the GO terms GO-0009536, GO-0044711, GO-0090407, GO-0004396 and GO-0019200 and related to seed size were identified on chromosomes 3, 5 and 17: CMOCH03G001000, CMOCH05G011710 and CMOCH17G005830. These genes play an important role in regulating the synthesis of internal organic compounds, enzyme activity and the metabolism of biological
macromolecules.
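
The R² values quoted in part (2) compare measured component contents with the contents predicted from a spectral fingerprint. As a minimal sketch, assuming R² denotes the usual coefficient of determination (the abstract does not state the exact definition used), the calculation could look like the following. The function name `r_squared` and the fat-content values are hypothetical and serve only as illustration; they are not data from the study.

```python
def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_measured = sum(measured) / len(measured)
    # Residual sum of squares: disagreement between model and measurement.
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    # Total sum of squares: spread of the measurements around their mean.
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    if ss_tot == 0:
        raise ValueError("measured values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical fat contents (%) for five seed samples; NOT data from the study.
measured_fat = [35.0, 38.0, 41.0, 44.0, 47.0]
# Hypothetical predictions, standing in for the output of a PLS model.
predicted_fat = [36.0, 37.5, 41.0, 43.0, 47.5]

r2_fat = r_squared(measured_fat, predicted_fat)
print(f"R^2 = {r2_fat:.3f}")
```

A value close to 1 means the predictions explain most of the variance in the measured contents, which is the sense in which the terahertz result (0.96) is read as stronger than the near-infrared (0.9) and visible (0.8) results.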