| The development of high-throughput metagenomic sequencing technologies enables the discovery of microbial species directly from natural environments or the human body. Because of this advantage, metagenomic techniques have become the main approach for studying microbial communities and are widely applied to humans, animals, plants, and the environment. The microbial DNA reads obtained by metagenomic sequencing are short. In metagenomic studies, these reads are assembled into longer DNA sequences according to the overlaps between them. In bioinformatics, the assembled sequences are called contigs; they usually cannot completely reconstruct individual genomes because of factors such as low abundance and sequencing errors. Therefore, clustering contigs to build assembled genomes is a necessary step in analyzing the species structure of a metagenome. However, metagenomic datasets are large, with uneven abundance ratios and a large number of species, and the performance of existing contig clustering methods leaves room for improvement. To this end, this thesis proposes GCNbin, a binning algorithm based on a graph convolutional neural network, and Viabin, a multi-label binning optimization tool based on a ZINB autoencoder and label propagation. The main work is as follows: (1) Heterogeneous feature extraction for metagenomic DNA contigs. 4-mer features and coverage features of the contig sequences were extracted and normalized to construct a feature matrix used as the classification feature vectors. Graph structure information over the contig samples was also constructed: an assembly graph and a PE graph were built from the assembly tool and the sequence alignment tool, and a method that uses prior probability information to modify the kernel function was proposed to construct a KNN graph. The graph adjacency matrices were normalized, so that richer sample relationship features could be obtained through a multi-graph strategy. (2) A 
DNA contig clustering algorithm based on dual self-supervised learning of heterogeneous features, GCNbin, was designed. GCNbin simultaneously learns the sequence features and the sample relationship information through an autoencoder and a graph convolutional neural network, and performs self-supervised feature learning and clustering with a dual mutual supervision strategy to complete the binning task. High-confidence bins are obtained from the clustering probabilities output by the softmax layer. In addition, GCNbin automatically determines the number of species k by searching for single-copy genes, so no manual input is required. GCNbin was compared with several state-of-the-art metagenomic binning tools on six simulated and real datasets. The experimental results show that GCNbin achieved the best performance in terms of the F1 score. Notably, GCNbin maintains outstanding binning performance even without coverage features. (3) A multi-label binning optimization tool for metagenomic contigs based on the assembly graph and a ZINB autoencoder, Viabin, was designed. Viabin performs preliminary clustering of unclassified sequences using a ZINB autoencoder and a Gaussian mixture model, and then refines the results of other binning tools using gene alignment techniques and the assembly graph. Experiments show that Viabin improves the recall of other binning tools at the level of contig species and supports multi-label assignment of contigs shared among species, which improves the completeness of the final bins and yields more high-quality bins. In addition, Viabin is lightweight in terms of running speed and memory usage. In conclusion, this thesis designed metagenomic DNA contig clustering methods based on a deep graph convolutional neural network and an autoencoder, addressing the poor clustering of single metagenomes caused by insufficient contig features and the difficulty of effectively binning short contigs and multi-label contigs, which provides support for 
the field of metagenomic species structure research. |
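
The feature extraction step in (1) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names `kmer_frequencies` and `normalize_adjacency` are hypothetical, the 4-mer vector keeps all 256 k-mers without merging reverse complements, and the adjacency normalization uses the symmetric form $D^{-1/2}(A+I)D^{-1/2}$ common in graph convolutional networks, which the abstract does not specify.

```python
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k=4):
    """Return normalized k-mer frequencies in lexicographic ACGT order.

    Assumption: windows containing ambiguous bases (e.g. N) are skipped,
    and reverse complements are NOT merged (256 features for k=4).
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    # Normalize counts to frequencies; an empty contig yields all zeros.
    return [counts[m] / total if total else 0.0 for m in kmers]


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 of a square matrix.

    Assumption: edge weights are non-negative, so every degree is >= 1
    once self-loops are added.
    """
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    features = kmer_frequencies("ACGTACGTAC")
    print(len(features))
    print(normalize_adjacency([[0, 1], [1, 0]]))
```

In a sketch like this, each contig would contribute one row of 4-mer frequencies to the feature matrix, and each graph (assembly, PE, KNN) would be normalized separately before being passed to the graph convolutional layers.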