Research On Metagenome DNA Sequences Clustering Algorithm Based On Reference Species Label Constraint Pre-training

Posted on:2022-12-02

Degree:Master

Type:Thesis

Country:China

Candidate:W Liu

Full Text:PDF

GTID:2480306761460354

Subject:Computer Software and Application of Computer

Abstract/Summary:


Metagenomics uses next-generation sequencing to obtain most of the genetic material (DNA) of the microorganisms in an environment without laboratory cultivation, and then applies genomic research strategies to study the genetic composition and community function of the microorganisms in environmental samples. In recent years, metagenomics has been applied more and more widely to humans, animals, plants and the environment. Unlike traditional sequencing methods, metagenomic sequencing yields raw data consisting of a large number of short DNA fragments derived from many different microorganisms. These fragments can be assembled into longer DNA sequences according to the overlaps between them. In bioinformatics, the assembled sequences are called contigs. Because contigs do not amount to complete genomes, they must be grouped by species, a process known as contig binning. How to bin metagenomic contigs effectively is the focus and difficulty of current research.

Several problems remain: 1) Clustering performance needs to be improved: traditional clustering algorithms cannot distinguish closely related species; 2) Determining the number of clusters: there is a gap between the number of clusters currently determined and the true number of species; 3) Contig features: the feature distribution of contig samples is "unfriendly" to clustering.

To solve these problems, this thesis proposes a pre-trained deep metagenomic contig clustering algorithm, Label-Constrained Deep Clustering (LCDC), based on label constraints from known reference species. The main work is as follows:

(1) A pre-training dataset based on 4-mer frequency is constructed. According to the proportions of species in different communities, community sample sets are built for the human intestine, soil and marine microbial environments. All genome sequences in each community sample set are downloaded and cut into pre-training sequences in proportion. The 4-mer frequency of each sequence is calculated and normalized by dividing by the maximum value, giving pre-training datasets for the human intestine, soil and marine environments.

(2) A pre-training method based on known species label constraints is proposed to address the clustering of contigs from closely related species. Pre-training builds a symmetric five-layer deep autoencoder network, which makes it easier to bin contigs from closely related species with similar features. Because using the network reconstruction error alone as the pre-training loss does not bin contigs well, this thesis designs a pre-training method based on known species label constraints and pre-trains the network separately on each of the three community pre-training datasets. The loss function comprises the network reconstruction error and the species label constraint error; the latter takes an exponential function as its base. The Adam optimizer is used to minimize the pre-training loss, and the pre-trained network is saved. By reducing the species label constraint error, the method improves the network's hidden-layer representation of similar species.

(3) A joint pre-training deep K-means clustering algorithm is proposed. To address the clustering "unfriendliness" of contig samples, a joint pre-training deep contig clustering method that learns while clustering is proposed. The network saved from label-constrained pre-training is used as the initial network, and the Davies-Bouldin index (DBI) is used to evaluate the number of contig clusters, which addresses both the inaccurate number of clusters and the clustering "unfriendliness" of contig samples.

Finally, LCDC is compared with existing automatic contig binning methods and obtains the best experimental results. In summary, the joint pre-training deep contig clustering method proposed in this thesis addresses inaccurate binning caused by similar features among closely related species, the inaccurate number of bins, and the clustering "unfriendliness" of contigs, and provides support for research on contig binning.
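
The feature step in item (1) of the abstract, counting 4-mers and normalizing by the maximum, can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes all 256 4-mers over A, C, G, T are kept as separate features (no merging of reverse complements), that windows containing other characters such as N are skipped, and that the normalization means dividing every count by the largest count in the same sequence. The function names are hypothetical.

```python
from itertools import product

# All 256 possible 4-mers over the DNA alphabet, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def four_mer_counts(seq):
    """Count every overlapping 4-mer in seq (assumed: ambiguous windows skipped)."""
    counts = [0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:  # windows with N or other symbols are ignored
            counts[idx] += 1
    return counts


def max_normalize(counts):
    """Divide each count by the largest count (assumed reading of the normalization)."""
    peak = max(counts)
    if peak == 0:
        return [0.0] * len(counts)
    return [c / peak for c in counts]


def four_mer_features(seq):
    """Return a 256-dimensional, max-normalized 4-mer feature vector."""
    return max_normalize(four_mer_counts(seq))


if __name__ == "__main__":
    feats = four_mer_features("ACGTACGTAC")
    print(len(feats), feats[INDEX["ACGT"]], feats[INDEX["TACG"]])
```

For the toy sequence `ACGTACGTAC`, the 4-mers ACGT, CGTA and GTAC each occur twice and TACG once, so the normalized values are 1.0 and 0.5 respectively. Vectors of this kind would be the input to the autoencoder pre-training described in item (2).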

Keywords/Search Tags:

Metagenomes, k-mer, Deep autoencoder network, Contigs clustering, Label constraints


Related items

1	Research On Metagenome DNA Contigs Clustering Algorithm Based On Deep Graph Convolution And Autoencoder Networks
2	Identifying Viral Sequences And Phage Lifestyles From Metagenomes Based On Deep Learning
3	Research On Classification Algorithm Of Metagenomic DNA Contigs Based On Depth Density Clustering
4	Research Of Fuzzy Clustering Method On Imbalanced Dataset And Its Application In Metagenomic Contigs Binning
5	Effectively Clustering Reads Of Metagenomes
6	Research On Buildings Clustering Methods Based On Multi-Constraints
7	Research Of Metagenomics Contigs Converging And Gene Tagging Algorithm
8	Studies On The Prediction Of Long Non-coding RNAs Based On Deep Neural Network
9	Research Of Metagenomic Contigs Clustering Method Based On Improved Density Peaks
10	Research And Application Of Deep Clustering Methods For Metagenomic DNA Reads