Research on a Metagenomic DNA Sequence Clustering Algorithm Based on Reference Species Label-Constrained Pre-training

Posted on: 2022-12-02 | Degree: Master | Type: Thesis
Country: China | Candidate: W Liu
GTID: 2480306761460354 | Subject: Computer Software and Application of Computer
Abstract/Summary:
Metagenomics uses next-generation sequencing to obtain most of the genetic material (DNA) of the microorganisms in an environment without laboratory cultivation, and then applies genomic research strategies to study the genetic composition and community function of the microorganisms in environmental samples. In recent years, metagenomics has been applied ever more widely to humans, animals, plants and the environment. Unlike traditional sequencing methods, metagenomic sequencing yields raw data consisting of a large number of short DNA fragments derived from many different microorganisms. Based on the overlaps between fragments, these can be assembled into longer DNA sequences, which in bioinformatics are called contigs. Because the contigs do not amount to complete genomes, they must be divided by species, a step known as contig binning. How to bin metagenomic contigs effectively is the focus and the difficulty of current research. Several problems remain: 1) Clustering performance needs to be improved: traditional clustering algorithms cannot distinguish closely related species; 2) Determining the number of clusters: there is a gap between the number of clusters currently determined and the true number of species; 3) Contig features: the feature distribution of contig samples is "unfriendly" to clustering.

To solve these problems, this thesis proposes a pre-trained deep metagenomic contig clustering algorithm, Label-Constrained Deep Clustering (LCDC), based on label constraints from known reference species. The main work is as follows:

(1) A pre-training dataset based on 4-mer frequency is constructed. Community sample sets for the human intestine, soil and marine microbial environments are built according to the proportions of species in each community. All genome sequences in each community sample set are downloaded and cut into pre-training sequences in proportion; the 4-mer frequency features of each sequence are calculated and normalized against the maximum value, yielding pre-training datasets for the human intestine, soil and marine environments.

(2) A pre-training method based on known species label constraints is proposed to address the clustering of contigs from closely related species. The symmetric five-layer deep autoencoder network built during pre-training makes it easier to bin contigs from closely related species with similar features. Using the network reconstruction error alone as the pre-training loss does not bin contigs well, so this thesis designs a pre-training method constrained by known species labels and pre-trains the network separately on each of the three community pre-training datasets. The loss function combines the network reconstruction error with a species label constraint error, and the species label constraint error is built on the exponential function. The Adam optimizer minimizes the pre-training loss, and the pre-trained network is saved. By reducing the species label constraint error, this pre-training method strengthens the network's hidden-layer representation of sequences from the same species.

(3) A deep K-means clustering algorithm combined with pre-training is proposed. To address the "unfriendliness" of contig samples to clustering, the method learns representations while it clusters. The network saved after label-constrained pre-training is used as the initial network, and the Davies-Bouldin index (DBI) is used to evaluate the number of contig clusters. This addresses both the inaccurate number of clusters and the "unfriendliness" of contig samples to clustering.

Finally, LCDC is compared with existing automatic contig binning methods and achieves the best experimental results. In summary, the joint pre-training deep contig clustering method proposed in this thesis addresses inaccurate binning caused by similar features among closely related species, inaccurate estimates of the number of bins, and the "unfriendliness" of contig samples to clustering, and it provides support for research on contig binning.
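As an illustration of the 4-mer feature step described in item (1), the following is a minimal sketch and not the thesis's own code. It assumes that the normalization divides every 4-mer count by the largest count in the same sequence, that windows containing non-ACGT characters are skipped, and that reverse-complement 4-mers are not merged; the function names are hypothetical.

```python
from itertools import product

# All 256 possible 4-mers over the DNA alphabet, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def four_mer_counts(seq):
    """Count every overlapping 4-mer in seq (hypothetical helper)."""
    counts = [0] * len(KMERS)
    seq = seq.upper()
    for start in range(len(seq) - 3):
        slot = INDEX.get(seq[start:start + 4])
        if slot is not None:  # assumption: skip windows with N or other symbols
            counts[slot] += 1
    return counts


def max_normalize(counts):
    """Scale counts into [0, 1] by dividing by the largest count (assumed scheme)."""
    peak = max(counts)
    if peak == 0:
        return [0.0] * len(counts)
    return [c / peak for c in counts]


def four_mer_features(seq):
    """Return the 256-dimensional max-normalized 4-mer feature vector."""
    return max_normalize(four_mer_counts(seq))
```

Calling `four_mer_features` on each pre-training sequence would give one 256-dimensional vector per sequence, which is the kind of input an autoencoder can be pre-trained on.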
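To make the role of the DBI in item (3) concrete, here is a hedged sketch of the standard Davies-Bouldin index and of choosing among candidate clusterings by the lowest score. It is not taken from the thesis: the function names are hypothetical, and the candidate labelings are assumed to come from clustering runs with different numbers of clusters.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index; lower values mean tighter, better separated clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(tuple(point))
    if len(clusters) < 2:
        raise ValueError("DBI needs at least two clusters")
    centers = {lab: centroid(ps) for lab, ps in clusters.items()}
    # Scatter: mean distance from each member to its cluster centroid.
    scatter = {
        lab: sum(math.dist(p, centers[lab]) for p in ps) / len(ps)
        for lab, ps in clusters.items()
    }
    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        # Assumes distinct centroids; coincident centroids would divide by zero.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


def pick_labeling(points, candidate_labelings):
    """Return the candidate labeling with the lowest DBI (hypothetical helper)."""
    return min(candidate_labelings, key=lambda labels: davies_bouldin(points, labels))
```

In this sketch, each candidate labeling would be the output of one clustering run, and the number of distinct labels in the winning candidate is taken as the estimated number of bins.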
Keywords/Search Tags: Metagenomics, k-mer, Deep autoencoder network, Contig clustering, Label constraints