
Research On Classification Algorithm Of Metagenomic DNA Contigs Based On Depth Density Clustering

Posted on: 2022-03-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y Jin
Full Text: PDF
GTID: 2480306329987419
Subject: Control Engineering
Abstract/Summary:
Metagenomics is the discipline that studies DNA sequences obtained directly from environmental samples in order to explore questions such as which microorganisms are present, the species composition, and relative abundance. In recent years, metagenomic research on humans, animals, plants, and the environment has become more extensive. However, a metagenome contains a large number of unknown types of microorganisms, and the raw data obtained through metagenomic sequencing consist of a large number of short DNA fragments. The sequenced DNA fragments are referred to as the raw metagenomic data set, and the longer DNA fragments assembled from short genomic fragments through consecutive overlapping sequences at their ends are referred to as DNA contigs. A key step in metagenomic studies is to classify the DNA contigs according to the species they come from. However, the classification quality is limited by many factors, such as the number of DNA contigs, the differences in abundance ratio among species, and the inconsistency of the DNA contigs. How to effectively partition metagenomic DNA contigs is therefore both the focus and the difficulty of current research. Two problems make effective classification difficult: 1. the DNA contigs in a metagenomic data set differ in length; 2. the DNA contigs must be clustered efficiently and accurately. This paper therefore proposes the following solutions to some of the key points and difficulties in the current DNA contig classification problem:

(1) Extraction of k-mer frequency features of metagenomic DNA contigs. Metagenomic data consist of a large number of DNA sequences from different species, so these sequences have to be classified according to their species attributes. Before the metagenomic analysis is performed, a numerical feature is extracted from each DNA fragment using its k-mer frequencies, and the resulting feature matrix is used as the experimental data set. Because a metagenomic data set is an unbalanced data set, the raw features suffer from a series of problems, such as those related to the source species and to the short-sequence nature of k-mers, which affect clustering performance. Feature learning is therefore needed before clustering.

(2) Construction of a VAE (Variational Auto-Encoder) feature learning model based on contig length characteristics. In this paper, the length of each DNA contig is used to weight its features, which addresses the problem of inconsistent contig lengths affecting the clustering result. First, a VAE model structure and a matching loss function are designed and trained according to the characteristics of the metagenomic data set. Then, the weighted feature vectors are fed to the VAE as input. The VAE based on DNA contig length features is composed of two neural networks: the encoder maps the high-dimensional input to a low-dimensional encoding (called the latent representation), and the decoder maps the encoding to an output of the same size as the original input. Each contig is represented as a point in a high-dimensional space, so the bins of contigs are simply clusters of points in that space.

(3) A classification strategy for metagenomic contigs based on deep density clustering. Using the weighted feature vectors learned by the deep model, this paper applies an improved DBSCAN (Density-Based Spatial Clustering of Applications with Noise) density clustering algorithm to cluster the metagenomic contigs. DBSCAN has two parameters, the neighborhood radius and the minimum number of points. The improved DBSCAN algorithm, based on LSH (Locality-Sensitive Hashing), obtains both parameters automatically, avoiding the clustering errors caused by manual input. Finally, the two parameters and the feature vectors are used as input to perform the complete deep density clustering.

In conclusion, this paper presents a systematic study of metagenomic contig classification using a clustering model based on unsupervised deep learning. It builds a feature matrix, weights the features in that matrix by contig length, constructs a VAE feature learning model suited to metagenomic data sets, and proposes a DBSCAN density clustering algorithm whose parameters are obtained by nearest-neighbor search, achieving a better classification result than existing binning methods.
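
As an illustration of the k-mer frequency feature extraction described in step (1), the following is a minimal Python sketch and not the thesis's implementation. The function name `kmer_frequencies`, the example contig names and sequences, the choice of k = 2, and the decision to skip windows containing ambiguous bases are all assumptions made for this example; the abstract does not specify them.

```python
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return the normalized k-mer frequency vector of one contig.

    Hypothetical helper for illustration; the abstract does not give k
    or the exact normalization used in the thesis.
    """
    alphabet = "ACGT"
    # Fixed ordering of all 4**k possible k-mers, so that every contig
    # is mapped onto the same feature columns.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = sequence.upper()
    # Slide a window of width k over the contig, one base at a time.
    for start in range(len(seq) - k + 1):
        column = index.get(seq[start:start + k])
        if column is not None:  # skip windows with N or other ambiguous bases
            counts[column] += 1
    total = sum(counts)
    # Dividing by the number of counted windows turns counts into
    # frequencies, which makes contigs of different lengths comparable.
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Made-up contigs; each row of the feature matrix is one contig.
contigs = {"contig_1": "ACGTACGTAC", "contig_2": "GGGGCCCCNNAT"}
feature_matrix = {name: kmer_frequencies(seq, k=2) for name, seq in contigs.items()}
```

Each contig becomes a vector with 4^k entries that sum to 1 whenever at least one valid k-mer is found. In the workflow described above, such vectors would then be weighted by contig length and passed to the VAE; those later steps are not shown here.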
Keywords/Search Tags:cluster analysis, k-mer frequency, metagenome DNA contigs, depth cluster, contigs cluster, density cluster