Research on a Metagenomic Contig Clustering Method Based on Improved Density Peaks

Posted on: 2021-03-05 | Degree: Master | Type: Thesis
Country: China | Candidate: Y X Liang
GTID: 2370330620472157 | Subject: Control engineering
Abstract/Summary:
Microorganisms are extremely numerous and diverse. In most cases their native environment cannot be simulated or reproduced in the laboratory, so most environmental microorganisms cannot be obtained by pure culture. Compared with traditional research methods, metagenomics is simpler to learn and operate and can obtain the DNA of all microorganisms directly from an environmental sample, which gives it an irreplaceable role in microbial research. Metagenomic studies help to determine species composition and relative abundance, analyze the structure and function of microbial communities, explain metabolic networks between species, discover valuable metabolites, detect the species and functional genes that play a key role in a specific biological environment, and compare different microbial communities.

Raw metagenome sequencing data consist of short genome fragments from all microorganisms in the environment. Fragments whose ends overlap are joined into longer continuous DNA sequences called contigs. Classifying these contigs by species is an important part of metagenomic analysis, and accurate classification provides reliable support for analyzing the species diversity of a microbial community. However, because of constraints such as the abundance ratio, genome length, and number of the various species, the ideal classification result is difficult to achieve. How to partition metagenomic contigs effectively is therefore a focus of current research. At present, the main difficulties in classifying metagenomic contigs are as follows:
(1) Accurately estimating the number of species in the metagenome. The species are numerous and widely distributed, so their number often cannot be determined accurately.
(2) Accurately clustering the DNA contigs. The clustering result is affected by the length and number of the contigs, so improving clustering accuracy is of great significance.

To address these difficulties, this thesis carries out the following work:
(1) Metagenomic contig feature extraction based on k-mer frequency. k-mer frequencies are used to extract DNA sequence features and to construct the feature matrices used for metagenome classification. Feature extraction consists of data selection and pre-processing, calculation and generation of the feature vectors, denoising of the feature vectors, and calculation of the distances and relationships between the feature vectors, from which the feature matrix is constructed.
(2) An improved density peak algorithm that determines the number of clusters automatically. This overcomes the shortcoming of the traditional approach, in which the cluster centers and their number are identified manually, and obtains the number of species k automatically. First, the density and distance values of each data point are calculated and normalized to construct the decision graph onto which the data points are mapped. Then, from the positions of the points on the decision graph, a new density value is calculated for each point on the graph. The value of k is obtained automatically from the distribution of these density values and two principles for determining cluster centers. The algorithm is verified on different types of data sets, which confirms its effectiveness.
(3) A classification strategy for metagenomic contigs based on improved density peaks. Given the number of species k obtained above, the thesis clusters the metagenomic contigs with a density peak based method. The data points corresponding to the cluster centers are obtained as described above. Each remaining point is then assigned to the same cluster as its nearest neighbor of higher density. A boundary region is defined for each resulting cluster, and noise points are filtered out with a density threshold, which completes the clustering of the metagenomic contigs. Comparison with existing contig clustering methods under different evaluation criteria shows that the proposed algorithm achieves a better clustering result.

In summary, this thesis presents a systematic study of metagenomic contig classification: it constructs a feature matrix, proposes an algorithm that automatically determines the number of species, improves the density peak based contig classification strategy, and achieves a better classification result than existing methods.
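The pipeline summarized above (k-mer frequency features, then density peak clustering) can be illustrated with a minimal sketch. This is not the thesis's improved algorithm. It uses the standard density peak procedure with a Gaussian-kernel density and takes the number of clusters `k` as an input instead of determining it automatically. It also omits the denoising, boundary-region, and noise-filtering steps. The names `kmer_frequencies` and `density_peaks`, the cutoff parameter `d_c`, and the default `k=4` for k-mers are assumptions made for illustration.

```python
from itertools import product
from math import dist, exp


def kmer_frequencies(seq, k=4):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # windows containing N etc. are skipped
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(kmers)


def density_peaks(points, d_c, k):
    """Cluster points into k groups with the standard density peak procedure.

    Returns (labels, centers): a cluster label per point and the indices of
    the k points chosen as cluster centers.
    """
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]
    # Local density rho: Gaussian kernel with cutoff distance d_c (assumed form).
    rho = [sum(exp(-(d[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    order = sorted(range(n), key=lambda i: -rho[i])  # descending density
    delta = [0.0] * n            # distance to nearest point of higher density
    nearest_higher = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(d[i])  # densest point: use its largest distance
            continue
        j = min(order[:rank], key=lambda h: d[i][h])
        delta[i] = d[i][j]
        nearest_higher[i] = j
    # Centers: the k points with the largest rho * delta. Sorting `order`
    # (stable) keeps the densest point first when scores tie.
    centers = sorted(order, key=lambda i: -rho[i] * delta[i])[:k]
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    # Remaining points inherit the label of their nearest higher-density
    # neighbor; descending-density order guarantees it is already labeled.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels, centers
```

In use, each contig would be converted to a vector with `kmer_frequencies` and the list of vectors passed to `density_peaks`. The choice of `d_c` strongly affects the density estimate and would need tuning for real data.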
Keywords/Search Tags:cluster analysis, k-mer frequency, automatic number determination, density clustering, metagenomic contig