| Bacteriophages are viruses that infect bacteria. They can be divided into lytic phages and lysogenic phages. Lytic phages proliferate and lyse the host bacteria. Lysogenic phages integrate their genome into the host genome and lyse the host only under the influence of certain physical and chemical factors; otherwise they persist mainly by being replicated and passed on together with the host genome. In this state, a relatively stable parasitic relationship is established between the lysogenic phage and the host bacterium. A lysogenic phage integrated into the host genome is called a prophage. Because lysogenic phages can mediate horizontal gene transfer, they often have a major impact on the pathogenicity of bacteria. For example, Shiga toxins I and II produced by enterohemorrhagic Escherichia coli are carried by lysogenic phages. Therefore, to better understand how bacterial virulence arises, it is necessary to accurately predict the presence of lysogenic phages in bacteria. However, lysogenic phages are currently discovered mainly by manual methods such as experimental induction and bioinformatics analysis, and the efficiency is very low. On the other hand, current automated prediction tools can only predict prophages on the bacterial genome; they cannot judge whether these prophages are functional, and they cannot extract the complete sequence of the lysogenic phage corresponding to a functional prophage. In response to these problems, this paper proposes an automatic and accurate functional prophage prediction algorithm: LysoPhD. The algorithm reuses the original sequencing data to predict functional prophages and to extract the corresponding lysogenic phage sequences from the bacterial genome. In addition, this paper applies multi-threaded parallel optimization to LysoPhD, which significantly improves its running efficiency. This paper also proposes a method for automatically downloading bacterial
MiSeq sequencing data from the NCBI-SRA database, and uses a multi-process method to perform large-scale analysis of the downloaded bacterial sequencing data. The predicted lysogenic phages were used to build a lysogenic phage database. This paper mainly covers the following three aspects of work: A functional prophage prediction algorithm based on high-throughput sequencing data: LysoPhD. Existing methods for predicting and identifying lysogenic phages (functional prophages) fall into two groups: induction by biological experiment and bioinformatics analysis. Experimental induction is a reliable method of identification. However, it requires a great deal of manpower and material resources and is limited to bacterial strains held locally by the laboratory. The bioinformatics approach finds functional prophages by graphically displaying the connections between contigs, but it requires manual operation and subjective judgment and cannot be automated, which severely limits the efficiency of analysis. There are also some excellent tools based on automated prophage prediction algorithms. However, these tools can only predict the presence of a prophage; they cannot determine whether the prophage is functional, and they cannot extract the corresponding lysogenic phage sequence. To address this problem, the third chapter of this paper designs LysoPhD, a functional prophage prediction algorithm based on high-throughput sequencing data. LysoPhD combines the original sequencing data with the assembled bacterial genome. A quality-control and filtering pipeline is designed to check the quality of the sequencing data and filter the original reads. A rough prophage range is then predicted from the phage-like gene clusters on the contigs of the bacterial genome. Precise prophage candidates are then located within the rough range based on integration sites. Finally, the circularization information is
mined from the original sequencing data, the functionality of the prophage is verified, and the complete sequence of the corresponding lysogenic phage is extracted using a consensus extension algorithm. Induction experiments confirmed that the predictions of LysoPhD were highly consistent with the experimental results and that LysoPhD can effectively predict lysogenic phages in bacterial genomes. Parallel optimization of LysoPhD. The serial LysoPhD algorithm is inefficient: it takes three hours to process bacterial sequencing data at the 800 M scale (a common scale). This severely limits large-scale analysis of massive bacterial sequencing data and the construction of lysogenic phage databases. Therefore, the first part of the fourth chapter analyzes the hotspots of LysoPhD and identifies the parts that can be parallelized. In the prophage prediction part, multi-threading is used to parallelize the operations on each contig; in the functional verification part, multi-threading is used to parallelize the operations on each precise prophage candidate. Test results show that the speedups of the two parts reach 8.3 and 7, respectively, and the overall speedup reaches 7.25. Construction of the lysogenic phage database. Because lysogenic phages are difficult to predict, relatively few have been identified. As a result, there is no complete lysogenic phage database, which limits the research and analysis of lysogenic phage genomes. Therefore, in the fourth part of this paper, a method for constructing a lysogenic phage database based on the parallelized LysoPhD algorithm is proposed. First, a self-designed script automatically downloads massive bacterial sequencing data from the NCBI-SRA database, and the functional prophages in these data are analyzed on a large scale using a two-level parallel approach. The first level uses multiple processes to run the analyses of multiple bacterial sequencing datasets concurrently, and the second level uses the
multi-threaded LysoPhD algorithm within each bacterial sequencing data analysis. This pipeline is automated and efficient, and 2000 lysogenic phage sequences have been predicted from approximately 40,000 bacterial sequencing datasets. |
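
The functional verification step, which mines circularization evidence from the original reads, can be illustrated with a minimal sketch. It is not the LysoPhD implementation. It assumes that an excised prophage circularizes, so that reads from the circular form span a junction where the end of the candidate is joined to its start, and that the candidate coordinates contain exactly one copy of the attachment-site repeat. The names `junction_probe`, `count_junction_reads`, `looks_functional`, and the parameters `flank` and `min_reads` are hypothetical.

```python
from typing import Iterable

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def junction_probe(contig: str, start: int, end: int, flank: int = 20) -> str:
    """Sequence expected across the junction of a circularized prophage.

    contig[start:end] is the candidate (0-based, end-exclusive). Assumption:
    the candidate holds only one copy of the attachment-site repeat.
    """
    prophage = contig[start:end].upper()
    if len(prophage) < 2 * flank:
        raise ValueError("candidate is shorter than 2 * flank")
    # In the circular form, the last bases are followed by the first bases.
    return prophage[-flank:] + prophage[:flank]


def count_junction_reads(reads: Iterable[str], probe: str) -> int:
    """Count reads that contain the probe on either strand."""
    rc = reverse_complement(probe)
    return sum(1 for read in reads if probe in read.upper() or rc in read.upper())


def looks_functional(contig: str, start: int, end: int, reads: Iterable[str],
                     flank: int = 20, min_reads: int = 3) -> bool:
    """Hypothetical rule: enough junction-spanning reads suggest excision."""
    probe = junction_probe(contig, start, end, flank)
    return count_junction_reads(reads, probe) >= min_reads
```

Exact substring matching keeps the sketch short; a practical tool would more likely align reads and tolerate sequencing errors, and the threshold `min_reads` would need to be calibrated against sequencing depth.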
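
The two-level parallel approach described above, with processes across sequencing datasets and threads across contigs within one dataset, might be organized as in the following sketch. It only shows the structure under stated assumptions: `scan_contig` is a placeholder that computes GC fraction rather than predicting prophages, and the names `Sample`, `analyze_sample`, `analyze_all`, and `pool_factory` are hypothetical and do not come from LysoPhD.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from functools import partial
from typing import Dict, Iterable, List, Tuple

# Hypothetical layout: (sample id, list of assembled contig sequences).
Sample = Tuple[str, List[str]]


def scan_contig(contig: str) -> float:
    """Placeholder per-contig task: GC fraction stands in for real analysis."""
    if not contig:
        return 0.0
    return sum(base in "GC" for base in contig.upper()) / len(contig)


def analyze_sample(sample: Sample, n_threads: int = 4) -> Tuple[str, List[float]]:
    """Second level: threads over the contigs of a single sample."""
    sample_id, contigs = sample
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sample_id, list(pool.map(scan_contig, contigs))


def analyze_all(samples: Iterable[Sample], n_procs: int = 2, n_threads: int = 4,
                pool_factory=ProcessPoolExecutor) -> Dict[str, List[float]]:
    """First level: one worker process per sample by default."""
    # partial over a module-level function stays picklable for process pools.
    worker = partial(analyze_sample, n_threads=n_threads)
    with pool_factory(max_workers=n_procs) as pool:
        return dict(pool.map(worker, samples))
```

With the default `pool_factory`, `analyze_all` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. In CPython, threads speed up the inner level only when the per-contig work releases the global interpreter lock, for example when it calls external tools or native libraries.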