Construction Of Pathogen Database And Classification Of Pathogen Sequences Based On Deep Learning

Posted on: 2021-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: K Lu
Full Text: PDF
GTID: 2370330614970434
Subject: Bioinformatics
Abstract/Summary:
Infectious diseases are caused by pathogens and can spread, directly or indirectly, from one person to another. They have claimed countless lives and threatened the health and safety of all mankind. Throughout human history, massive outbreaks have occurred repeatedly, causing uncountable casualties and incalculable losses: leprosy in the 13th century, the Black Death in the 14th century, smallpox in the 16th century, cholera, typhus and tuberculosis in the 19th century, the Spanish flu pandemic, polio and AIDS in the 20th century, and malaria and SARS in the 21st century. These diseases have caused vast numbers of deaths and have long been major threats to human health and safety. According to an important study in The Lancet, infectious diseases, maternal and neonatal conditions, and nutritional deficiencies caused more than 10 million deaths worldwide in 2017, accounting for 18.57% of all deaths. Deaths caused by infectious diseases are especially dominant in some low- and middle-income countries. For example, diarrhea caused by infectious diseases is the main cause of death in Kenya.

Emerging infectious diseases are one of the most noteworthy areas of infectious disease research. The term covers both newly appearing infectious diseases and those that suddenly re-emerge. Since the start of the 21st century, medical and health conditions have improved greatly, but emerging infectious diseases have continued to appear, such as SARS in 2002, H1N1 influenza in 2009, MERS in 2012, Ebola in 2014 and 2019-nCoV in 2019. These diseases not only directly cause large numbers of casualties but also indirectly and seriously affect economic development and social stability. With the continuous development of the medical and health field, extraordinary achievements have been made in the diagnosis, treatment and vaccine development of
infectious diseases. However, as globalization accelerates, population mobility and global trade have increased, which raises both the risk and the speed of spread of emerging and sudden infectious diseases. The surveillance, prevention and control of these diseases have therefore become a problem that the world needs to address together.

Infectious diseases are transmitted by pathogens: microorganisms or other agents that infect humans, animals or plants, such as viruses, bacteria, fungi, protozoa and parasites. Pathogens are diverse in species, intermediate hosts and routes of transmission, and they mutate and evolve quickly; these characteristics make infectious diseases difficult to eliminate. In the past few decades, the pathogens of greatest concern have been viruses, such as HIV, SARS and Ebola. It is estimated that viruses account for up to 44% of known pathogens. Because viruses have short replication cycles and high mutation rates, viral diseases have become the main emerging and sudden infectious diseases in recent years. Bacteria and rickettsiae account for 38% of known pathogens and cause 30% of emerging infectious diseases. Many bacterial infectious diseases, such as plague and cholera, are also making a comeback, and the emergence and spread of antibiotic resistance makes pathogenic bacteria difficult to eliminate.

Given the increasing variety of pathogens and the gradual acceleration of their variation and evolution, it is particularly important to collect and integrate pathogen-related resources. This study therefore developed the database PDID. The database includes genomic data for 55 infectious diseases and 59 pathogens, together with antibiotic resistance data and virulence factor data, covering 39 types of infectious diseases stipulated in the Law of the People's Republic of China on the Prevention and Treatment of Infectious Diseases (revised in
2013). In addition, 9 online tools for analyzing genomic variation and evolution are provided. By integrating a variety of resources on infectious disease pathogens, the database aims to give researchers a comprehensive, one-stop analysis platform with a friendly interface and convenient access to data, so as to advance research on infectious disease pathogens and protect human life, health and safety.

In addition to the collection and integration of pathogen resources, pathogen identification is an important prerequisite for clinical treatment and for the development of related vaccines and specific drugs. At present, there are four main diagnostic methods: isolation and culture, microscopic examination, antigen component detection and molecular nucleic acid diagnosis. Isolation and culture and microscopic examination are time-consuming, labor-intensive and have low sensitivity, while antigen component detection has a high false-negative rate. In comparison, the commonly used nucleic acid amplification method, RT-PCR, is fast, highly sensitive and highly specific, but it also has disadvantages: it cannot detect new and highly variant virus strains, it places high demands on primers, and it requires laboratory personnel to master the corresponding experimental skills. There is therefore an urgent need for an accurate, efficient and rapid technology for pathogen detection and diagnosis.

Deep learning (DL) is a sub-field of machine learning (ML). It achieves a form of artificial intelligence (AI) by imitating the way the human brain processes information and makes decisions. In recent years, with the rapid growth and accumulation of data, deep learning has gradually matured. It has excelled in many fields, such as machine translation, self-driving, intelligent assistants, face recognition and personalized recommendation. Especially
in the biomedical field, the rapid development of high-throughput sequencing technology has led to exponential growth of biomedical data, and deep learning has been widely used, mainly in three areas. The first is biomedical image recognition and classification, such as the use of convolutional neural networks (CNNs) for brain tumor image segmentation, pancreatic CT image segmentation, and colon cancer image recognition and classification. The second is proteome data analysis and protein structure prediction, such as using CNNs to predict ordered or disordered protein regions and protein structure. The third is the analysis of genome sequencing data, such as using recurrent neural networks (RNNs) to predict transcription factor binding sites and DNA splicing regions. However, no research has yet used neural network models to identify and classify the genome sequences of infectious disease pathogens.

Coronavirus (CoV) is an enveloped, positive-sense single-stranded RNA virus that causes a variety of diseases in mammals and birds. Severe acute respiratory syndrome coronavirus (SARS-CoV) in 2002, Middle East respiratory syndrome coronavirus (MERS-CoV) in 2012 and the novel coronavirus (2019-nCoV) in 2019 have had a huge negative impact on national health, social stability and economic development. Moreover, because the genomes of the pathogens behind new viral infectious diseases vary and evolve frequently, traditional RT-PCR methods cannot be used in the early stage of an outbreak, when no probes are yet available. Investigators must then rely on laboratory culture and observation and on bioinformatics analysis of genome sequencing data, but culture and analysis often take too long. If a prediction can be made before the pathogen is isolated, it can guide isolation and greatly improve its efficiency. It also helps to identify the target pathogen type quickly and to design pathogen PCR primers quickly
after the full-length genome sequence of the pathogen has been determined, so as to gain valuable time for the prevention and control of new infectious diseases.

To improve the efficiency and performance of detecting novel and highly variant coronavirus sequences in high-throughput sequencing data from samples, this study developed a tool for the rapid detection of coronavirus sequences from new and highly variant strains. The tool was built by collecting coronavirus and human whole-genome sequences, simulating high-throughput sequencing data from samples infected with coronavirus, and training a neural network model based on the gated recurrent unit (GRU), a type of recurrent neural network. Compared with traditional bioinformatics tools, this tool shortens computing time, reduces computing resource requirements, and avoids the need to download reference genomes. Its accuracy, sensitivity and specificity on the validation set and the test set all exceed 99%, and its sensitivity on an independent 2019-nCoV test set is 99.81%, indicating that the model generalizes well.
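
The data preparation and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the implementation used in the thesis: the read length, the substitution-only error model, the one-hot encoding and the function names `simulate_reads`, `one_hot` and `binary_metrics` are all assumptions made for illustration, and the GRU classifier itself is omitted.

```python
import random

BASES = "ACGT"


def simulate_reads(genome, read_len, n_reads, error_rate=0.01, rng=None):
    """Sample fixed-length reads uniformly from `genome`.

    Assumption: sequencing errors are modelled as independent
    substitutions only (no insertions or deletions).
    """
    if read_len > len(genome):
        raise ValueError("read_len must not exceed the genome length")
    rng = rng or random.Random(0)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with a different one chosen at random.
                read[i] = rng.choice([b for b in BASES if b != base])
        reads.append("".join(read))
    return reads


def one_hot(read):
    """Encode a read as one 4-element vector per base (order A, C, G, T).

    Bases outside ACGT, such as N, become an all-zero vector.
    """
    return [[1 if base == b else 0 for b in BASES] for base in read]


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for 0/1 labels.

    Label 1 stands for a coronavirus read and 0 for a human read.
    """
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1

    def ratio(num, den):
        # Undefined when a class is absent from the evaluation set.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }
```

In such a setup, reads simulated from coronavirus genomes would be labelled 1 and reads simulated from the human genome labelled 0; the encoded reads would be fed to a sequence model, and `binary_metrics` would then summarize its predictions on held-out data.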
Keywords/Search Tags:pathogens, database, deep learning, coronavirus, sequence classification