Integration of hepatitis B virus (HBV) has been shown to be one of the main causes of liver cancer. Studying methods for detecting HBV integration sites and the underlying biological mechanism is therefore of great significance for the prevention and treatment of liver cancer. HBV integration is an insertional variation formed when HBV DNA is embedded in the DNA sequence of the host genome, but traditional insertional variation detection tools do not detect HBV integration well, so a tool that can accurately detect HBV integration sites is needed. In addition, a large number of studies have shown that HBV integration can lead to the occurrence and development of liver cancer, but its specific mechanism of action and the selection of integration sites remain unclear. To address these two problems, this paper first proposes a detection pipeline for HBV integration sites based on feature coding, named Virol SDC. Bioinformatics methods are then used to analyze the distribution of integration sites in cancer tissues and paracancerous tissues, in order to explore the specific mechanism of HBV integration and its carcinogenesis. The work of this paper is divided into the following four aspects: (1) To enable NLP models to make better use of sequencing data, corpus generation rules were formulated and feature encoding was performed on the corpus data. This paper treats sequence alignment data as a special language, so the CIGAR fields of the alignment features within the detection window are converted into corpus data composed mainly of five words, including M, S, H and O. The corpus data are then converted by a pre-trained bioinformatics model into a tensor of word vectors, which provides the data basis for the subsequent integration site detection model to capture the general rules and semantic information of this special language of sequence alignment. (2) In
order to accurately detect HBV integration sites, a complete integration site detection pipeline was designed, and a variation detection model was constructed and then improved by extracting salient features. Experiments show that the proposed HBV integration site detection pipeline can effectively detect HBV integration sites and filter out non-integration insertion variations. Comparative experiments also show that extracting salient features effectively alleviates the vanishing and exploding gradients caused by overly long feature sequences, so that the integration site detection model fits faster and more accurately. (3) To evaluate the effectiveness of the proposed HBV integration site detection method, a series of experiments was designed to compare five tools, including Virol SDC. The experiments compared the performance of the five tools at different sequencing depths and different integrated sequence lengths on both real and simulated data sets. The results show that Virol SDC is relatively insensitive to integrated sequence length and sequencing noise and has better overall performance. (4) To study the specific mechanism of HBV integration and carcinogenesis, this paper examines integration sites in different tissues, different sexes and different gene sequences. The experimental results showed that HBV was more likely to integrate into chromosomes 5, 8, 17, 19 and 20 in cancer tissues and into chromosomes 17, 19, 20 and 22 in paracancerous tissues. Adjacent tissues had a higher correlation with HBV integration, and men were more likely than women to be invaded by HBV. Finally, the study also found that HBV may affect liver cancer-related genes such as TERT, KMT2B and FN1 in the course of causing the occurrence and development of hepatocellular carcinoma, and
found PDCD6-AHRR, an HCC-related gene independent of the NCG and HCCDB databases.
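
The corpus generation step described in aspect (1) can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function names `cigar_to_words` and `window_to_sentence` are hypothetical, the sketch assumes one word per CIGAR operation, and it assumes that O stands for any operation other than the kept ones. Because the abstract names only M, S, H and O, the set of kept operation codes is left as a parameter rather than guessed.

```python
import re

# One CIGAR operation: a run length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_to_words(cigar, keep=("M", "S", "H"), other="O"):
    """Map each CIGAR operation to a word; codes outside `keep` become `other`.

    Assumption: one word per operation, run lengths are discarded.
    """
    if cigar == "*":  # SAM writes "*" when no CIGAR is available
        return []
    words = []
    pos = 0
    for match in CIGAR_OP.finditer(cigar):
        if match.start() != pos:  # unparsed characters between operations
            raise ValueError(f"malformed CIGAR string: {cigar!r}")
        pos = match.end()
        op = match.group(2)
        words.append(op if op in keep else other)
    if pos != len(cigar):  # trailing characters that are not an operation
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return words


def window_to_sentence(cigars):
    """Join the words of all reads in one detection window into one sentence."""
    return " ".join(word for cigar in cigars for word in cigar_to_words(cigar))


# Four made-up reads from one hypothetical detection window.
print(window_to_sentence(["60M40S", "30S70M", "50M2D48M", "45H55M"]))
# prints: M S S M M O M H M
```

A sentence such as this could then be passed to a word-embedding model to obtain the tensor of word vectors. Keeping or discarding the run lengths is a design choice that the abstract does not specify.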