| Objective: Whole exome sequencing is an unbiased way to identify mutations in the coding regions. Currently, genetic studies of hepatocellular carcinoma (HCC) using next-generation sequencing technology are still limited, and a standard data analysis pipeline is not available. To study the underlying molecular genetics of virus-related HCC, we compared methods of pre-processing and variant calling in whole exome sequencing analyses. We then established a pipeline for the identification of somatic mutations and applied it to whole exome sequencing data from 10 patients with HBV-associated HCC. Method: 1. We compared the effects of FASTX-Toolkit and Trimmomatic in pre-processing the exome data, the strategy of single-end (SE) read inclusion, and ‘Hard’ filtering versus variant quality score recalibration (VQSR) in variant filtering, using whole exome sequencing data from two test samples. We assessed the depth of coverage (DP), number of variants, transition/transversion (Ti/Tv) ratio, and genotype concordance under different scenarios. 2. We established a pipeline for the identification of tumor somatic mutations using MuTect and SomaticIndelDetector, which call somatic point mutations and insertions/deletions (indels), respectively. 3. Using publicly available whole exome sequence data from tumor tissue and adjacent non-tumor liver tissue of 10 HBV-associated HCC patients, we identified somatic mutations underlying HCC and then performed functional pathway analyses with IPA. Result: 1. Reads pre-processed with Trimmomatic showed a DP similar to that of reads without pre-processing, but significantly greater than that of reads pre-processed with FASTX-Toolkit. With DP ≥ 10× and genotype quality (GQ) ≥ 20, the number of single nucleotide variants (SNVs) called after Trimmomatic pre-processing was greater than after FASTX-Toolkit pre-processing, but similar to that without pre-processing. 
With the inclusion of SE reads, the number of variants increased significantly more for FASTX-Toolkit pre-processing (~28%) than for Trimmomatic pre-processing (~5%). In all settings, ‘Hard’ filtering removed fewer SNVs than VQSR filtering at this small sample size. 2. After the HBV-related HCC exome sequence reads were aligned to the reference sequence without pre-processing, the mean DP ranged from 9.76× to 19.02×. We identified 1100 non-silent somatic mutations within 926 genes and 34 non-silent somatic indels within 34 genes. Of the 1100 non-silent somatic mutations, 360 were novel. The IPA analyses showed that some of these genes were associated with cancer and with three pathways: the GADD45 pathway (P = 5.42E-03), fatty acid β-oxidation III (P = 6.31E-03), and oxidative ethanol degradation III (P = 6.85E-03). Conclusion: 1. Sequence reads were trimmed and/or filtered moderately by Trimmomatic, whereas they seemed to be over-filtered by FASTX-Toolkit. Keeping the SE reads benefits variant calling in the downstream analysis. ‘Hard’ filtering was more tolerant than VQSR filtering, retaining more SNVs at a small sample size. 2. We established an analysis pipeline for identifying tumor somatic mutations from whole exome sequence data, and applied this pipeline to exome sequencing data from tumor and adjacent normal tissue pairs of 10 patients with HBV-related HCC. The pipeline will be helpful in our further study to characterize the patterns of HCC somatic mutations and to discover potential driver mutations underlying virus-associated HCC. |
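
The evaluation metrics named in the Method (a DP/GQ threshold, the Ti/Tv ratio, and genotype concordance) can be illustrated with a minimal Python sketch. This is not the pipeline described above: the function names, the tuple layout of the toy records, and the toy values are all assumptions made for illustration, and only the thresholds DP ≥ 10 and GQ ≥ 20 come from the text.

```python
# Hypothetical helpers illustrating the metrics named in the abstract.
# The record layout (ref, alt, dp, gq) and all names are assumptions.

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def passes_filter(dp, gq, min_dp=10, min_gq=20):
    """Keep a call only if depth and genotype quality meet both thresholds."""
    return dp >= min_dp and gq >= min_gq


def ti_tv_ratio(snvs):
    """Ratio of transitions to transversions over (ref, alt) SNV pairs."""
    ti = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    tv = len(snvs) - ti
    return ti / tv if tv else float("nan")


def genotype_concordance(calls_a, calls_b):
    """Fraction of sites called in both sets that carry the same genotype."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return float("nan")
    matches = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return matches / len(shared)


# Toy SNV records: (ref, alt, depth, genotype quality). Values are invented.
records = [
    ("A", "G", 12, 30),  # transition, passes
    ("C", "T", 15, 25),  # transition, passes
    ("A", "C", 20, 40),  # transversion, passes
    ("G", "A", 8, 50),   # fails the depth threshold
    ("T", "G", 30, 10),  # fails the genotype quality threshold
]
kept = [(ref, alt) for ref, alt, dp, gq in records if passes_filter(dp, gq)]
ratio = ti_tv_ratio(kept)

# Toy genotype calls keyed by (chromosome, position) for two scenarios.
calls_a = {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr2", 50): "0/1"}
calls_b = {("chr1", 100): "0/1", ("chr1", 200): "0/1", ("chr3", 5): "1/1"}
concordance = genotype_concordance(calls_a, calls_b)

print(len(kept), ratio, concordance)
```

Concordance is computed here only over sites called in both sets, which is one common convention; the source does not state which definition was used.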