
Design And Implementation Of Copy Number Preprocessing System Based On PCF

Posted on: 2018-09-05, Degree: Master, Type: Thesis
Country: China, Candidate: Z Liu
GTID: 2428330596952961, Subject: Information and Communication Engineering
Abstract/Summary:
Genomic copy number variation (CNV) is a form of genomic structural variation that occurs in humans and other mammals. It is a leading cause of certain diseases. To explore the pathogenesis of copy number variation, researchers usually use a gene chip to measure the genome-wide copy number of cancer cell samples. However, cancer samples inevitably contain some normal cells in addition to cancer cells, so the copy number detected by the chip is that of a mixture of cancer and normal cells, and the measured copy number of a cancer sample deviates from its true value. The aim of this study is to design a preprocessing system for the copy number of cancer cell samples detected by the SNP6 chip, to calculate the normal cell fraction in a cancer sample, and to restore the true copy number of the cancer cells. The main research of this paper is as follows:

(1) On the basis of the BACOM algorithm, this paper presents a copy number preprocessing method. The data source is a matched pair of normal and cancer samples. We designed a new data processing pipeline that obtains the true copy number by correcting for the normal cells in the cancer sample. The system provides an absolute normalization criterion for the subsequent detection of CNV regions. Comparing this system with the BACOM and ABSOLUTE methods verified its feasibility and validity.

(2) To extract the allelic balance loci, this paper first identified AB-type sites by K-Means, then computed the Pearson correlation coefficient over the AB-genotype sites within a sliding window. Next, through an interval matching method, it found the true integer value corresponding to each observed copy number and completed the normalized correction of the genome-wide copy number. A comparison with the BACOM algorithm, which only determines the deletion type, verified that the interval matching method has an advantage in determining the number of copy number types.

(3) In the copy number segmentation module, a variety of segmentation algorithms were implemented and compared. First, this paper implemented the mainstream segmentation algorithms, including an HMM-based algorithm, a recursion-based algorithm, a Lasso-based algorithm, and the PCF segmentation algorithm. Second, it designed a simulation data generation model based on a random number generator and multiple templates, and used this model to generate simulated test data sets on which the above segmentation algorithms were compared and analyzed. Finally, with segmentation accuracy and segmentation efficiency as the measurement criteria, the PCF algorithm performed favorably on both.

(4) On the basis of the above theoretical research, this paper introduced the distributed computing framework Apache Spark and ported the K-Means clustering used for genotyping and the correlation coefficient calculation for the allelic balance loci to the Spark platform, verifying the feasibility of porting the single-core algorithms to distributed algorithms on Spark.
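The correction and segmentation steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes a simple two-component mixture in which normal cells carry two copies (observed = f * 2 + (1 - f) * true), and it treats PCF as a penalized least-squares piecewise constant fit solved by dynamic programming. The function names, the `penalty` parameter, and the toy data are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def correct_for_normal_cells(observed: Sequence[float],
                             normal_fraction: float,
                             normal_copy: float = 2.0) -> List[float]:
    """Invert the assumed mixture observed = f*normal_copy + (1-f)*true."""
    if not 0.0 <= normal_fraction < 1.0:
        raise ValueError("normal_fraction must be in [0, 1)")
    f = normal_fraction
    return [(x - f * normal_copy) / (1.0 - f) for x in observed]


def pcf_segment(values: Sequence[float],
                penalty: float) -> List[Tuple[int, int, float]]:
    """Piecewise constant fit: minimize squared error + penalty per segment.

    Returns (start, end, mean) triples, with end exclusive.
    """
    n = len(values)
    if n == 0:
        return []
    # Prefix sums let us get any segment's squared error in O(1).
    s = [0.0] * (n + 1)
    q = [0.0] * (n + 1)
    for i, v in enumerate(values):
        s[i + 1] = s[i] + v
        q[i + 1] = q[i] + v * v

    def sse(i: int, j: int) -> float:
        total = s[j] - s[i]
        return (q[j] - q[i]) - total * total / (j - i)

    # best[j] is the minimal penalized cost of segmenting values[:j].
    best = [0.0] * (n + 1)
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        best[j] = float("inf")
        for i in range(j):
            cost = best[i] + sse(i, j) + penalty
            if cost < best[j]:
                best[j] = cost
                prev[j] = i

    # Walk back through the stored split points to recover the segments.
    segments = []
    j = n
    while j > 0:
        i = prev[j]
        segments.append((i, j, (s[j] - s[i]) / (j - i)))
        j = i
    segments.reverse()
    return segments


# Toy data: true copy numbers 2 and 3, diluted by 50% normal cells.
mixed = [2.0] * 5 + [2.5] * 5
corrected = correct_for_normal_cells(mixed, normal_fraction=0.5)
segments = pcf_segment(corrected, penalty=0.5)
```

A larger `penalty` yields fewer, longer segments. The quadratic-time dynamic program is used here for clarity only, and genome-scale data would need a faster formulation.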
Keywords/Search Tags:Copy number variation, Segmentation algorithm, Spark, Allele extraction, Least square method