
Balancing utility and anonymity in public biomedical databases

Posted on: 2006-08-15
Degree: Ph.D
Type: Dissertation
University: Stanford University
Candidate: Lin, Zhen
Full Text: PDF
GTID: 1456390008961193
Subject: Biology
Abstract/Summary:
Interest in understanding how genomic variations influence heritable disease and drug response is intense. Providing unrestricted electronic access to biological and clinical data promotes scientific research. However, it risks disclosing the identity and health information of research subjects. These data may contain fingerprint-like unique signatures that can identify people even without links to commonly known identifiers. Thus medical data, especially those containing human genetic information, should be made public only if we can adequately protect the privacy of these research subjects.

To address this challenge, I have investigated automatic methods to analyze the disclosure risk of public single nucleotide polymorphism (SNP) databases. Such an analysis depends critically on a detailed understanding of the patterns of linkage disequilibrium (LD) among SNPs throughout the human genome. Knowledge of this structure allows one to estimate the disclosure risk of a SNP database accurately and to flag records whose disclosure risk is potentially unacceptably high.

Specifically, I have investigated mechanisms for restricting the SNPs released publicly and for quantifying their disclosure risk. First, I implemented a metric to quantify the information content in binning, a novel method to reduce the specificity of phenotypes and genotypes such as SNPs. The results show that the utility of the binned data becomes minimal at relatively small bin sizes. Therefore, the binning method may be of little use for protecting genome-wide SNP data.

I further illustrated that existing data obfuscation methods are either insufficient to prevent disclosure or destroy the information content in the context of whole-genome SNP analysis. The evaluation of the probability of identifying individuals based on independent SNPs showed that research-use SNP data are highly identifiable, and that a small subset of independent SNPs in the genome (30 to 80) would be enough to identify an individual.

Finally, to quantify the disclosure risk of SNP databases, I developed methods based on principal component analysis and an odds ratio test to identify tagging SNPs in the genome, and applied these methods to experimental datasets with different patterns of LD. The evaluation results show that the tagging SNPs identified are the ones potentially useful for genotype-phenotype association studies, yet they are also the ones with a strong ability to distinguish individuals.
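
The abstract's claim that a few dozen independent SNPs can single out an individual can be made concrete with a back-of-the-envelope calculation. The sketch below is not the dissertation's method. It assumes biallelic SNPs in Hardy-Weinberg equilibrium, full independence between SNPs, the same minor allele frequency at every SNP, and unrelated individuals. The function names `genotype_match_probability` and `snps_needed` are hypothetical and chosen only for illustration.

```python
import math


def genotype_match_probability(maf):
    """Chance that two unrelated people share a genotype at one SNP.

    Assumption: a biallelic SNP in Hardy-Weinberg equilibrium with
    minor allele frequency `maf`.
    """
    p, q = maf, 1.0 - maf
    # Genotype frequencies: homozygous major, heterozygous, homozygous minor.
    freqs = (q * q, 2 * p * q, p * p)
    # Two people match if both carry the same genotype.
    return sum(f * f for f in freqs)


def snps_needed(maf, population, expected_matches=1.0):
    """Smallest number of independent SNPs for which the expected number
    of coincidental full-profile matches in `population` is at or below
    `expected_matches`.

    Assumption: every SNP has the same `maf` and SNPs are independent,
    so per-SNP match probabilities multiply.
    """
    match = genotype_match_probability(maf)
    n = math.log(population / expected_matches) / -math.log(match)
    return math.ceil(n)


if __name__ == "__main__":
    for maf in (0.5, 0.3, 0.1):
        print(maf, snps_needed(maf, population=1e10))
```

Under these simplifying assumptions the required count comes out in the tens of SNPs for a population of ten billion, and it grows as the minor allele frequency falls, because rare variants make most people look alike at that site. Real SNP panels are correlated through LD and have mixed allele frequencies, so the numbers here only illustrate the order of magnitude.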
Keywords/Search Tags:Data, SNP, Disclosure risk, Public