Balancing utility and anonymity in public biomedical databases

Posted on:2006-08-15

Degree:Ph.D

Type:Dissertation

University:Stanford University

Candidate:Lin, Zhen

GTID:1456390008961193

Subject:Biology

Abstract/Summary:

Interest in understanding how genomic variations influence heritable disease and drug response is intense. Providing unrestricted electronic access to biological and clinical data promotes scientific research. However, it risks disclosing the identity and health information of research subjects. These data may contain fingerprint-like unique signatures that can identify people even without links to commonly known identifiers. Thus medical data, especially those containing human genetic information, should be made public only if we can adequately protect the privacy of these research subjects.

To address this challenge, I have investigated automatic methods to analyze the disclosure risk of public single nucleotide polymorphism (SNP) databases. Such an analysis depends critically on a detailed understanding of the patterns of linkage disequilibrium (LD) among SNPs throughout the human genome. Knowledge of this structure allows accurate estimation of the disclosure risk of a SNP database and makes it possible to flag records with potentially unacceptably high disclosure risk.

Specifically, I have investigated mechanisms for restricting the SNPs released publicly and for quantifying their disclosure risk. First, I implemented a metric to quantify the information content in binning, a novel method to reduce the specificity of phenotypes and genotypes such as SNPs. The results show that the utility of the binned data becomes minimal at relatively small bin sizes. Therefore, the binning method may be of little use for protecting genome-wide SNP data.

I further illustrated that existing data obfuscation methods either are insufficient to prevent disclosure or destroy information content in the context of whole-genome SNP analysis. The evaluation of the probability of identifying individuals based on independent SNPs showed that research-use SNP data are highly identifiable, and that a small subset of independent SNPs in the genome (30 to 80) would suffice to identify an individual.

Finally, to quantify the disclosure risk of SNP databases, I developed methods based on principal component analysis and odds ratio tests to identify tagging SNPs in the genome, and applied these methods to experimental datasets with different patterns of LD. The evaluation results show that the tagging SNPs identified are the ones potentially useful for genotype-phenotype association studies, yet they are also the ones with a strong ability to distinguish individuals.
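To give a feel for why a few dozen independent SNPs can single out one person, here is a rough back-of-envelope sketch. It is not the dissertation's method. It assumes biallelic SNPs in Hardy-Weinberg equilibrium, full independence between SNPs, a single minor allele frequency shared by all SNPs, and a hypothetical reference population of 7 billion people. The function names are made up for illustration.

```python
import math


def genotype_match_probability(maf: float) -> float:
    """Chance that two unrelated people share a genotype at one biallelic SNP.

    Assumes Hardy-Weinberg equilibrium, so the three genotype frequencies
    are q^2, 2pq and p^2 with p = maf and q = 1 - maf.
    """
    p, q = maf, 1.0 - maf
    genotype_freqs = (q * q, 2.0 * p * q, p * p)
    return sum(f * f for f in genotype_freqs)


def snps_needed(maf: float, population: float) -> int:
    """Smallest number n of independent SNPs with population * match^n < 1.

    Below that threshold, fewer than one other person in the (assumed)
    population is expected to share the full multi-SNP genotype.
    """
    match = genotype_match_probability(maf)
    return math.floor(math.log(population) / -math.log(match)) + 1


if __name__ == "__main__":
    population = 7e9  # hypothetical reference population, an assumption
    for maf in (0.5, 0.3, 0.1):
        print(f"MAF {maf}: about {snps_needed(maf, population)} independent SNPs")
```

Under these simplifying assumptions, common SNPs (minor allele frequency near 0.5) need roughly two dozen markers, and rarer variants need several dozen, which is the same order of magnitude as the 30 to 80 range reported in the abstract. Real data violate the independence assumption because of LD, which is why the abstract stresses selecting independent SNPs.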

Keywords/Search Tags:

Data, SNP, Disclosure risk, Public

Related items

1	Research On The Improvement Of Government's Public Service Capacity Under The Background Of Big Data
2	Research On Risk Prevention And Control Of Public Emergencies In Shanghai Under The Background Of Big Data
3	Application Of Value-at-Risk Measures On Evaluation Credit Risk Based On Data Mining Technology
4	Research On The Risks And Countermeasures Of China's Government Data Opening
5	The Research Of Urban-disasters Composite Risk Assessment Data Management
6	Research On Truthfulness Of Data Disclosure In Patent Specification Of Pharmaceutical & Chemical Fields
7	A Study Based On The Big Data Of The Construction Of Social Stability Risk Adversarial Assessment Mode In Major Administrative Policy Decision
8	Research On Risk Early-warning Management Of Crowd Gathering In Urban Public Place In The Big-data Age
9	Research On Data Quality Of Public Consumption Budget Information Disclosure
10	Research On The Electronic Data Disclosure System In Civil Litigation