Statistical learning and data mining in biological databases

Posted on:2010-03-29

Degree:Ph.D

Type:Thesis

University:Stevens Institute of Technology

Candidate:Kim, Hyunjae Ryan

Full Text:PDF

GTID:2448390002976019

Subject:Biology

Abstract/Summary:

This thesis explores (i) the feasibility of using communication theory models to understand the protein synthesis process from gene to protein, (ii) the search for the genetic error control mechanism using error-correcting coding theory, and (iii) the detection of disease-related genetic errors using statistical learning methods on biological databases, i.e., EST (Expressed Sequence Tag) and SNP (Single Nucleotide Polymorphism) databases. Several statistical tests are proposed and tested on various biological data. These include CUSUM (Cumulative Sum) detection of abrupt changes in a stochastic process, SVD (Singular Value Decomposition) for dimensionality reduction, and HMM-SVM (Hidden Markov Model-Support Vector Machine). We propose a new disease diagnosis system based on Gene Variation Analysis. The system consists of pre-processing, similarity search and clustering by EST analysis, and disease analysis by SNP classification. Pre-processing reduces the overall noise (vector contamination, low-complexity regions, repeats) in EST data to improve the efficacy of subsequent analysis. EST clustering and assembly with the CAP3 sequence assembly program collects overlapping ESTs from the same transcript to reduce redundancy. The assembled ESTs, called consensus EST sequences, are merged based on clone-identification data to obtain the best putative gene representation. Detailed test results on several biological databases are used to draw key conclusions about the proposed mathematical analyses.
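
As an illustration of the CUSUM idea mentioned in the abstract, the following is a minimal sketch of a generic one-sided CUSUM detector for an upward shift in the mean. It is not taken from the thesis: the function name `cusum_detect` and the parameters `target_mean`, `slack`, and `threshold` are assumptions chosen for this example, and the thesis may use a different formulation.

```python
def cusum_detect(samples, target_mean, slack, threshold):
    """Return the index of the first sample at which the one-sided
    CUSUM statistic exceeds `threshold`, or None if it never does.

    Hypothetical helper for illustration only; all parameter names
    are assumptions, not the thesis's notation.
    """
    s = 0.0  # running cumulative sum, clamped at zero
    for i, x in enumerate(samples):
        # Accumulate deviations above target_mean + slack; reset to 0
        # whenever the sum would go negative.
        s = max(0.0, s + (x - target_mean - slack))
        if s > threshold:
            return i
    return None


# Usage: five in-control samples followed by a shift in the mean.
data = [0.0] * 5 + [2.0] * 5
change_index = cusum_detect(data, target_mean=0.0, slack=0.5, threshold=4.0)
```

The slack term keeps small fluctuations from accumulating, while the threshold trades detection delay against false alarms; both would need tuning for real sequence data.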

Keywords/Search Tags:

EST, Data, Biological, Gene, Statistical, Using

Related items

1	Statistical modeling of genomic data: Applications to genetic markers and gene expression
2	Clustering Algorithm Based On Biological Knowledge And Its Application On Gene Expression Data
3	Computational approaches for biological data analysis
4	Statistical hypothesis testing and application to biological data
5	Data integration methods for systems-level investigation of gene functional association networks
6	Study On SVMs-based Classification Of Gene Expression Data
7	Gene set enrichment and projection: A computational tool for knowledge discovery in transcriptome
8	Keeping pace with the times: Quantifying variation of newly emerging biological shape data
9	Inferring Biological Knowledge of Pathways from an Ontology Fingerprint-derived Gene Network
10	The Analysis Of Network Topology Structure Of Biological Data