Statistical Inference of Variables Driving Systematic Variation in High-Dimensional Biological Data

Chung, Neo Christopher Honghoon

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01rv042w30x

Title:	Statistical Inference of Variables Driving Systematic Variation in High-Dimensional Biological Data
Authors:	Chung, Neo Christopher Honghoon
Advisors:	Storey, John D
Contributors:	Quantitative Computational Biology Department
Keywords:	data; jackstraw; latent variable model; principal component analysis; resampling; sparse pca
Subjects:	Biostatistics; Bioinformatics; Statistics
Issue Date:	2014
Publisher:	Princeton, NJ : Princeton University
Abstract:	Modern genomic technologies collect an ever-increasing amount of information (e.g., gene expression and genotypes) about model organisms and humans. Systematic patterns of variation in such large-scale biological studies reflect the underlying molecular signatures of disease status, environment, and other factors, and can be quantified using principal component analysis (PCA) and related methods. For example, histological examination of tumor cells has long provided clinical classifications of cancer that are indirect, imprecise, and low-resolution. In contrast, we can infer different types of cancer directly from gene expression profiles of tumor samples. An unsolved problem in this context is how to systematically identify the observed variables that drive the systematic variation captured by PCA. My dissertation introduces a statistical framework for making rigorous use of a quantitative characterization of systematic variation. The key challenge in using latent variable estimates, such as principal components (PCs), is how to prevent overfitting. It is well established that conventional statistical tests for association that use quantities estimated from the data itself will artificially inflate statistical significance, because the data are used twice. We introduce a general resampling approach, called the jackstraw, to calculate the statistical significance of association between the observed variables and their latent variables, while automatically adjusting for how much PCA overfits the particular dataset. Furthermore, based on weights derived from the jackstraw, we developed significance-based shrinkage methods for the loadings of PCs and for high-dimensional covariance matrices, called the jackstraw weighted shrinkage. Incorporating this set of proposed methods, we investigated genetic differentiation due to the global human population structure.
Overall, the proposed statistical framework makes minimal assumptions and offers flexibility in exploring and analyzing the data, while providing a safeguard against an anti-conservative bias due to overfitting.
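The abstract names the jackstraw but does not spell out its steps. The following is a minimal Python sketch of one way such a resampling test could look, assuming a commonly described form of the procedure: permute a small number of variables (rows) to break their link to the latent structure, re-estimate the first PC from the partly permuted data, and use the association statistics of the permuted rows as an empirical null. The function names (`top_pc`, `assoc_stat`, `jackstraw_pvalues`), the parameters `s` and `B`, the F-statistic as the association measure, the restriction to the first PC, and the +1 correction in the p-value are all assumptions made for illustration. They are not taken from the dissertation or from any published software.

```python
import numpy as np


def top_pc(Y):
    """First right singular vector (length n) of the row-centered m x n matrix Y."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(Yc, full_matrices=False)
    return vt[0]


def assoc_stat(Y, pc):
    """F-statistic from regressing each row of Y on pc (intercept included)."""
    n = Y.shape[1]
    Yc = Y - Y.mean(axis=1, keepdims=True)
    pcc = pc - pc.mean()
    r2 = (Yc @ pcc) ** 2 / ((Yc ** 2).sum(axis=1) * (pcc ** 2).sum())
    r2 = np.minimum(r2, 1.0 - 1e-12)  # guard against division by zero
    return r2 * (n - 2) / (1.0 - r2)


def jackstraw_pvalues(Y, s=10, B=100, seed=0):
    """Empirical p-values for association between each row of Y and the first PC.

    s: number of rows permuted per iteration (assumed small relative to m)
    B: number of resampling iterations
    """
    rng = np.random.default_rng(seed)
    m, _ = Y.shape
    observed = assoc_stat(Y, top_pc(Y))

    null = np.empty((B, s))
    for b in range(B):
        rows = rng.choice(m, size=s, replace=False)
        Ystar = Y.copy()
        for i in rows:
            # Permuting a row across observations makes it a synthetic null variable.
            Ystar[i] = rng.permutation(Y[i])
        # The PC is re-estimated from data that include the permuted rows, so the
        # null statistics carry the same overfitting as the observed ones.
        null[b] = assoc_stat(Ystar[rows], top_pc(Ystar))

    null = np.sort(null.ravel())
    n_ge = null.size - np.searchsorted(null, observed, side="left")
    return (n_ge + 1) / (null.size + 1)  # +1 avoids p-values of exactly zero
```

In this sketch the null statistics are computed against a PC that was itself estimated from the permuted rows, which is what distinguishes the approach from a naive test that treats the estimated PC as a fixed, externally measured covariate.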
URI:	http://arks.princeton.edu/ark:/88435/dsp01rv042w30x
Alternate format:	The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog.
Type of Material:	Academic dissertations (Ph.D.)
Language:	en
Appears in Collections:	Quantitative Computational Biology

Files in This Item:

File	Description	Size	Format
Chung_princeton_0181D_11068.pdf		6.24 MB	Adobe PDF
