Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp01rv042w30x
Title: | Statistical Inference of Variables Driving Systematic Variation in High-Dimensional Biological Data |
Authors: | Chung, Neo Christopher Honghoon |
Advisors: | Storey, John D |
Contributors: | Quantitative Computational Biology Department |
Keywords: | data; jackstraw; latent variable model; principal component analysis; resampling; sparse pca |
Subjects: | Biostatistics; Bioinformatics; Statistics |
Issue Date: | 2014 |
Publisher: | Princeton, NJ : Princeton University |
Abstract: | Modern genomic technologies collect an ever-increasing amount of information (e.g., gene expression and genotypes) about model organisms and humans. Systematic patterns of variation in such large-scale biological studies reflect the underlying molecular signatures of disease status, environment, and other factors, and can be quantified using principal component analysis (PCA) and related methods. For example, histological examination of tumor cells has long provided clinical classifications of cancer that are indirect, imprecise, and low-resolution. In contrast, we can infer different types of cancer directly from gene expression profiles of cancerous tumor samples. An unsolved problem in this context is how to systematically identify the observed variables that drive the systematic variation captured by PCA. My dissertation introduces a statistical framework for rigorously using a quantitative characterization of systematic variation. The key challenge in using latent variable estimates, such as principal components (PCs), is how to prevent overfitting. It is well established that conventional statistical tests for association that use quantities estimated from the data itself artificially inflate statistical significance, because the data are used twice. We introduce a general resampling approach, called the jackstraw, to calculate the statistical significance of association between the observed variables and their latent variables, while automatically adjusting for how much PCA overfits the particular dataset. Furthermore, based on weights derived from the jackstraw, we developed significance-based shrinkage methods for the loadings of PCs and for high-dimensional covariance matrices, called the jackstraw weighted shrinkage. Incorporating this set of proposed methods, we investigated genetic differentiation due to the global human population structure.
Overall, the proposed statistical framework makes minimal assumptions and offers flexibility in exploring and analyzing the data, while providing a safeguard against an anti-conservative bias due to overfitting. |
URI: | http://arks.princeton.edu/ark:/88435/dsp01rv042w30x |
Alternate format: | The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog |
Type of Material: | Academic dissertations (Ph.D.) |
Language: | en |
Appears in Collections: | Quantitative Computational Biology |
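The resampling idea summarized in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract does not spell out the algorithm's steps, so the scheme below is an assumption about how such a test can be built. It permutes a small number `s` of randomly chosen rows, recomputes the PCs on the modified data, and collects F-statistics for those permuted rows as an empirical null. The function names `top_pcs`, `f_stats`, and `jackstraw_pvalues`, and the parameters `r`, `s`, and `B`, are hypothetical.

```python
import numpy as np


def top_pcs(Y, r):
    """Top-r right singular vectors (r x n) of the row-centered m x n matrix."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(Yc, full_matrices=False)
    return vt[:r]


def f_stats(Y, pcs):
    """Per-row F-statistic: intercept + PCs model versus intercept-only model."""
    n = Y.shape[1]
    r = pcs.shape[0]
    X = np.column_stack([np.ones(n), pcs.T])            # n x (r + 1) design
    rss0 = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)      # (r + 1) x rows
    rss1 = ((Y.T - X @ beta) ** 2).sum(axis=0)
    return ((rss0 - rss1) / r) / (rss1 / (n - r - 1))


def jackstraw_pvalues(Y, r=1, s=10, B=100, seed=0):
    """Empirical p-values for association between each row and the top-r PCs.

    Assumed scheme (not taken from the abstract): in each of B rounds,
    s rows are permuted so that they are independent of any latent
    structure, PCA is recomputed on the modified matrix, and the
    F-statistics of the permuted rows are kept as null statistics.
    Because the PCs are re-estimated with the null rows included, the
    null statistics carry the same overfitting as the observed ones.
    """
    Y = np.asarray(Y, dtype=float)
    rng = np.random.default_rng(seed)
    m, _ = Y.shape
    observed = f_stats(Y, top_pcs(Y, r))

    null = np.empty((B, s))
    for b in range(B):
        rows = rng.choice(m, size=s, replace=False)
        Ystar = Y.copy()
        for i in rows:
            Ystar[i] = rng.permutation(Y[i])            # break row i's association
        null[b] = f_stats(Ystar[rows], top_pcs(Ystar, r))

    # Fraction of null statistics at least as large as each observed one,
    # with a +1 correction (a choice made here) so no p-value is exactly 0.
    null = np.sort(null.ravel())
    n_ge = null.size - np.searchsorted(null, observed, side="left")
    return (n_ge + 1) / (null.size + 1)
```

Keeping `s` small relative to the number of rows is the design choice that matters in this sketch: the permuted rows supply null statistics, while the untouched rows keep the estimated PCs close to those of the original data.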
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Chung_princeton_0181D_11068.pdf | | 6.24 MB | Adobe PDF | View/Download |
Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.