High-dimensional methods to model biological signal in genome-wide studies

Bass, Andrew Jay

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/99999/fk41n9k098

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Storey, John D
dc.contributor.author	Bass, Andrew Jay
dc.contributor.other	Quantitative Computational Biology Department
dc.date.accessioned	2021-10-04T13:27:26Z	-
dc.date.available	2021-10-04T13:27:26Z	-
dc.date.created	2021-01-01
dc.date.issued	2021
dc.identifier.uri	http://arks.princeton.edu/ark:/99999/fk41n9k098	-
dc.description.abstract	Recent advancements in sequencing technology have substantially increased the quality and quantity of data in genomics, presenting novel analytical challenges for biological discovery. In particular, foundational ideas developed in statistics over the past century are not easily extended to these high-dimensional datasets. Therefore, creating novel methodologies to analyze this data is a key challenge faced in statistics, and more generally, biology and computational science. Here I focus on building statistical methods for genome-wide analysis that are statistically rigorous, computationally fast, and easy to implement. In particular, I develop four methods that improve statistical inference of high-dimensional biological data. The first focuses on differential expression analysis where I extend the optimal discovery procedure (ODP) to complex study designs and RNA-seq studies. I find that the extended ODP leverages shared biological signal to substantially improve the statistical power compared to other commonly used testing procedures. The second aims to model the functional relationship between sequencing depth and statistical power in RNA-seq differential expression studies. The resulting model, superSeq, accurately predicts the improvement in statistical power when sequencing additional reads in a completed study. Thus superSeq can guide researchers in choosing a sufficient sequencing depth to maximize statistical power while avoiding unnecessary sequencing costs. The third method estimates the posterior distribution of false discovery rate (FDR) quantities, such as local FDRs and q-values, using a Bayesian nonparametric approach. Specifically, I implement an approximation to these posterior distributions that is scalable to genome-wide datasets using variational inference. These estimated posterior distributions are informative in a significance analysis as they capture the uncertainty of FDR quantities in reported results. 
Finally, I develop a likelihood-based approach to estimating unobserved population structure on the canonical parameter scale. I demonstrate that this framework can flexibly capture arbitrary structure and provide accurate allele frequency estimates while being computationally fast for large population genetic studies. Therefore, this framework is useful for many applications in population genetics, such as accounting for structure in the genome-wide association testing procedure GCATest. Collectively, these four methods address problems typically encountered in a biological analysis and can thus help improve downstream inferences in high-dimensional settings.
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.publisher	Princeton, NJ : Princeton University
dc.relation.isformatof	The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: http://catalog.princeton.edu
dc.subject	False discovery rates
dc.subject	Latent variable models
dc.subject	Optimal discovery procedure
dc.subject	Population structure
dc.subject	Statistical inference
dc.subject.classification	Biostatistics
dc.title	High-dimensional methods to model biological signal in genome-wide studies
dc.type	Academic dissertations (Ph.D.)
pu.date.classyear	2021
pu.department	Quantitative Computational Biology
Appears in Collections:	Quantitative Computational Biology

Files in This Item:

File	Size	Format
Bass_princeton_0181D_13886.pdf	11.37 MB	Adobe PDF
