Skip navigation
Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/99999/fk41n9k098
Full metadata record
DC FieldValueLanguage
dc.contributor.advisorStorey, John D
dc.contributor.authorBass, Andrew Jay
dc.contributor.otherQuantitative Computational Biology Department
dc.date.accessioned2021-10-04T13:27:26Z-
dc.date.available2021-10-04T13:27:26Z-
dc.date.created2021-01-01
dc.date.issued2021
dc.identifier.urihttp://arks.princeton.edu/ark:/99999/fk41n9k098-
dc.description.abstractRecent advancements in sequencing technology have substantially increased the quality and quantity of data in genomics, presenting novel analytical challenges for biological discovery. In particular, foundational ideas developed in statistics over the past century are not easily extended to these high-dimensional datasets. Therefore, creating novel methodologies to analyze this data is a key challenge faced in statistics, and more generally, biology and computational science. Here I focus on building statistical methods for genome-wide analysis that are statistically rigorous, computationally fast, and easy to implement. In particular, I develop four methods that improve statistical inference of high-dimensional biological data. The first focuses on differential expression analysis where I extend the optimal discovery procedure (ODP) to complex study designs and RNA-seq studies. I find that the extended ODP leverages shared biological signal to substantially improve the statistical power compared to other commonly used testing procedures. The second aims to model the functional relationship between sequencing depth and statistical power in RNA-seq differential expression studies. The resulting model, superSeq, accurately predicts the improvement in statistical power when sequencing additional reads in a completed study. Thus superSeq can guide researchers in choosing a sufficient sequencing depth to maximize statistical power while avoiding unnecessary sequencing costs. The third method estimates the posterior distribution of false discovery rate (FDR) quantities, such as local FDRs and q-values, using a Bayesian nonparametric approach. Specifically, I implement an approximation to these posterior distributions that is scalable to genome-wide datasets using variational inference. These estimated posterior distributions are informative in a significance analysis as they capture the uncertainty of FDR quantities in reported results. Finally, I develop a likelihood-based approach to estimating unobserved population structure on the canonical parameter scale. I demonstrate that this framework can flexibly capture arbitrary structure and provide accurate allele frequency estimates while being computationally fast for large population genetic studies. Therefore, this framework is useful for many applications in population genetics, such as accounting for structure in the genome-wide association testing procedure GCATest. Collectively, these four methods address problems typically encountered in a biological analysis and can thus help improve downstream inferences in high-dimensional settings.
dc.format.mimetypeapplication/pdf
dc.language.isoen
dc.publisherPrinceton, NJ : Princeton University
dc.relation.isformatofThe Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: <a href=http://catalog.princeton.edu>catalog.princeton.edu</a>
dc.subjectFalse discovery rates
dc.subjectLatent variable models
dc.subjectOptimal discovery procedure
dc.subjectPopulation structure
dc.subjectStatistical inference
dc.subject.classificationBiostatistics
dc.titleHigh-dimensional methods to model biological signal in genome-wide studies
dc.typeAcademic dissertations (Ph.D.)
pu.date.classyear2021
pu.departmentQuantitative Computational Biology
Appears in Collections:Quantitative Computational Biology

Files in This Item:
File SizeFormat 
Bass_princeton_0181D_13886.pdf11.37 MBAdobe PDFView/Download


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.