Expanding the computational biologist’s toolkit: Experimental design and multi-modality in genomics

Dumitrascu, Bianca

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01rn301425t

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Engelhardt, Barbara E	-
dc.contributor.author	Dumitrascu, Bianca	-
dc.contributor.other	Quantitative Computational Biology Department	-
dc.date.accessioned	2019-11-05T16:46:50Z	-
dc.date.available	2019-11-05T16:46:50Z	-
dc.date.issued	2019	-
dc.identifier.uri	http://arks.princeton.edu/ark:/88435/dsp01rn301425t	-
dc.description.abstract	The traditional biological research pipeline consists of three steps: hypothesis generation, data collection, and data analysis. Data analysis is sometimes followed by a readjustment in hypothesis assessment, allowing for an iterative approach to scientific inquiry. With the decreasing costs of data collection in high-throughput genomics, and with the increasing number of groups pursuing interconnected questions, several experimental design challenges emerge. In this work, we address three experimental challenges motivated by advances in single-cell RNA-seq (scRNA-seq) technologies: budget allocation, marker selection, and multi-modal data aggregation. First, we develop a novel heuristic for contextual bandit problems with logistic rewards and we show a new, bandit-inspired application to iterative experimental design in multi-tissue scRNA-seq data. We present two algorithms, a Good-Toulmin-like estimator via Thompson sampling and a Pitman-Yor prior-based approach with near-optimal performance. Given a budget and modeling cell type information across tissues, both estimate how many cells to sample from each tissue, with the goal of maximizing cell type discovery across samples from multiple iterations. Second, we consider the problem of marker selection in the context of multi-modal data collection. Single-cell data analysis allows for the clustering of cells according to their genomic functionality as represented by their gene expression profiles. Such clustering can be achieved using a variety of methods and an active collaboration between experimentalists and computational groups. However, gene expression provides only one facet in depicting cell identity. Motivated by emerging imaging technologies, we present methods for selecting subsets of marker genes that preserve clusters and cluster hierarchies and that can optimize the imaging of populations of cells. Finally, we employ tools from transfer learning to propose a generative model that aggregates information across multiple biological modalities: gene expression and histological slides. The model is a novel take on deep probabilistic canonical correlation analysis that allows for the joint mapping from gene space to morphology and from morphology to gene space, along with an interpretable latent space structure, which we further evaluate through quantitative trait loci (QTL) analysis.	-
dc.language.iso	en	-
dc.publisher	Princeton, NJ : Princeton University	-
dc.relation.isformatof	The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: http://catalog.princeton.edu	-
dc.subject	experimental design	-
dc.subject	genomics	-
dc.subject	single cell sequencing	-
dc.subject	transfer learning	-
dc.subject.classification	Biostatistics	-
dc.subject.classification	Computer science	-
dc.title	Expanding the computational biologist’s toolkit: Experimental design and multi-modality in genomics	-
dc.type	Academic dissertations (Ph.D.)	-
Appears in Collections:	Quantitative Computational Biology

Files in This Item:

File	Description	Size	Format
Dumitrascu_princeton_0181D_12932.pdf		29.72 MB	Adobe PDF	View/Download
