Please use this identifier to cite or link to this item:
http://arks.princeton.edu/ark:/88435/dsp014q77fv17f
Title: | Clustering Glioblastoma Tumor Subpopulations: Different Approaches to Combining Cellular Single Nucleotide Polymorphisms and Gene Expression Data |
Authors: | Singh, Devina |
Advisors: | Raphael, Ben |
Department: | Computer Science |
Class Year: | 2019 |
Abstract: | The heterogeneity of tumor subpopulations poses significant problems for current cancer diagnosis and treatment techniques (8). Identification of specific tumors plays a crucial role in preventing drug resistance and reducing the risk of treatment failure (8). This is especially important in Glioblastoma, a brain cancer in which tumors are enriched for distinct phenotypic properties and the presence of different tumors correlates with different clinical outcomes and patient life expectancies (8). In this paper, we develop three different approaches to combining cellular single nucleotide polymorphism (SNP) and gene expression data with the aim of 1) clustering Glioblastoma cells by the individuals they come from, 2) clustering each individual's Glioblastoma cells by tumor subpopulation, and 3) examining the differential expression of genes across tumor subpopulations. We find that appending the mean of each cell's gene expression features as a column to its SNP features significantly improves both the ability to cluster cells by individual and the Adjusted Mutual Information Score of the resulting cluster labels. Furthermore, combining the SNP and gene expression data yields tumor subpopulation clusters with a much higher silhouette score than using either type of data alone. The tumor subpopulations that arise for each individual are further examined, and the BCCR, APLN and BUB1 genes, which are commonly targeted in Glioblastoma therapeutic techniques, are found to be among the ten most differentially expressed genes. Lastly, we benchmark the effect of allelic dropout on our ability to predict a) cluster labels and b) the number of clusters k to use, by simulating different dropout rates on both simulated SNP data and SNP data taken from the 1000 Genomes Project, with the aim of understanding how the different dropout rates that arise from different sequencing technologies affect prediction outcomes. |
URI: | http://arks.princeton.edu/ark:/88435/dsp014q77fv17f |
Type of Material: | Princeton University Senior Theses |
Language: | en |
Appears in Collections: | Computer Science, 1988-2020 |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
SINGH-DEVINA-THESIS.pdf | | 2.49 MB | Adobe PDF | Request a copy |
Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.