Accounting for Population Structure in Lasso Regression for Genome-Wide Association Studies

Steele, Hannah

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01pg15bd98b

Title:	Accounting for Population Structure in Lasso Regression for Genome-Wide Association Studies
Authors:	Steele, Hannah
Advisors:	Storey, John
Contributors:	Fan, Jianqing
Department:	Operations Research and Financial Engineering
Class Year:	2013
Abstract:	This thesis utilizes lasso regression to model a quantitative trait on a large number of SNP predictors simultaneously, which can provide more promising results than single-SNP analyses in genome-wide association studies. Because latent population structure within genotype data can confound the results of an association study, our model will account for sample structure via a recently developed method known as logistic factor analysis, which allows us to adjust genotype data at each SNP based on the inferred genetic structure of each individual. We then introduce the principal components of the original genotype data into the regression model in order to isolate associations between the phenotype and the population structure of each individual separately from the genetic effects of the SNPs. In all of our simulations, models formed by our regression method demonstrate more consistent predictiveness among training and test data sets than models formed by regression on the raw genotype data. For simulations in which population structure is associated with the trait through non-genetic factors, our regression method offers an improvement over regression on the raw genotypes in terms of predicting trait variation in new test data, making the prediction model more robust to sample structure at the expense of explaining less variation in the training data on which the model is formed. When sample structure is only associated with the trait through the SNPs themselves, our regression method performs slightly worse in predicting phenotypes in the presence of larger causal SNP effects on the trait, and slightly better in the presence of weaker genetic effects, since it is able to access the aggregate genetic effects through the principal components. In all of the presented simulation scenarios, our regression method tends to produce greater model shrinkage and higher precision in identifying true positives.
Extent:	110 pages
URI:	http://arks.princeton.edu/ark:/88435/dsp01pg15bd98b
Access Restrictions:	Walk-in Access. This thesis can only be viewed on computer terminals at the Mudd Manuscript Library.
Type of Material:	Princeton University Senior Theses
Language:	en_US
Appears in Collections:	Operations Research and Financial Engineering, 2000-2019
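
Illustrative sketch (not taken from the thesis): the abstract describes adjusting genotypes for inferred population structure and then fitting a lasso that also includes principal components of the original genotypes. The Python below is a minimal, hedged illustration of that general idea only. It does not implement logistic factor analysis; as a stand-in it residualizes each SNP linearly on the top principal components. The simulated two-population data, every variable name, the penalty value, and the choice to leave the principal components unpenalized are all assumptions made for this example.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, penalty_weights, n_iter=100):
    """Minimise (1/2n)||y - Xb||^2 + lam * sum_j w_j |b_j| by coordinate descent.

    A weight w_j = 0 leaves coefficient j unpenalised (assumed here for the PCs).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] < 1e-12:  # skip columns with no variation
                continue
            # Covariance of column j with the partial residual (excluding j).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * penalty_weights[j]) / col_sq[j]
            if new != beta[j]:
                resid -= X[:, j] * (new - beta[j])
                beta[j] = new
    return beta

# Hypothetical simulation: two populations with different allele frequencies.
rng = np.random.default_rng(0)
n, m, k = 200, 300, 2                        # individuals, SNPs, PCs kept
pop = rng.integers(0, 2, size=n)             # population label per individual
freq = rng.uniform(0.1, 0.9, size=(2, m))    # allele frequency per population
G = rng.binomial(2, freq[pop]).astype(float) # genotypes coded 0/1/2

# Trait: a few causal SNPs plus a non-genetic population effect plus noise.
causal = rng.choice(m, size=5, replace=False)
y = G[:, causal] @ np.full(5, 0.5) + 1.0 * pop + rng.normal(0.0, 1.0, n)
y = y - y.mean()

# Top-k principal components of the centred genotype matrix.
Gc = G - G.mean(axis=0)
U, s, Vt = np.linalg.svd(Gc, full_matrices=False)
pcs = U[:, :k] * s[:k]

# Stand-in adjustment (NOT logistic factor analysis): remove the linear
# projection of each SNP onto the PCs.
coef, *_ = np.linalg.lstsq(pcs, Gc, rcond=None)
G_adj = Gc - pcs @ coef

# Lasso on adjusted SNPs with the PCs added as unpenalised covariates.
X = np.hstack([G_adj, pcs])
weights = np.concatenate([np.ones(m), np.zeros(k)])
beta = lasso_cd(X, y, lam=0.2, penalty_weights=weights)
selected = np.flatnonzero(beta[:m])          # SNPs with nonzero coefficients
```

Comparing `selected` against `causal`, and repeating the fit with the raw centred genotypes `Gc` in place of `G_adj`, gives a rough way to explore the kind of comparison the abstract reports, under these simplified assumptions.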

Files in This Item:

File	Size	Format
Steele Hannah final thesis.pdf	2.4 MB	Adobe PDF
