Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp016h440v94d
Title: Detecting gene similarities using large-scale content-based search systems
Authors: Zhu, Qian
Advisors: Troyanskaya, Olga G
Contributors: Computer Science Department
Keywords: big data
coexpression
data integration
functional genomics
gene expression
meta-analysis
Subjects: Computer science
Bioinformatics
Issue Date: 2016
Publisher: Princeton, NJ : Princeton University
Abstract: The accumulation of public gene expression datasets offers numerous opportunities for researchers to utilize these data to characterize gene functions, understand pathway actions, and formulate hypotheses about the molecular basis of human diseases. Yet, exploring this extremely large gene expression data collection has been challenging, due to a lack of effective tools in reusing existing datasets and exploring these datasets for targeted analyses. Particularly, a critical challenge is discovering robust gene signatures of biological processes and diseases, where this depends on the ability to detect similar genes that share gene expression patterns across a large set of conditions. This thesis discusses query-based systems that are intended for large-scale integration and exploration of gene similarities, and discusses their key biological applications. In the first part, I present SEEK, a search system and a novel algorithm for searching similar (or coexpressed) genes around a multigene query of interest. The search algorithm combines coexpressed genes using a sensitive dataset weighting algorithm for effective weighting of coexpression results. Notably, through the robust search of thousands of human datasets, the retrieval of functionally co-annotated genes always improves with the inclusion of more datasets, showing the promise of the large compendia. In the second part, I extend the work of SEEK to the expression compendia of 5 commonly studied model organisms. The new system ModSEEK enables accurate searches in a wider experimental variety, and has been extensively evaluated. In the third part, I propose a novel framework for integrating and comparing coexpression context across a pair of organisms. 
I leverage both comparative genomics orthology data and functional genomics coexpression data, in an unsupervised framework to identify pairs of genes in an orthologous group that are similarly highly coexpressed to an orthologous query in two organisms. I show that such functionally similar pairs of genes can be used to improve the performance of single-organism gene retrieval searches. In the final part, I demonstrate how coexpressed genes can be used to identify important transcription factors and dysregulated processes underlying breast cancer subtypes. This part highlights the promise of coexpressed genes in providing an understanding of cancer dysregulations.
URI: http://arks.princeton.edu/ark:/88435/dsp016h440v94d
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: catalog.princeton.edu
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Computer Science

Files in This Item:
File: Zhu_princeton_0181D_11945.pdf
Size: 4.78 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.