Finding human genetic variation in whole genome expression data with applications for “missing” heritability: The GWCoGAPS algorithm, the PatternMarkers statistic, and the ProjectoR package

Abstract
Starting from a single fertilized egg, the compendium of human cells is generated via stochastic perturbations of earlier generations. Concurrently, canalization of developmental pathways limits the type and degree of variation to ensure viability; thus, it is unsurprising that deviations early in life have been linked to late-manifesting diseases. Human pluripotent stem cells (hPSCs) are a highly robust and uniquely human experimental system in which to model the sources and consequences of this variability. Further, variation in hPSCs’ transcriptomes has been directly linked to both genomic background and biases in differentiation efficiency. Taking advantage of this link between genomic background and developmental phenotypes, we developed Genome-Wide CoGAPS Analysis in Parallel Sets (GWCoGAPS), the first robust whole-genome Bayesian non-negative matrix factorization (NMF), to find conserved transcriptional signatures representative of the functional effect of human genetic variation. Time course RNA-seq data obtained from three human embryonic stem cell (ESC) lines and three human induced pluripotent stem cell (IPSC) lines in three different experimental conditions were analyzed. GWCoGAPS distinguished shared developmental trajectories from the unique transcriptional signatures of each cell line. Further analysis of these “identity” signatures found that they were predictive of lineage biases during neuronal differentiation. Additionally, lineage biases were consistent with early differences in morphogenetic phenotypes within monolayer culture, thus linking transcriptional genomic signatures to stable, quantifiable cellular features. To test whether the cell line signatures were genome-specific, we next developed the projectoR algorithm to assess a given signature’s robustness in independent data sets.
By using the identity signatures as inputs to projectoR, we were able to identify samples from the same donor genome in datasets from multiple tissues and across technical platforms, including RNA-seq results from post-mortem brain, microarrayed embryoid bodies, and publicly available datasets. The identification of signatures that define the functional rather than physical background of an individual’s genome has the potential to profoundly influence our view of human variation and disease.
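As an illustration of the general idea of carrying a learned signature into an independent dataset, the following is a minimal sketch only. It assumes the gene weights of each signature are held fixed and that the usage of each signature in the new samples is estimated by ordinary least squares. This is a swapped-in, generic technique and not the projectoR implementation or its API; every name here (`project_signatures`, `gene_weights`, `new_data`) is hypothetical, and the numbers are toy values rather than results.

```python
def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b for square a by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("signatures are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [[aug[i][j] / aug[i][i] for j in range(n, len(aug[i]))]
            for i in range(n)]


def project_signatures(gene_weights, new_data):
    """Estimate signature usage in new samples (hypothetical helper).

    gene_weights: genes x signatures matrix learned elsewhere (held fixed).
    new_data:     genes x samples matrix from an independent dataset,
                  with rows in the same gene order as gene_weights.
    Returns a signatures x samples matrix from the normal equations
    (A^T A) P = A^T D, i.e. an ordinary least-squares fit.
    """
    at = transpose(gene_weights)
    return solve(matmul(at, gene_weights), matmul(at, new_data))


# Toy example: 4 genes, 2 signatures, 3 new samples (illustrative values only).
gene_weights = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.8]]
true_usage = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0]]
new_data = matmul(gene_weights, true_usage)  # noise-free data built from the signatures

estimated = project_signatures(gene_weights, new_data)
```

In this noise-free toy case the estimated usage matches the values used to build the data; with real expression data the fit would be approximate, and a sample would be attributed to a signature only when its estimated usage stands out from that of the other samples.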
Keywords
NMF, Transcriptomics, development, human Embryonic Stem Cells, projectoR, CoGAPS, GWCoGAPS
Citation