MINIMAL PANELS OF RNA MARKERS FOR CELL TYPES USING SINGLE-CELL DATA

Embargo until
2026-05-01
Date
2022-03-18
Publisher
Johns Hopkins University
Abstract
Single-cell RNA sequencing (scRNA-seq) technologies measure the number of RNA molecules in many thousands of individual cells, a rich source of information for determining attributes of cell populations, such as cell types and the cell-to-cell variation in gene expression, that are not available from bulk RNA sequencing data [1–5]. A core challenge in the analysis of scRNA-seq data is to find “marker genes” for some class of cells, e.g., a cell type. Another challenge is to describe, let alone quantify, how the individual marker genes cooperate to determine cell labels. Most existing methods of scRNA-seq analysis operate at the univariate (single-gene) level, even though the relevant biology is often decidedly multivariate. In this thesis we introduce a method that formulates marker gene selection as a variation of the well-known “minimal set-covering problem” in combinatorial optimization. Here, the “covering” elements are genes and the objects to be covered are the cells of a sub-population with a particular label k. To draw this link between marker panels and set coverings, we binarize the raw mRNA counts into “expressed” (positive count) or “not expressed” (zero count). The resulting paradigm, based on covering a target class, differs fundamentally from most standard approaches, in which optimal panels are determined by optimizing gene weights for a fixed panel size. In addition to enabling the link to set covering, binarization facilitates the biological interpretation of marker genes and of the manner in which they characterize and discriminate among types of cells. Using the covering paradigm, we can predict cell types, or transfer marker panels to identify shared cellular processes across data sets in related biological contexts, using extremely transparent discriminants such as the number of expressed panel genes.
We illustrate this new methodology in the context of neocortical neurogenesis during mid-gestation, when the vast majority of neurons in the brain are produced. To investigate some basic properties of covering marker panels, we discuss the stability of covering marker sets as well as the gene interactions within a marker set. We then introduce some generalizations and extensions of the covering algorithm. We also present a semi-supervised learning version of marker panel construction for cases where cell labeling is incomplete or some marker genes are known. Finally, we introduce a marker panel based on pairs of genes that characterizes the transitions between cell states.
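As an illustration of the covering idea described in the abstract, the following is a minimal sketch and not the thesis's algorithm: it replaces the exact minimal set-covering formulation with a simple greedy heuristic. The function names (binarize, greedy_cover), the depth parameter, the off-target penalty with its eps constant, and the toy count matrix are all assumptions made for this example. Counts are binarized into expressed / not expressed, genes are picked until every cell with the target label expresses at least depth panel genes, and cells are then scored by the number of expressed panel genes.

```python
import numpy as np


def binarize(counts):
    """Map raw counts to 1 (expressed, count > 0) or 0 (not expressed)."""
    return (np.asarray(counts) > 0).astype(np.int8)


def greedy_cover(expr, labels, target, depth=1, eps=0.05):
    """Greedy heuristic for a covering panel (illustrative assumption only).

    expr   : binarized cells x genes matrix
    labels : one label per cell
    target : the label whose cells must be covered
    depth  : each target cell should express at least this many panel genes
    eps    : hypothetical smoothing constant for the off-target penalty
    """
    expr = np.asarray(expr)
    in_class = np.asarray(labels) == target
    on, off = expr[in_class], expr[~in_class]
    # Fraction of non-target cells expressing each gene (lower is better).
    off_rate = off.mean(axis=0) if off.shape[0] else np.zeros(expr.shape[1])
    need = np.full(on.shape[0], depth)          # coverage still required per cell
    available = np.ones(expr.shape[1], dtype=bool)
    panel = []
    while (need > 0).any():
        # Number of still-uncovered target cells that each gene would cover.
        gain = on[need > 0].sum(axis=0).astype(float)
        gain[~available] = 0.0
        if gain.max() == 0:                     # no remaining gene helps; stop
            break
        g = int(np.argmax(gain / (off_rate + eps)))
        panel.append(g)
        available[g] = False
        need = need - on[:, g]
    return panel


# Toy data (made up for this sketch): 7 cells x 4 genes.
counts = np.array([
    [3, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 5, 1, 0],
    [0, 2, 4, 0],
    [0, 0, 1, 7],
    [0, 0, 3, 2],
    [0, 0, 1, 1],
])
labels = np.array(["A", "A", "A", "A", "B", "B", "B"])

expr = binarize(counts)
panel = greedy_cover(expr, labels, target="A", depth=1)
# Transparent discriminant: number of expressed panel genes per cell.
hits = expr[:, panel].sum(axis=1)
predicted_A = hits >= 1
print(panel, hits.tolist(), predicted_A.tolist())
```

In this toy example the gene expressed in every cell covers all target cells by itself, but the off-target penalty steers the heuristic toward genes that are silent outside the target class. An exact minimal cover would instead require an integer-programming or similar solver, which this sketch does not attempt.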
Keywords
Marker, Single-cell, Covering, Optimization, Machine Learning