Cellular identity between generations of developing cells is propagated through the epigenome particularly via the accessible parts of the chromatin. It is now possible to measure chromatin accessibility at single-cell resolution using single-cell assay for transposase accessible chromatin (scATAC-seq), which can reveal the regulatory variation behind the phenotypic variation. However, single-cell chromatin accessibility data are sparse, binary, and high dimensional, leading to unique computational challenges. To overcome these difficulties, we developed PRISM, a computational workflow that quantifies cell-to-cell chromatin accessibility variation while controlling for technical biases. PRISM is a novel multidimensional scaling-based method using angular cosine distance metrics coupled with distance from the spatial centroid. PRISM takes differences in accessibility at each genomic region between single cells into account. Using data generated in our lab and publicly available, we showed that PRISM outperforms an existing algorithm, which relies on the aggregate of signal across a set of genomic regions. PRISM showed robustness to noise in cells with low coverage for measuring chromatin accessibility. Our approach revealed the previously undetected accessibility variation where accessible sites differ between cells but the total number of accessible sites is constant. We also showed that PRISM, but not an existing algorithm, can find suppressed heterogeneity of accessibility at CTCF binding sites. Our updated approach uncovers new biological results with profound implications on the cellular heterogeneity of chromatin architecture.
One of the great mysteries in developmental biology is how the same genome can be read by
cellular machinery, giving rise to the plethora of different cell types required for eukaryotic
life
Despite the inherently repressive state of the chromatin
making most of the genome unreadable, in every cell type, a
small segment is organized into cis-regulatory modules, known
as promoters and enhancers, controlling the transcriptional
output of the cell
Because the conventional population-average epigenomic measurements rely on thousands to millions of cells, it is unclear if transcription factors act similarly across individual cells.
Although it is appreciated that gene expression is a fundamentally
stochastic process
The advent of single-cell genomics, which has enabled
unbiased profiling of the genetic and molecular states of
evergrowing number of individual cells, paved the way to address
these fundamental questions
The current leading method for measuring cell-to-cell
variation from scATAC-seq measurements, chromVAR
It then measures how much total accessibility differs from what
is expected by calculating a technical bias-corrected Z-score for
each cell. The standard deviation of these Z-scores constitutes
the cell-to-cell variation in chromatin accessibility. However, an
ensemble of cells can have similar total accessibility (i.e., number
of accessible sites in a cell) yet be accessible at completely different
regulatory elements. Thus, chromVAR is poorly equipped to
handle chromatin accessibility variation in certain cases as stated
in the original paper
Here, we present PRISM, an R package for calculating cell-tocell variation in chromatin accessibility using cosine similarity.
Instead of measuring variation in total accessibility between two cells, PRISM measures whether two cells are accessible at the same set of regulatory elements using angular cosine distance. It then exploits principal coordinate analysis to measure how much each cell differs from the group norm for chromatin accessibility.
Here, we demonstrate that PRISM outperforms chromVAR on various simulations when total accessibility is not varied, when signal is low, or when technical noise is high, in addition to real biological data. Together, PRISM can be used to construct a global and high-resolution view of epigenomic regulation in development and disease.
PRISM requires the scATAC-seq data to be binarized such that accessible genomic regions are scored 1 and inaccessible ones are scored 0. Every cell is plotted as a vector with coordinates given by the binary representations of accessibility at a pre-defined set of genomic regions unified by a common characteristic – for example, binding events of a transcription factor measured by ChIP-seq or sequences enriched for transcription factor binding motifs (Figure 1). We then evaluate the difference in chromatin accessibility between a pair of cells at these genomic regions by calculating the cosine distance between every two vectors. More specifically, where A and B are binary accessibility vectors, the angular cosine distance is calculated by Equation (1), which can be seen as taking the angle between two vectors and dividing it by a normalizing factor of π/2:
Cosine distance (A, B) = (1) cos−1 ||A|A| ·B||B|| π/2
We calculate how different every cell is from the group norm and center the cosine distance matrix by subtracting column and row means while adding the overall mean. We then spectrally decompose the centered cosine distance matrix to define principal coordinates, mapping vectors of chromatin accessibility to full principal coordinate space, and identifying the vectors’ centroid (Figure 1). Finally, the average distance of all vectors to the centroid is used by PRISM as a measure of variability of chromatin accessibility among individual cells at the genomic regions of interest.
The above formulation Equation (1) can be used to compare the impacts of different transcription factors in terms of inducing or reducing stochasticity of chromatin accessibility across individual cells. However, inherent differences among genomic subsets such as the GC content or average accessibility can also introduce technical variations. To overcome such limitations and normalize for the GC contents, we calculate variability as described in Equation (1) at a set of “normalizing background peaks†with GC contents comparable to the genomic set of interest. The normalizing background peaks are selected randomly from the set of genomic regions lacking the feature of interest based on their GC contents. The process of background peak selection is repeated for a user-defined number of times and the mean of variability for multiple sets of background peaks is used as a normalization factor. All background peaks are also selected to be within ±0.01 of the overall mean accessibility of the original peaks.
We controlled for technical biases as follows: Bias corrected variation =
Original variation Mean (background variations) (2)
To measure accessibility variation beyond GC content, we calculate accessibility variation for a user-defined number of randomly selected subsamples of peaks lacking the genomic feature of interest, for example, regions not bound by a transcription factor while correcting for technical biases in the control peak set. Each subsample has the same number of peaks as the original peak set. The negative control variations are further used to generate Z-scores and p-values for the observed variation. The final variation is calculated by PRISM as:
Final variation =
Bias corrected variation Mean (biased corrected negative control variations) (3)
PRISM’s variability equal to 1 implies that the feature of interest for example a transcription factor or enrichment of a certain motif is associated with no more variation than negative control. PRISM’s variability below 1 implies that the feature of interest is associated with less variation than negative control, and variability above 1 implies greater variation than negative control. It is worth to point out that the idea behind PRISM is to assess the difference between cells at transcription factor-bound sites compared with the unbound regions (or regions lacking the genomic feature of interest). However, the two sets are not required to have identical GC-contents. To address this issue, the transcription factor-binding peak set and negative control peak set are each normalized for their GC-contents, which will require separate background peak selections.
We synthesized heterogeneity in chromatin accessibility across single cells following two models. In model 1, we assumed comparable levels of total accessibility between individual cells at a set of genomic regions. To recapitulate conditions favorable to chromVAR’s assumptions, we developed model 2 and incorporated differences in total accessibility among cells. To model the genomic regions of interest, we randomly selected 500 (or 1000) peaks from ∼50,000 open chromatin regions. Relying on the central limit theorem, the randomly selected original and GC-matched peaks in model 1 comprise comparable levels of average accessibility across individual cells. Simulated heterogeneity is then created by titrating the number of cells with the original or GC-matched peaks. This approach leads to a mixture of cells containing varying percentages of original versus GC-matched peaks (Figure 2A). The rationale to mix peaks rather than cells of different types is to prevent confounding factors such as differences in cell lysis affecting our assessment of cell-to-cell variability. We expect that when cells contain only original peaks (or GC-matched peaks), the variability should be at a minimum. In contrast, when half of cells contain original peaks, the variability should be at a maximum. Based on how mixing of cells is controlled, we expect an inverse-U or concave shape for our measure of variability, peaking at around 50–50 mixed peaks.
We further synthesize heterogeneous data using model 2 relying on the same approach except that GC-matched peaks are drawn from peaks with greater than 75th percentile in mean accessibility compared to all other peaks. In other words, model 2 assumes the presence of a significant difference in total accessibility between cells underscoring a larger degree of cellular heterogeneity. We further augmented each model to have subtypes A and B such that subtype A utilized cells with highest accessibility in contrast with subtype B that relied on lowest accessible cells. While subtype A is the most robust measurement and reflects an ideal sequencing coverage, subtype B tests the method’s sensitivity to technical noise. Together, model 1 is built such that heterogeneity is not caused by differences in total accessibility between cells, simulating cases where an ensemble of cells contains comparable total accessibility levels across the genome but accessibility can occur at completely different regulatory elements (Supplementary Figure S1). On the other hand, model 2 aims to simulate cases where a major difference exists in the total accessibility of cells at genomic regions of interest (Supplementary Figure S1).
In order to see how well a simulation result fits an
inverseU shape (concave curve), a test of concavity was designed.
The metric, referred to as U statistic, measures the difference
between variability of successive proportions of cells containing
the original peaks. Then the Spearman correlation of this
ordering with the decreasing numbers 49 through 1 was
calculated. This procedure checks if the derivative (slope) is
continuously decreasing. Using this definition, test of concavity
values (U statistic) close to 1 are ideal. We also measured each
algorithm’s mean squared error (MSE) from its local polynomial
regression (LOESS) curve. This assessed the degree to which an
algorithm was susceptible to random fluctuations or noise. We
further compared the values of variability across the simulated
heterogeneous data as measured by chromVAR
The raw files for scATAC-seq data were generated in mouse
forebrain tissue at E11.5 (GSM2668117)
We introduce PRISM for estimating cell-to-cell variation at the level of chromatin accessibility. Unlike chromVAR, PRISM infers cell-to-cell variability by capturing differences in accessibility of every genomic region of interest across individual cells. It is possible that a comparable total number of accessible sites occur in a cell, yet accessibility happens at non-overlapping regulatory elements (Supplementary Figure S1). Our algorithm takes binarized read counts of individual cells and calculates variation of open chromatin across single cells at DNA sequences unified by an annotation such as transcription factor binding events or enrichment of motifs (i.e., a set of peaks). PRISM then represents each cell in a high-dimensional space as a vector (Figure 1). Each coordinate in the vector corresponds to accessibility of a regulatory element. To measure how different any two cells are at a given group of genomic regions, PRISM calculates the pairwise angular cosine distance or the angle between two vectors (Figure 1). The pair-wise differences between cells are then used to measure how different every cell is from the group “averageâ€: Each cell is plotted as a point in principal coordinate space such that the Euclidean distance between two points (cells) is equal to the original angular cosine distance between two vectors. PRISM then finds the centroid of this unique point configuration. Every point’s distance from the centroid is calculated, and then these distances are averaged. This can be seen as each cell’s distance from the group norm for chromatin accessibility and constitutes our measure of cell-to-cell variation prior to technical bias correction (Figure 1A). Our proposed method scales linearly with heterogeneity, in contrast to average angular cosine distance.
To account for technical biases, a user-defined number of
“background†sets of peaks are identified for every set of genomic
regions. The background peak sets are matched for peak number,
overall mean accessibility, and peak-for-peak GC content to
the original peak set. Using the procedure outlined above,
accessibility variation for each background peak set is calculated.
The variations of the background sets are then averaged. To
obtain the bias-corrected variation, the variation of the original
peak set is divided by the average variation of the background
peak set with matching mean accessibility and GC content. After
correcting for technical biases, a negative control is developed:
a user-defined number of sets of peaks are randomly selected,
each with equal peak number to the original peak set. The
bias-corrected variation of each negative control peak set is
calculated. Then the bias-corrected variation of the original peak
set is divided by the average of the user-defined number of
negative control peak sets. This measures cell-to-cell variation in
chromatin accessibility in units of background noise. A calculated
variation equal to 1 implies that a chromatin feature is associated
with equal variation to background noise. Together, unlike the
previously proposed method chromVAR
To benchmark the performance of PRISM and chromVAR, we
simulated heterogeneity in single-cell chromatin accessibility
data. We exploited three independent publicly available datasets
generated by two leading experimental protocols, that is, the
automated microfluidic platform
We calculated variability of chromatin accessibility at the
single-cell level across the simulated sets of peaks using PRISM
and chromVAR. In data generated by combinatorial indexing
technique in thousands of cells in mouse forebrain tissue
We next compared the predictions of PRISM and chromVAR
on the effect of 139 transcription factors using real transcription
factor binding data. Assessing cell-to-cell variability using the
two methods revealed different predictions for 17 transcription
factors in K562 cell line (Figure 3A). Among transcription
factors that were reported differently between two methods,
chromVAR but not PRISM inferred that CTCF binding events
in K562 cell line could increase cell-to-cell variability at the
chromatin accessibility level (Figure 3B). However, numerous
studies mapping the genome-wide binding events of CTCF
at the population level across a wide variety of tissues
have shown the cell-type-invariant binding of this protein
acting as an insulator, supporting PRISM’s prediction
Since the discovery of the cell by Robert Hooke in 1665,
biologists and pathologists have been fascinated by the diversity
of cell types in our body. With the advent of molecular cell
biology, methods such as fluorescent protein reporters and
single-molecule detection of RNA or DNA have been developed
for measuring properties and functions of single cells at
increasing resolution
To evaluate the performance of PRISM and compare it with chromVAR in assessing variability of chromatin accessibility across single cells, we devised a computational experiment and generated heterogeneity in chromatin accessibility by shuffling scATACseq data. Our framework generated simulated scATAC data from the real measurements with various degrees of heterogeneity, which were further used to evaluate the performance of PRISM. While variability evaluated by PRISM increased as heterogeneity increased, chromVAR failed to perform in cases where heterogeneity existed between peaks and across cells as a result of aggregating signal across all peaks (in particular model 1). We further showed that PRISM but not chromVAR can predict CTCF binding events to associate with low level of variability across individual cells.
We have shown that our method, named PRISM, is able to overcome the obstacles in analyzing single-cell ATAC-seq, caused by the inherent nature of such assays, and provide a robust framework that assesses the effects of transcription factor binding on the chromatin accessibility, at the single-cell level. When compared to the state-of-the-art method, named chromVAR, we have shown that PRISM facilitated the discovery of lineage-determining transcription factors with the ability to preserve low variability of chromatin accessibility at the singlecell level.
The datasets GSE99159 for this study can be found in the NCBI GEO. PRISM is an open source framework, freely accessible through Github (https://github.com/VahediLab/PRISM).
All authors contributed extensively to the work presented in this paper. GV conceived the project, administered the analyses, and wrote the manuscript. SC developed and implemented the method. GG and JJ provided technical support and conceptual advice.
This study was funded by NIH, Grant No. K22AI112570, and Sloan Foundation grants to GV.
We thank the sequencing facility at the Penn Vet Center for HostMicrobial Interactions. We also thank members of the Vahedi and Faryabi labs particularly Dr. R. Babak Faryabi as well as Gregory Schwartz and Yeqiao Zhou for discussions.
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene. 2018.00319/full#supplementary-material FIGURE S1 | Chromatin accessibility variation can exist even if total accessibility is the same between cells. Measurements of an ensemble of cells can show similar total level of accessibility within cells (i.e., comparable number of accessible sites in a cell) yet accessibility can occur at non-overlapping regulatory elements. In this hypothetical case, red represents an accessible sequence in a given cell, and blue represents an inaccessible sequence. Each cell has three accessible sequences (red) total, or a total accessibility of 3. Thus, the total accessibility is the same between cells. But each cell is accessible at completely different DNA sequences, which may have different functions. Existing algorithms, such as chromVAR and Buenrostro et al.’s (2015) workflows, are built on the standard deviation of total accessibility, hence they cannot measure variation in these cases.
FIGURE S2 | PRISM outperforms chromVAR for multiple values of peak number for background peaks. PRIMS outperforms chromVAR when 40 or 50 background peaks are selected in calculating variability in mouse forebrain tissue, mouse double-positive T cells and human AML cells.
FIGURE S3 | PRISM outperforms chromVAR under subtype B when cells with low chromatin accessibility are selected. PRISM outperforms chromVAR under subtype B when cells with low chromatin accessibility are selected in mouse double-positive T cells and human AML cells. Conflict of Interest Statement: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Copyright © 2018 Cai, Georgakilas, Johnson and Vahedi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.