May 10.1101/2020.11.09.373613 Modeling fragment counts improves single-cell ATAC-seq analysis Laura D. Martens 0 2 3 David S. Fischer 3 4 Fabian J. Theis fabian.theis@helmholtz-muenchen.de 1 2 3 Julien Gagneur gagneur@in.tum.de 0 2 3 Department of Informatics, Technical University of Munich , Garching , Germany Department of Mathematics, Technical University of Munich , Garching , Germany Helmholtz Association - Munich School for Data Science (MUDS) , Munich , Germany Institute of Computational Biology, Helmholtz Center Munich , Neuherberg , Germany TUM School of Life Sciences Weihenstephan, Technical University of Munich , 85354 Freising , Germany 2022 4 2022

Single-cell ATAC-sequencing (scATAC-seq) coverage in regulatory regions is typically binarized as an indicator of open chromatin. However, the implications of scATAC-seq data binarization have never been assessed. Here, we show that the goodness-of-fit of existing models and their applications including clustering, cell type identification, and batch integration, are improved by a quantitative treatment of the fragment counts. These results have immediate implications for scATAC-seq analysis.

-

34 35 36 37 38 39 40 a loss function that captures the goodness-of-fit of a model to the data. Therefore, the data representation and the choice of the loss function are of prime importance for all those models. Due to overall data sparsity, and also because of a conceptualization of chromatin accessibility as a binary state, it is customary to binarize the read count matrix prior to machine learning modeling of scATAC-seq data4,739. While handling the data quantitatively has been done for some methods3,5,10,11, there is no systematic investigation of the advantage or disadvantages of binarizing the data.

Here we study the effects of binarization versus count-based models on scATAC-seq data modeling tasks. We based our analysis on two publicly available datasets with cell-type annotation and several batches: (i) a hematopoiesis dataset from Satpathy et al.12 consisting of flow-sorted and unsorted samples from bone marrow and blood, and (ii) a joint single-cell RNAseq and ATAC-seq bone marrow dataset from the NeurIPS 2021 Competition Track13. We first considered the proportion of peaks above the typical binarization threshold of one read. A substantial number of non-zero peaks had counts of two or even more than two reads (12% and 14.5% for the NeurIPS and Satpathy et al. dataset, respectively, Fig. 1c and Extended Data Fig. 1a). Interestingly, the majority of non-zero peaks had two reads rather than one. When studying the distribution of counts in more detail, we found that it had more peaks with even than uneven counts (Fig. 1c and Extended Data Fig. 1a). This pattern can be explained as an artifact of the count aggregation strategy. Within the 10X Genomics ATAC pipeline14 and also other pipelines3,5, the reads or fragment ends instead of the fragments themselves are counted because they mark transposase insertion events. Since scATAC-seq creates paired-end reads, this will mostly result in even counts, whereas uneven counts are only generated if one read pair lies outside the peak region (Fig. 1a,b).

Loss functions are typically based on the choice of a parametric distribution of the data. However, the alternating pattern of odd and even read counts does not suggest a straightforward parameterization. In contrast, fragment counts - estimated here from the existing pipelines by simply aggregating odd and neighboring even reads - showed a regular monotonic decay (Fig. 1d and Extended Data Fig. 1b; Online Methods). When modeling counts, the simplest description is given by the Poisson distribution, for which mean equals variance. We found that the variance of read counts for each region across cells was approximately twice the mean (Fig. 1e and Extended Data Fig. 1c), violating the Poisson assumption and consistent with the fact that reads typically come by two. In contrast, the mean-variance relationship of fragment counts was broadly consistent with a Poisson distribution (Fig. 1f and Extended Data Fig. 1d) across our two data sets and the entire dynamic range.

Altogether, these results have two implications: First, scATAC-data contains information beyond a binary accessibility state. Second, fragment counts, but not read counts, can be approximately modeled with the Poisson distribution. human gene RERE showing multiple insertions in a single cell. The tracks show, from top to bottom, the coverage of one batch used for peak calling, the aligned read pairs of a single cell, the peak region, and genome annotation. The peak region overlaps with five reads and three fragments. c) Read count distribution on the entire NeurIPS dataset. The y-axis is on a logarithmic scale. The striking odd/even pattern in read count distribution reflects that reads come in pairs and suggests that fragment counts, rather than reads, should be modeled. Inset: A pie chart showing the percentage of all non-zero peaks with 1, 2, or more than 2 reads. d) Distribution of the approximated fragment count does not show an even/odd pattern. e) Variance of read counts across cells against mean read counts. Each dot represents one peak region. When fragment ends (reads) are counted, the variance of read counts is abo ut twice the mean (grey dotted line), which is not consistent wi th a Poisson distribution (solid grey line). The y- axis and the x-axis 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 are shown on a logarithmic scale. f) Variance of fragment counts is approximately equal to the fragment count mean, consistent with a Poisson distribution (solid grey line).

Next, we investigated the effects of modeling the fragment counts rather than a binarized signal on downstream applications. We first investigated latent space learning of sc-ATAC-seq data, a key first step for cell type identification, batch integration and trajectory learning. To this end, we considered as baseline model PeakVI model, a state-of-the-art deep generative model for scATAC-data8. PeakVI models binarized ATAC data by learning a Bernoulli probability that a peak in a given cell is accessible. PeakVI corrects for cell-specific effects and region biases by learning cell-specific and region-specific factors. We adapted PeakVI to model Poisson-distributed data in a model we call Poisson VAE (Poisson variational autoencoder). We used the same architecture as PeakVI up to the last layer, where we modeled a Poisson rate parameter for a given peak region in a cell (Online Methods). The total number of fragments per cell varies drastically across cells. We included total fragment count as a pre-computed offset in the loss instead of learning a cell-specific factor as done in PeakVI. Similarly, we tested the effect of using the proportion of non-zeroes in the binary case (Binary VAE) by including it as an offset in the binary loss function (Online Methods).

We first assessed how well the models fitted the two studies. As a common ground, we benchmarked the models for predicting the presence of at least one read. As scores, we used the predicted probability of a region to be open for the binary models and used the predicted probability of non-zero counts for the quantitative models. In this setup, Poisson VAE significantly outperformed PeakVI in reconstructing binarized counts as measured by average precision (NeurIPS: adjusted P = 1.2 x 10 -7 and Satpathy et al.: adjusted P = 6.9 x 10 -8) (Fig. 2a). Most of the performance boost could be attributed to using the observed total fragment counts rather than predicting it since the binary model (Binary VAE) also showed significantly better reconstruction than PeakVI. In addition, the quantitative model Poisson VAE significantly outperformed the binary model Binary VAE (NeurIPS: adjusted P = 1.3e-06 and Satpathy et al.: adjusted P = 7.1e03).

We next investigated the implications of these improved fits for downstream applications, starting with batch integration. Since a good integration should reduce technical artifacts between batches while retaining biological variation, we evaluated several integration metrics as previously described15. Poisson VAE consistently improved batch integration while better retaining biological signals (Fig. 2b-d and Extended Data Fig. 2 and 3). For example, Poisson VAE better recovers the isolated label ID2-hi myeloid progenitors (only present in 2 of the 13 batches) as measured 121 122 123 124 125 126 127 128 129 130 131 132 133 by the improved isolated label F1 and isolated label silhouette. Lastly, we assessed the effect of binarization when predicting single-cell ATAC-seq data from matched scRNA-seq data. To this end, we compared to the winning algorithm of the NeurIPS Competition 2021 <Multimodal Single-Cell Data Integration=. We implemented two encoderdecoder models differing only by the loss, binary or Poisson, and in both using the total fragment counts as a given offset (Online Methods). The results are consistent with the above described ATAC-to-ATAC modeling. First, the Poisson-loss model performs significantly better than the Binary-loss model even though it is evaluated for binarized predictions (reconstruction of binarized scATAC-seq data on 10,000 random peaks as measured by root mean squared error, Fig. 2e). Second, a substantial performance boost is reached by using the observed ATAC-seq total fragment counts rather than predicting them from RNA-seq, as both our models largely outperformed the NeurIPS competition winner model (LS_labs), which however only had access to total counts of RNA-seq (Fig. 2e).

In conclusion, we found that scATAC-seq data can be treated quantitatively and that useful information is lost through binarization of the counts. Modeling DNA accessibility in single nuclei quantitatively, rather than as a binary state, is consistent with the fact that to access DNA, transcription factors, just like transposases, have to diffuse through the nucleus, likely reaching distinct chromosome territories and compartments with various efficiency (Fig. 2f). Also, a single genomic position in diploid cells is not necessarily open or close on both alleles synchronously. Moreover, DNA accessibility is not a static but a highly dynamic state. In fact, nucleosome turnover rates16 are in the order of magnitude of the incubation duration of the transposition reaction1. In parallel to our work and in agreement with these findings, a recent preprint showcased that gene expression increases with the number of fragments in accessible regions in single nuclei17. We showed that fragment counts, but not read counts, can be modeled with the Poisson distribution. We found the scATAC-seq total fragment counts to be of great importance for both the reconstruction from scATAC data and the prediction of scATAC data from scRNAseq data. Therefore, we recommend incorporating total fragment counts in models of scATACseq data - just as it is typically done for scRNA-seq data - as well as updating corresponding pipelines and benchmarks/competitions. These findings have immediate practical implications for modeling scATAC-seq data since using a Poisson loss over a binary loss comes at no noticeable increase in algorithmic complexity and computing time. Applications of this model to other settings such as multiome integration could be the subject of future work. 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175

Methods Poisson VAE model

Let be a fragment count matrix over with a variational autoencoder: cells and peak regions. We model the counts

The encoder

encode the parameters of a multivariate random variable . is a neural network that maps the latent representation concatenated to the batch annotation back to the dimension of peaks. refers to the log-transformed total fragment counts per cell . captures a region-specific bias such as the mean fragment count or peak length and is learned directly.

is constrained to encode the mean distribution of reads over all peaks by using a softmax activation in the last layer. This means that The binarized signal was modeled as follows: We included the proportion of non-zeroes by modeling:

Here, is the logit function. This way is equal to the mean accessibility of the cell for .

as comparable as possible to PeakVI as implemented in scvi-tools8,18 (v.0.15.0), we use the same architecture. Specifically, these functions consist of two repeated blocks of fully connected neural networks with a fixed number of hidden dimensions set to the square root of the number of input dimensions, a dropout layer, a layer-norm layer, and leakyReLU activation. 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238

The last layer in the encoder maps to a defined number of latent dimensions

Adjusting the model to RNA input

To compare the Poisson loss to binary loss in the NeurIPS challenge, we adapted the abovedescribed Poisson VAE and binary VAE models to take scRNA-seq measurements as input. Specifically, they take a matched cells times genes matrix as input with being the total number of genes. In this case, the latent representation of a cell was derived from the RNA expression vector:

Training procedure

We used the default PeakVI training procedure with a learning rate of 0.0001, weight decay of 0.001, and minibatch size of 128 and used early stopping on the validation reconstruction loss. Except for the NeurIPS challenge that provided a test set, we changed the composition of random training, validation, and test set to , and respectively. We computed all evaluation metrics on the left-out test cells.

Input data and preprocessing Dataset Processing

NeurIPS Dataset The multiome dataset from the NeurIPS 2021 challenge13 was downloaded from the AWS bucket s3://openproblems-bio/public/. The complete processed dataset is also available on GEO (Accession GSE194122). We did not perform any additional filtering of the data. BAM files for visualization purposes were downloaded from GEO.

Hematopoiesis dataset

The hematopoiesis dataset12 was downloaded from GEO (Accession GSE129785). Specifically, the processed count matrix and metadata files: scATAC-Hematopoiesis-All.cell-barcodes.txt.gz, scATAC-Hematopoiesis-All.mtx.gz, scATAC-Hematopoiesis-All.peaks.txt.gz. We then filtered the genomic region to only those that are detected in at least of the cells in the sample, reducing the data from 571,400 to 134,104 peaks.

Fragment computation

The standard 10X protocol for generating the cell-peaks matrix is to count the fragment ends. As a simple way to estimate fragment counts, we rounded all uneven counts to the next highest 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 even number and halved the resulting read counts.

Evaluation Reconstruction metrics

The reconstruction metrics were calculated on the binarized matrix. Poisson rate parameters were transformed to a Bernoulli probability by computing the probability of getting one or more fragments in a peak for a given cell:

Average precision

As our reconstruction task is highly imbalanced (only a small fraction of all peaks are accessible), we used the average precision score as implemented in scikit-learn (v.1.0.2) to evaluate the reconstruction. Average precision estimates the area under the precision-recall curve. The score is between 0 and 1 with 1 being the best.

RMSE As in the NeurIPS challenge, we used the root mean squared error (RMSE) on binarized count matrices to benchmark the RNA-to-ATAC prediction task.

Integration metrics

We used the scib15 (v.1.0.0) implementation for computing the integration metrics on the latent embedding of the cells. We used all available metrics using default parameters but excluded metrics that were specifically developed for scRNA-seq datasets (highly variable genes score, cell cycle score) and kBET due to its long run time. The trajectory score was only run for the NeurIPS dataset, which had a precomputed ATAC trajectory.

Statistics scATAC-seq reconstruction

The robustness of the reconstruction metrics was evaluated using five-fold cross-validation. We then employed a two-sided paired t-test to test for significant differences between models and corrected for multiple testing using Benjamini-Hochberg.

NeurIPS competition evaluation

The models in the original NeurIPS competition were evaluated on a single set of 10,000 randomly chosen peaks. To be able to compute statistics on our improvement, we extended this set by ten additional sets of randomly chosen peaks, excluding the original peak set. Statistical 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 significance was evaluated using a two-sided paired t-test, corrected for multiple testing with Benjamini-Hochberg.

Benchmarking Methods Hyperparameter optimization

All models were run using the default PeakVI parameters. For the ATAC-ATAC task, we optimized the number of latent dimensions on the validation set for each dataset and model on reconstructing the binary accessibility matrix as measured by average precision. The used range was from 10 to 100 in increments of 10. In the RNA-ATAC prediction task, we optimized the latent dimensions on 10% of the training cells. Since we only trained on 10,000 peaks at a time, the used range was smaller, from 10 to 50 in increments of 10.

NeurIPS competition winner - LS_labs

We reran the winning model LS_labs of the Single NeurIPS Competition 2021 <Multimodal SingleCell Data Integration= in the category <Predict modality GEX to ATAC= using the published script that we downloaded from

GitHub https://github.com/openproblemsbio/neurips2021_multimodal_topmethods. In particular, we used the script.py file.
Data availability Code availability

Raw published data for the NeurIPS13 and Satpathy et al.12 examples are available from the Gene Expression Omnibus under accession codes GSE194122 and GSE129785, respectively. All models, code, and notebooks to reproduce our analysis and figures are available on https://github.com/theislab/scatac_poisson including documentation and examples.

Author contributions

L.D.M conducted the analysis and implemented the models. J.G. and F.J.T. conceived and supervised the project with the help of D.S.F. All authors wrote and contributed to the manuscript. The authors read and approved the final manuscript.

Competing interests

F.J.T. consults for Immunai Inc., Singularity Bio B.V., CytoReason Ltd and Omniscope Ltd, and has ownership interest in Dermagnostix GmbH and Cellarity. The remaining authors declare no competing interests.

Corresponding authors Acknowledgments

Correspondence to Julien Gagneur or Fabian J. Theis.

We would like to thank Ignacio Ibarra, Alexander Karollus, and Pedro Tomaz da Silva for valuable feedback and reading of the manuscript. L.D.M acknowledges support by the Helmholtz Association under the joint research school Munich School for Data Science - MUDS= and J.G by the Deutsche Forschungsgemeinschaft (SFB/Transregio TRR267). Fig. 1a is adapted from <ATAC Sequencing=, by BioRender.com (2022). Fig. 2f is adapted from <Regulation of Transcription in Eukaryotic Cells=, by BioRender.com (2022). Retrieved from https://app.biorender.com/biorender-templates. 2. Buenrostro, J., Wu, B., Chang, H. & Greenleaf, W. ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Curr. Protoc. Mol. Biol. Ed. Frederick M Ausubel Al 109, 21.29.1-21.29.9 (2015). 4. Fang, R. et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat. 5. Granja, J. M. et al. ArchR is a scalable software package for integrative single-cell chromatin Commun. 12, 1337 (2021). accessibility analysis. Nat. Genet. 53, 4033411 (2021).

Nat. Commun. 12, 6386 (2021). 6. Li, Z. et al. Chromatin-accessibility estimation from single-cell ATAC-seq data with scOpen. 7. Bravo González-Blas, C. et al. cisTopic: cis-regulatory topic modeling on single-cell ATAC9. Xiong, L. et al. SCALE method for single-cell ATAC-seq analysis via latent feature extraction. transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods 14, 9753978 (2017).

Ji, Z., Zhou, W., Hou, W. & Ji, H. Single-cell ATAC-seq signal extraction and enhancement with SCATE. Genome Biol. 21, 161 (2020).

Satpathy, A. T. et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 9253936 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360

(2019).

Luecken, M. D. et al. A sandbox for prediction and integration of DNA, RNA, and Cell Ranger ATAC Algorithms Overview. https://support.10xgenomics.com/single-cellproteins in single cells. in (2021).

Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics.

Deal, R. B., Henikoff, J. G. & Henikoff, S. Genome-Wide Kinetics of Nucleosome

Miao, Z. & Kim, J. Is single nucleus ATAC-seq accessibility a qualitative or quantitative

1. Buenrostro , J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory 3 . Stuart , T. , Srivastava , A. , Lareau , C. & Satija , R. Multimodal single-cell chromatin analysis seq data . Nat. Methods 16 , 3973400 ( 2019 ). 8. Ashuach , T. , Reidenbach , D. A. , Gayoso , A. & Yosef , N. PeakVI: A deep generative model for single-cell chromatin accessibility analysis . Cell Rep. Methods 100182 ( 2022 ) Schep , A. N. , Wu , B. , Buenrostro , J. D. & Greenleaf , W. J. chromVAR: inferring 11. Biotechnol. 40 , 1633166 ( 2022 ).