The advent and success of foundation models such as GPT have sparked growing interest in their application to single-cell biology. Models like Geneformer and scGPT have emerged with the promise of serving as versatile tools for this specialized field. However, the efficacy of these models, particularly in zero-shot settings, where models are used without any further training rather than fine-tuned, remains an open question, especially as practical constraints require useful models to function in settings that preclude fine-tuning (e.g., discovery settings where labels are not fully known). This paper presents a rigorous evaluation of the zero-shot performance of these proposed single-cell foundation models. We assess their utility in tasks such as cell type clustering and batch effect correction, and evaluate the generality of their pretraining objectives. Our results indicate that both Geneformer and scGPT exhibit limited reliability in zero-shot settings and often underperform compared to simpler methods. These findings serve as a cautionary note for the deployment of proposed single-cell foundation models and highlight the need for more focused research to realize their potential.²

Preprint. Under review.

The emergence of foundation models in machine learning has been both transformative and rapid, as evidenced by the success of systems like ChatGPT [1] and DALL·E [2]. Foundation models are machine learning methods pretrained on huge amounts of data, where the aim of the pretraining is to enable models to capture universal patterns in the data [3]. These models serve as adaptable starting points that can either be fine-tuned, which involves a small amount of additional training to prompt the model to produce specific predictive outputs, or used zero-shot, which involves extracting the model's internal representation of input data (an "embedding") for downstream analysis with no further task-specific training.
In single-cell biology, the foundation model framework offers an avenue for automating complex tasks,
such as cell type identification and gene expression prediction. Emerging research has begun to explore the
potential of foundation models in single-cell biology, particularly in single-cell transcriptomics, with several
models now available. These include scBERT [
² The code used for our analyses can be accessed at https://github.com/microsoft/zero-shot-scfoundation.
poses challenges for many labs. Even minimally fine-tuning foundation models can require extensive GPU
resources, given that, for example, scGPT’s architecture relies on the use of FlashAttention [
In this study, we assessed the zero-shot performance of two proposed foundation models in single-cell
biology: Geneformer [
We evaluated two proposed foundation models for single-cell transcriptomics: Geneformer [
Both models accept single-cell gene expression vectors as input but represent input data differently. The input
to the Geneformer model is a ranked list where the gene’s position represents the gene’s expression relative
to the remaining genes in the cell. The model leverages a BERT-inspired architecture with 6 Transformer
layers, each with 4 attention heads. Geneformer is trained using a modification of the masked language
modeling (MLM) task, where the model is trained to recover randomly selected genes that are masked or
corrupted. Since genes are ordered by their expression, this effectively predicts gene expression relative to
other genes. The model outputs gene embeddings, which are subsequently decoded into gene predictions. A
cell embedding is calculated by averaging over all gene embeddings extracted for that cell. Geneformer was
pretrained on 27.4M human single-cell transcriptomes (excluding malignant and immortalized cells).
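As a concrete, deliberately simplified sketch of this input representation and pooling (the helper names and toy data are ours, not Geneformer's actual implementation):

```python
import numpy as np

def rank_encode(expression: np.ndarray) -> np.ndarray:
    """Order gene indices by descending expression (Geneformer-style rank encoding)."""
    return np.argsort(-expression)

def cell_embedding(gene_embeddings: np.ndarray) -> np.ndarray:
    """Average a cell's per-gene embeddings into a single cell embedding."""
    return gene_embeddings.mean(axis=0)

# Toy example: 5 genes, 3-dimensional gene embeddings
expr = np.array([0.0, 3.2, 1.1, 0.5, 2.7])
ranked_genes = rank_encode(expr)   # most highly expressed gene (index 1) comes first
emb = np.random.default_rng(0).normal(size=(5, 3))
cell = cell_embedding(emb)         # one vector per cell, shape (3,)
```

Because the encoding keeps only relative ordering, recovering a masked gene's position amounts to predicting its expression relative to the other genes, as described above.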
scGPT preprocesses each gene expression vector by independently binning its values into 50 equidistant bins,
where the lowest bin corresponds to the lowest expression values and the highest bin to the highest. Next, the binned
values and the gene token (i.e. a unique index for each gene) are separately embedded, and summed in the
embedding space, jointly representing the gene and its binned expression. Like Geneformer, scGPT uses an
MLM task. However, scGPT directly learns a cell embedding, which is integrated into its pretraining loss
of predicting masked genes: scGPT first predicts a masked gene expression bin and a cell embedding from
unmasked genes and then, in a second step, further iteratively refines masked gene expression using the cell
embedding predicted in the first step. This means that scGPT outputs two sets of binned gene predictions in
its pretraining task, first from unmasked genes alone and second from conditioning on the cell embedding. In
our effort to understand the generalization of the pretraining objectives, we analyzed both. Finally, compared
to Geneformer, scGPT has 3× the parameters, using 12 Transformer layers with 8 attention heads. scGPT
is available in several variants, pretrained on multiple different datasets. In our analyses, we focused on
three variants of scGPT: pretrained on 814,000 kidney cells (scGPT kidney), on 10.3 million blood and bone
marrow cells (scGPT blood), and on 33 million non-cancerous human cells (scGPT human).
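A minimal sketch of this preprocessing, following the equidistant-binning description above (illustrative only; scGPT's actual implementation may differ, and the embedding tables here are random stand-ins):

```python
import numpy as np

def bin_expression(expression: np.ndarray, n_bins: int = 50) -> np.ndarray:
    """Map a cell's expression values to equidistant bins spanning its own range."""
    edges = np.linspace(expression.min(), expression.max(), n_bins + 1)
    # Interior edges yield bin indices in [0, n_bins - 1]
    return np.digitize(expression, edges[1:-1])

n_genes, n_bins, d = 5, 50, 8
expr = np.array([0.0, 3.2, 1.1, 0.5, 2.7])
bins = bin_expression(expr, n_bins)

# Gene tokens and bin values are embedded separately, then summed so that each
# position jointly represents a gene and its binned expression.
rng = np.random.default_rng(0)
gene_token_emb = rng.normal(size=(n_genes, d))  # one row per gene token
bin_emb = rng.normal(size=(n_bins, d))          # one row per expression bin
model_input = gene_token_emb + bin_emb[bins]
```

The lowest-expressed gene lands in bin 0 and the highest in bin 49, and the summed matrix is what a Transformer layer would consume.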
For baselines in evaluating cell embeddings, we compared Geneformer and scGPT against selecting highly
variable genes (HVGs). We standardize to 2,000 HVGs across all experiments. In addition, we compared all
methods to scVI, a scalable generative model [
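For illustration, the HVG baseline can be approximated by ranking genes on dispersion (a simplified stand-in for scanpy's pp.highly_variable_genes; the function name and toy matrix are ours):

```python
import numpy as np

def top_hvgs(X: np.ndarray, n_top: int = 2000) -> np.ndarray:
    """Return indices of the n_top most variable genes by dispersion (variance / mean)."""
    mean = X.mean(axis=0)
    var = X.var(axis=0)
    dispersion = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    n_top = min(n_top, X.shape[1])
    return np.argsort(-dispersion)[:n_top]

# Toy matrix: 100 cells x 10 genes; gene 0 is made artificially variable
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(100, 10)).astype(float)
X[:50, 0] += 20.0
hvgs = top_hvgs(X, n_top=3)   # gene 0 should rank first
```

In practice we use the standard 2,000-gene selection; the point is only that this baseline requires no training at all.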
To assess the quality of cell embeddings and performance on batch integration tasks, we used five distinct
human tissue datasets (Table 1). These datasets include samples from the pancreas [
Table 1: Overview of the datasets used in our evaluations.

  Dataset                 Cells   Cell types   Batches   Description
  Pancreas (16k)          16k     14           6         Cells from human pancreas created by combining data spanning 5 studies.
  PBMC (12k)              12k     9            2         PBMCs from a healthy donor.
  PBMC (95k)              95k     10           1         PBMCs from a healthy donor.
  Immune (330k)           330k    45           31        Immune cells extracted from 16 different tissues across 12 adult organ donors.
  Tabula Sapiens (483k)   483k    24           27        Cells from 24 different tissues across 15 human donors.
Among the selected datasets, the Pancreas dataset partially overlapped with the data used to pretrain
Geneformer. We conducted evaluations using both the complete Pancreas dataset and its non-overlapping subset.
The results were highly consistent between the two, leading us to include the entire Pancreas dataset for
simplicity in this evaluation. At the time of dataset selection, information on the data used for scGPT's pretraining
was unavailable, preventing us from determining any potential overlaps.
⁴ Data available via the data.pbmc_dataset function from scvi-tools [
In this work, we evaluated the cell embedding space for its ability to separate known cell types correctly and to integrate different batches. We also evaluated the performance of the models at the pretraining task by measuring their reconstruction accuracy.

2.3.1
One key aspect of evaluating cell embeddings is the degree to which cell types are distinct within the
embedding space. To assess this, we employ metrics based on the Average Silhouette Width (ASW) [
To evaluate batch integration, we used a variation of the ASW score (as described in [
To evaluate the performance of scGPT in its pretraining objective, we used the mean squared error (MSE),
as used by the authors for the model’s loss [
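These metrics can be sketched as follows, using scikit-learn's silhouette_score (the rescaling of ASW to [0, 1], the batch-score variant shown, and the toy data are assumptions following common benchmarking practice, not the paper's exact formulas):

```python
import numpy as np
from sklearn.metrics import silhouette_score

def asw_cell_type(embeddings, cell_types):
    """Average silhouette width over cell-type labels, rescaled from [-1, 1] to [0, 1]."""
    return (silhouette_score(embeddings, cell_types) + 1) / 2

def batch_score(embeddings, batches):
    """Batch-mixing variant: 1 - |ASW over batch labels|; higher means better mixing."""
    return 1 - abs(silhouette_score(embeddings, batches))

def reconstruction_mse(predicted, true):
    """Mean squared error between predicted and true binned expression."""
    predicted, true = np.asarray(predicted, float), np.asarray(true, float)
    return float(((predicted - true) ** 2).mean())

# Toy embedding: two well-separated 'cell types' should score a high ASW
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
types = [0] * 20 + [1] * 20
score = asw_cell_type(emb, types)
```

Higher cell-type ASW rewards separation of biology, while the batch variant rewards mixing of technical batches, so the two pull in opposite directions by design.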
Current proposed single-cell foundation models produce cell embeddings. These embeddings are intended
to project potentially noisy gene expression measurements to a more biologically relevant latent space and
to thus improve our ability to resolve cell types, consistent with previous machine learning methods in this
field (including scVI) [
We evaluated cell type clustering using two metrics, ASW and AvgBIO. For both metrics, Geneformer and scGPT performed worse than our baseline strategies. For ASW, scVI consistently performed well, achieving a median ASW of 0.54 and hitting a low of 0.47 in the Tabula Sapiens dataset (Fig. 2A). Geneformer's performance was more variable, with scores ranging from a high of 0.51 in the PBMC (95k) dataset to a low of 0.37 and 0.38 in the Tabula Sapiens and Pancreas (16k) datasets, respectively. scGPT's performance was comparable with scVI, with median ASW equal to 0.53 and 0.54, respectively. Notably, HVG outperforms Geneformer in all datasets except PBMC.

[Figure 2: Average silhouette width (ASW) score (A) and Average BIO (AvgBIO) score (B) for HVG, scVI, Geneformer, and scGPT across datasets.]

For AvgBIO, HVG surpassed all other models in
three out of five datasets (Fig. 2B). In the PBMC (12k) dataset, scVI and scGPT performed similarly, both
scoring 0.69, while HVG matched the performance of Geneformer, achieving a score of 0.60. In the PBMC
(95k) dataset, scVI reached a score of 0.59, while HVG lagged slightly behind with a score of 0.53.
Foundation models usually employ self-supervised tasks to enable scalability since they can train on any
dataset, not just ones with labels [
Overall, our findings demonstrate that foundation models in zero-shot configurations generally fail to outperform cell embeddings derived from HVG or generated using the scVI model. Evaluating variants of the scGPT model also highlights that pretraining on datasets spanning the same tissues does not necessarily equate to performance above random initialization.

3.2
Next, we sought to assess the zero-shot capabilities of proposed single-cell foundation models in batch
integration. Single-cell transcriptomics experiments, like all biological experiments, are impacted by batch
effects: systematic technical differences that arise when integrating data across different experiments, sequencing
technologies, or even repetitions of the same experiment on the same biological replicates. Due to batch
effects, tasks like mapping a new experiment to a reference atlas to identify the cell types present in the
data can fail. Hence, a common task in single-cell analysis is to eliminate batch effects without removing
meaningful biological differences, allowing for data integration [
We began with a qualitative evaluation of the Pancreas dataset, a
common batch integration benchmark that includes data from five different
sources [
Overall, we observed that while Geneformer and scGPT-human can integrate different experiments conducted with the same experimental technique, they generally fail to correct for batch effects between techniques. As depicted in Fig. 4A, the cell embedding space generated by Geneformer fails to retain information about cell type, and any clustering is primarily driven by batch effects (Fig. 4B). On the other hand, the space created by scGPT offers some separation of cell types (Fig. 4A), but the primary structure in the dimensionality reduction is driven by batch effects (Fig. 4B). In contrast, even the simple baseline of selecting highly variable genes (HVG) qualitatively produces a similar or better result to scGPT, with the SMARTer technique now being integrated with InDrop. Finally, we observed that scVI mostly integrates this dataset, forming clusters primarily due to cell type, with most techniques in the same cluster.

[Figure 5: HVG selection outperforms proposed foundation models. Batch integration score (described in Section 2.3.2) calculated for all four datasets with at least two batches: Pancreas (16k), PBMC (12k), Immune (330k), and Tabula Sapiens (483k).]
To support these qualitative results, we produced batch integration metrics for each of our five datasets (Fig. 4C). Geneformer underperforms compared to both scGPT and scVI across most datasets, achieving a median batch integration score of only 0.79. scVI outperforms scGPT in datasets where the batch is restricted to technical variation (Pancreas and PBMC datasets), and scGPT performs better in more complex datasets where both technical and biological batch effects are present (Immune and Tabula Sapiens datasets). Surprisingly, the best batch integration scores for all datasets were achieved by selecting HVG. This observation differs slightly from our qualitative evaluations of the UMAPs, where scVI performs better, and can be explained by shifts in our rankings when calculating metrics in full rather than reduced dimensions, as seen in Fig. S2 (we note that trained proposed foundation models underperform baselines in both settings). In summary, our evaluation suggests that Geneformer and scGPT are not fully robust to batch effects in zero-shot settings, often lagging behind existing methods like scVI, or simple data curation strategies like selecting for HVG, particularly when batch effects are more severe.

3.3
Next, to understand why Geneformer and scGPT underperform compared to baselines zero-shot, we posited two hypotheses. First, it could be that the masked language modeling pretraining framework used by both scGPT and Geneformer does not produce useful cell embeddings. The second could be that scGPT and Geneformer have failed to generalize the pretraining task. Understanding this distinction could produce insights for future directions. For example, if the models are reconstructing masked gene expression well for our evaluation datasets but still failing to produce informative cell embeddings, this implies that a different task may need to be designed; while if the models fail to predict gene expression accurately, improvements to learning the pretraining task could still potentially improve the cell embeddings of these models.

[Figure 6: Reconstruction of gene expression. (A) scGPT masked expression prediction and (B) scGPT prediction from the cell embedding, plotted against input bins; (C) Geneformer predicted expression ranking. Color indicates density.]
Evaluating whether models reconstruct masked gene expression accurately requires us to select how many genes are masked in the input. In training, both models select a percentage of genes to mask. However, following a similar procedure for evaluation introduces stochasticity, and re-running random samples and/or iterating over genes to account for this is computationally expensive. We therefore use all genes, unmasked, as input. Not only does this eliminate stochasticity from sampling masked genes, but it also reflects the maximally informative setting where models are asked to reconstruct genes given complete, not partial, input. To gauge the quality of these reconstructions, we compared them to their true values. For scGPT, we compared the bin value for each gene. Since scGPT produces gene predictions at two stages (with and without conditioning on its cell embedding), we report both. For Geneformer, we compared the gene rankings. Fig. 6 illustrates that both models face challenges in reconstructing gene expression. Without conditioning on the cell embedding, scGPT predicts the mean value of the input bin across all bin values (Fig. 6A). Predictions improve when conditioned on cell embeddings, particularly for higher input values (Fig. 6B). Geneformer also shows limitations. Under its MLM objective, it predicts the most likely gene at a given position. Although there is a strong positive correlation for high-expression genes, the model fails to predict low-expression genes (Fig. 6C), similar to scGPT when conditioned on cell embeddings.

[Figure 7: Mean squared error (MSE) of expression reconstruction across datasets for the mean baseline and the scGPT variants (random, kidney, blood, human), shown for gene expression prediction (GEP) and gene expression prediction conditioned on the cell embedding (GEPC).]
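The two reconstruction checks used in this section can be sketched as follows (illustrative code with hypothetical helper names; the naive baseline predicts each gene by its mean expression across cells, and rankings are compared via Pearson correlation):

```python
import numpy as np

def mean_baseline_mse(X: np.ndarray) -> float:
    """MSE of predicting each gene by its mean expression across cells (naive baseline)."""
    pred = np.broadcast_to(X.mean(axis=0), X.shape)
    return float(((X - pred) ** 2).mean())

def ranking_correlation(true_positions: np.ndarray, predicted_positions: np.ndarray) -> float:
    """Pearson correlation between true and predicted gene ranking positions."""
    return float(np.corrcoef(true_positions, predicted_positions)[0, 1])

# Toy checks: a constant matrix is perfectly predicted by the mean baseline,
# and a perfectly recovered ranking correlates at 1.0
const_mse = mean_baseline_mse(np.ones((4, 3)))
r_perfect = ranking_correlation(np.arange(10), np.arange(10))
```

A model only adds value over the baseline when its reconstruction MSE falls below the baseline's, which is the comparison made next.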
Next, we compared the performance of scGPT against a naive baseline of just predicting the mean expression value of each gene. Surprisingly, this baseline outperformed all scGPT variants when not using cell embeddings (Fig. 7), with only marginal improvements observed when conditioning on the cell embedding. Geneformer does not directly predict expression but generates a ranked list of genes. To evaluate Geneformer, we therefore measured Pearson's correlation between the predicted ranking of genes and the actual gene ranking. Overall, there was only a moderate correlation (Fig. 8), with a median correlation across all five datasets of 0.59 and a best correlation of 0.96 on the PBMC (95k) dataset.

[Figure 8: Geneformer outputs improved rankings over the average. Pearson's correlation computed between the input ranking and a ranking composed of the average position of a gene across all cells in the dataset (left) or the Geneformer output (right).]

4 Discussion

In this work, we evaluated two proposed foundation models for single-cell biology, Geneformer and scGPT, and demonstrated their unreliability in zero-shot settings. In cell type clustering analyses, both models fail to improve reliably over scVI. Critically, for some datasets, the proposed foundation models perform worse at clustering cell types than just selecting highly variable genes. At least for scGPT, we show that matching the tissue of origin of the pretraining dataset to the target task does not guarantee performance over even random initialization, and that increasing the size and diversity of the pretraining dataset over smaller tissue-specific data can sometimes decrease performance. This suggests more research is needed to articulate the relationship between pretraining data and performance. We also demonstrate that these models are not fully robust to batch effects in zero-shot settings, often lagging behind methods like scVI or simple data curation strategies like selecting for HVG. Together, our results caution against using current single-cell transcriptomic foundation models in zero-shot settings. Our analyses provide some insight into where future work needs to be concentrated to build bona fide foundation models that are truly useful in these settings. We showed that neither scGPT nor Geneformer can accurately predict gene expression on our evaluation datasets, even though these models are directly trained to predict gene expression via their pretraining tasks. Notably, scGPT defaults to predicting the median bin when only given access to gene embeddings (and not a cell embedding).
This raises the possibility that current adaptations of masked language modeling (MLM) are not effective at learning gene embeddings, which would also impact Geneformer, given that it produces a cell embedding by averaging over gene embeddings. Whether MLM, in general, is suited for learning single-cell embeddings is still an open question, but our work suggests that current models are not effective at generalizing the MLM objective and that a good next step in the field would be to improve the representation of genes and gene expression to overcome this challenge.