Adapting protein language models for rapid DTI prediction

Samuel Sledzieski∗ (MIT, samsl@mit.edu), Rohit Singh∗ (MIT, rsingh@mit.edu), Lenore Cowen (Tufts University, cowen@cs.tufts.edu), Bonnie Berger (MIT, bab@mit.edu)

Abstract

We consider the problem of sequence-based drug-target interaction (DTI) prediction, showing that a straightforward deep learning architecture that leverages pre-trained protein language models (PLMs) for protein embedding outperforms state-of-the-art approaches, achieving higher accuracy, expanded generalizability, and an order of magnitude faster training. PLM embeddings are found to contain general information that is especially useful in few-shot (small training data set) and zero-shot (unseen proteins or drugs) settings. Additionally, the PLM embeddings can be augmented with features tuned by task-specific pre-training, and we find that these task-specific features are more informative than the baseline PLM features. We anticipate that such transfer learning approaches will facilitate rapid prototyping of DTI models, especially in low-N scenarios.

1 Introduction

Predicting drug-target interaction (DTI), a critically important problem in drug discovery, should ideally be informed by protein and drug structures. However, even if all protein structures were available (say, by AlphaFold2 prediction [14]), the computational expense of docking is prohibitive for large-scale DTI screening, suggesting that sequence-based prediction of DTIs will remain important. Accordingly, in this paper we consider the computational prediction of DTIs when the inputs are a) a molecular description of the drug (such as the SMILES string [1]) and b) the amino acid sequence of the protein target. Many methods have been proposed to address the DTI problem in this formulation [2], with state-of-the-art approaches relying on deep learning architectures that build protein representations using sequence models such as convolutional neural networks [15] and, more recently, transformers [11]. As the focus of this work is optimal protein representations for DTI prediction, we fix the drug molecular representation to the commonly used Morgan fingerprint [17], noting that recently introduced alternative representations may further increase performance [12, 13].

The key claim of this work is that a pre-trained protein representation from protein language models can offer state-of-the-art performance with a relatively straightforward model architecture, even on tasks that have already been the focus of dedicated deep learning model design. Pre-training is especially impactful when the training set is small or unbalanced, or when the test set contains hitherto unseen proteins or drugs. We emphasize here the importance of matching the pre-trained representation to the task. As we show in a feature attribution analysis, augmenting language models with additional training on protein-protein interaction (PPI) prediction yields features that are more informative for DTI prediction than those from baseline embeddings pre-trained only on protein sequences. Thus, our work constitutes a concrete demonstration of the power of a well-designed transfer learning approach that adapts foundation models for a specific task [4, 7].

∗Authors contributed equally to this work.

2 Methods

Given a SMILES string and a protein sequence, our predictive framework consists of the following three steps (Figure 1):

1. Featurize the drug and protein using pre-trained models or algorithms.
2. Transform both the drug and the protein into a shared latent space.
3. Output the DTI prediction based on the drug-protein distance in the latent space.

Molecular featurization: We featurize the drug molecule by its Morgan fingerprint [17], an encoding of the SMILES string of the molecular graph as a fixed-dimension embedding M ∈ R^{d_m} (we chose d_m = 2048) obtained by considering the local neighborhood around each atom. The utility of the Morgan fingerprint for small-molecule representation has been demonstrated in [20].
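For illustration, a Morgan fingerprint of this kind can be computed with RDKit; this is a minimal sketch of the featurization step, and RDKit, the radius of 2, and the example SMILES string are assumptions rather than details taken from the paper.

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan_fingerprint(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    # Parse the SMILES string into an RDKit molecule and hash local atom
    # neighborhoods into a fixed-length bit vector (the Morgan fingerprint).
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.asarray(list(fp), dtype=np.float32)

M = morgan_fingerprint("CC(=O)Oc1ccccc1C(=O)O")  # aspirin; M has shape (2048,)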

Pre-trained protein featurization: We generate protein features using pre-trained protein language models (PLMs). Each PLM generates a per-residue embedding E^+ ∈ R^{n×d} for a protein of length n, which is then pooled along the length of the protein, resulting in a vector E ∈ R^d. Specifically, we investigate the PLMs from Bepler & Berger [3], ESM [19], and ProtBert [6], with default dimensions d = 6165, 1280, and 1024, respectively.
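For example, with ESM-1b (d = 1280) the per-residue embedding can be extracted using the fair-esm package and pooled by averaging; the choice of representation layer and of mean pooling below are assumptions, since the text states only that the embedding is pooled along the protein length.

import torch
import esm  # fair-esm package

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

seq = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
_, _, tokens = batch_converter([("query_protein", seq)])

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
per_residue = out["representations"][33][0, 1:len(seq) + 1]  # drop BOS/EOS tokens -> (n, 1280)
E = per_residue.mean(dim=0)                                  # pooled protein vector -> (1280,)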

Additionally, we evaluate the output of the projection module of a D-SCRIPT PPI prediction model trained on human PPIs, using each language model to provide the input embeddings [22]. Details of this featurization (d = 100) can be found in Appendix A.1. We emphasize that the language and projection models are used exclusively to generate input features; their weights are kept fixed and are not updated during DTI training.
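The following sketch illustrates this frozen feature-extraction step; the single linear-plus-ReLU layer is a stand-in for the trained D-SCRIPT projection module, not its actual architecture.

import torch
import torch.nn as nn

# Stand-in for the trained D-SCRIPT projection: maps the (n, d) PLM embedding to (n, 100).
projection = nn.Sequential(nn.Linear(1280, 100), nn.ReLU())
projection.requires_grad_(False)  # freeze: no gradients flow into the feature extractor
projection.eval()

plm_embedding = torch.randn(350, 1280)  # per-residue ESM embedding for a length-350 protein
with torch.no_grad():
    dscript_features = projection(plm_embedding)  # (350, 100) task-tuned input features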

Transformation into a shared latent space: Given a small-molecule embedding M ∈ R^{d_m} and a protein embedding E ∈ R^d, we transform them separately into M^*, E^* ∈ R^h using fully-connected multi-layer perceptrons with ReLU activations. Given the latent embeddings M^*, E^*, we compute the probability of a drug-target interaction p̂ as the cosine similarity between the two embedding vectors.

Training and Implementation: The loss is the binary cross-entropy between the true labels y and the predicted interaction probabilities p̂. Model weights were updated by error back-propagation using the Adam optimizer with a learning rate of 10^-4 over 50 epochs, with a batch size of 16. We used a latent dimension of h = 1024 (results were robust to variations in the latent dimension). We implemented this framework in PyTorch version 1.9.

Data Sets: To evaluate the predictive accuracy of our framework, we use three DTI benchmark data sets. Two of them, DAVIS [5] and BindingDB [16], consist of pairs of drugs and targets with experimentally determined dissociation constants (K_d). Following [11], we treat pairs with K_d < 30 as positive DTIs and pairs with larger K_d values as negative. The third data set, ChG-Miner from BIOSNAP [24], consists only of positive DTIs. The DAVIS data set represents a few-shot learning setting: it contains only 2,086 training interactions, compared with 12,668 for BindingDB and 19,238 for BIOSNAP. The rest of the data preparation follows [11]. We create negative DTIs by randomly sampling an equal number of protein-drug pairs, with the expectation that a randomly chosen pair is unlikely to interact. The data sets are split into 70% for training, 10% for validation, and the remaining 20% for testing. Training data are sub-sampled to contain an equal number of positive and negative interactions, while validation and test data are left at the natural ratio. A full specification of the data, including the number of unique drugs, proteins, and positive/negative edges, can be found in Table A1.
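A minimal PyTorch sketch of this co-embedding model and a single training step is given below; the two-layer MLPs and the clamping of the cosine similarity into [0, 1] before the binary cross-entropy are assumptions, as the text does not pin down these details.

import torch
import torch.nn as nn

class DTICoembedding(nn.Module):
    """Project a drug fingerprint and a pooled protein embedding into a shared latent space."""
    def __init__(self, d_m: int = 2048, d: int = 1280, h: int = 1024):
        super().__init__()
        self.drug_mlp = nn.Sequential(nn.Linear(d_m, h), nn.ReLU(), nn.Linear(h, h))
        self.prot_mlp = nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, h))

    def forward(self, M: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
        M_star = self.drug_mlp(M)    # drug in the shared latent space
        E_star = self.prot_mlp(E)    # protein in the shared latent space
        return torch.cosine_similarity(M_star, E_star, dim=-1)

model = DTICoembedding()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCELoss()

# One illustrative training step on a random mini-batch of 16 drug-protein pairs.
M_batch, E_batch = torch.randn(16, 2048), torch.randn(16, 1280)
y_batch = torch.randint(0, 2, (16,)).float()

optimizer.zero_grad()
p_hat = model(M_batch, E_batch).clamp(min=0.0)  # cosine similarity clamped into [0, 1] (assumption)
loss = loss_fn(p_hat, y_batch)                  # binary cross-entropy against the true labels
loss.backward()
optimizer.step()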

Experiment Design: For each data set, we evaluate the predictive performance of the pre-trained Bepler & Berger, ESM, and ProtBert embeddings. Additionally, we compare with MolTrans [11], GNN-CPI [23], and DeepConv-DTI [15], which have been shown to achieve state-of-the-art performance on DTI prediction, specifically on these benchmark data sets. During training, we monitor the AUPR, AUROC, and F1 metrics on the validation set and store the model with the highest AUPR, which is then evaluated on the held-out test set. Each model is trained with 5 random initializations, and we report the mean and standard deviation of each metric.
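For reference, the monitored metrics can be computed with scikit-learn; this is an assumption about tooling, AUPR is computed here as average precision, and the 0.5 decision threshold for F1 is illustrative.

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, f1_score

def evaluate(y_true: np.ndarray, p_hat: np.ndarray, threshold: float = 0.5) -> dict:
    # AUPR as average precision; threshold the predicted probabilities for F1.
    return {
        "AUPR": average_precision_score(y_true, p_hat),
        "AUROC": roc_auc_score(y_true, p_hat),
        "F1": f1_score(y_true, (p_hat >= threshold).astype(int)),
    }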

3 Results

3.1 Improved DTI prediction

We demonstrate that co-embedding pre-trained protein language model features with small-molecule features achieves state-of-the-art performance on all three benchmark data sets. Across all three data sets, the three PLMs perform similarly, consistent with prior work showing that there is often ambiguity as to which PLM is best suited to a given task [8, 9]. However, the PLM models consistently outperform the non-PLM models, with especially large improvements on the most challenging DAVIS set, where very little training data is available and evaluation is unbalanced. DeepConv-DTI achieves the best AUROC on the unbalanced BindingDB, but the PLM models have higher AUPR, which is the more representative metric for unbalanced data sets.

3.2 Feature attribution reveals information gain from tuning on PPI

We additionally investigated training DTI models using protein language model embeddings augmented with features from a D-SCRIPT model pre-trained on human PPIs (see Appendix A.1 for details). While the top-line performance of the augmented models is similar to that of the base models (Table A3), an attribution study using DeepLIFT [21] shows that the new D-SCRIPT-derived features are disproportionately represented among the most important features. This suggests that tuning on a related task refines the representations from the general protein language models into ones better suited to the specific task, as in [7]. This explanation is supported by the fact that the 100-dimensional D-SCRIPT features alone achieve only slightly lower performance on the DTI task than PLM-based models with 10-50x as many parameters (Table A3).
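A sketch of how such an attribution study could be run with Captum's DeepLift implementation is shown below; Captum, the stand-in model, and all layer sizes are assumptions, since the paper specifies only that DeepLIFT attributions were computed over the protein features.

import torch
import torch.nn as nn
from captum.attr import DeepLift

class ToyDTIModel(nn.Module):
    """Stand-in for a trained co-embedding DTI model: (drug fingerprint, protein features) -> score."""
    def __init__(self, d_m: int = 2048, d_p: int = 1280 + 100, h: int = 64):
        super().__init__()
        self.drug_mlp = nn.Sequential(nn.Linear(d_m, h), nn.ReLU())
        self.prot_mlp = nn.Sequential(nn.Linear(d_p, h), nn.ReLU())

    def forward(self, drug, prot):
        return torch.cosine_similarity(self.drug_mlp(drug), self.prot_mlp(prot), dim=-1)

class ProteinAttributionWrapper(nn.Module):
    """Expose only the protein features to the attribution method; the drug is held fixed."""
    def __init__(self, dti_model, drug_fp):
        super().__init__()
        self.dti_model, self.drug_fp = dti_model, drug_fp

    def forward(self, prot):
        return self.dti_model(self.drug_fp.expand(prot.shape[0], -1), prot)

dti_model = ToyDTIModel()
wrapped = ProteinAttributionWrapper(dti_model, torch.randn(1, 2048))
prot = torch.randn(8, 1280 + 100)                  # [PLM | D-SCRIPT] concatenated protein features
attr = DeepLift(wrapped).attribute(prot, baselines=torch.zeros_like(prot))
plm_importance = attr[:, :1280].abs().mean()       # attribution mass on raw PLM features
dscript_importance = attr[:, 1280:].abs().mean()   # attribution mass on D-SCRIPT-derived features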

3.3 Zero-shot learning with pre-trained protein embeddings

In [22], Sledzieski et al. demonstrate that language models enable D-SCRIPT to generalize especially well to out-of-species PPIs. Here, we show that this generalization extends to DTIs involving proteins that are unseen in the training set. The [Bepler & Berger | BB-D-SCRIPT] featurization outperforms MolTrans on a variation of the BIOSNAP data in which 20% of proteins, and all of their corresponding interactions, were removed as a test set. The improvement over MolTrans is less stark in the unseen-drugs setting, possibly because the informational advantage of pre-training disproportionately benefits the protein representations (Table 2). The performance of the other PLM models was similar to the one shown (Table A3).
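The unseen-protein split can be reproduced, for example, with a group-aware split over protein identifiers; scikit-learn's GroupShuffleSplit and the toy interaction table below are assumptions used purely for illustration.

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy interaction table; in practice this would hold the full BIOSNAP edge list.
pairs = pd.DataFrame({
    "drug":    ["CCO", "CCN", "c1ccccc1", "CCO", "CCC"],
    "protein": ["P1", "P2", "P2", "P3", "P4"],
    "label":   [1, 0, 1, 1, 0],
})

# Hold out 20% of proteins (groups), along with every interaction that involves them.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(pairs, groups=pairs["protein"]))
train, test = pairs.iloc[train_idx], pairs.iloc[test_idx]
assert set(train["protein"]).isdisjoint(test["protein"])  # no protein appears in both splits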

3.4 Pre-training enables an order of magnitude faster optimization

One of the benefits of using a hierarchy of pre-trained models is that computation times are amortized over the lifespan of downstream applications. Pre-trained models incur an up-front computational cost but can then be re-used for multiple inference tasks with straightforward architectures. Our framework allows DTI models to be trained up to an order of magnitude faster than an end-to-end method. Inference of DTIs is also faster, with speedups here ranging from 2x to 5x (Table A2).
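As a practical illustration of this amortization, the pooled protein embeddings can be computed once and cached, so every subsequent DTI training run starts from a lookup table rather than re-running the PLM; the cache format and function names below are illustrative assumptions.

import torch

def build_embedding_cache(sequences: dict, embed_fn, path: str = "plm_embeddings.pt") -> dict:
    # sequences: {protein_id: amino-acid string}; embed_fn: sequence -> pooled embedding tensor.
    cache = {pid: embed_fn(seq).cpu() for pid, seq in sequences.items()}
    torch.save(cache, path)  # pay the PLM cost once, up front
    return cache

# Every later experiment then reduces protein featurization to a dictionary lookup:
# cache = torch.load("plm_embeddings.pt"); E = cache["some_protein_id"]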

4 Discussion

Previous work has recognized the value of meaningful drug molecule representations for DTI prediction [10, 18], but relatively little work has focused on the target protein representation. Here, we show that pre-trained embeddings from protein language models, combined with simple molecular features, not only achieve state-of-the-art performance on the DTI prediction task but also enable substantially better accuracy in the few-shot (DAVIS data set) and zero-shot (unseen proteins) learning settings. Furthermore, features learned by pre-training on the related PPI prediction task provide additional information beyond that of general protein language models. This approach enables generalization to unseen proteins as well as fast model training and inference, which is particularly valuable for drug re-purposing and iterative screening, where large compound libraries are evaluated against hitherto-uncharacterized proteins from pathways implicated in a disease of interest. Our framework may also enable more accurate transfer of DTI predictions from the model organisms on which drugs are initially tested to their eventual use in human patients. This work demonstrates the previously unexplored value of language models in the DTI prediction domain, the additional information unlocked by pre-training on related tasks (PPI prediction), and the power of iterative adaptation for transfer learning.

Acknowledgements

SS was supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1745302. RS and BB were supported by NIH grant R35GM141861. LC was supported by NSF grants OAC-1939263 and CCF-1934553.

A.1 D-SCRIPT featurization details

In the original D-SCRIPT paper, protein sequence embeddings from the Bepler & Berger model are used as input features. Here, we additionally train D-SCRIPT models using ESM and ProtBert embeddings as features. These models are trained on the human cross-validation set from [22]. We refer to the output of the first projection module, which takes as input the n × d protein representation and tunes it to an n × 100 representation, as [PLM]-D-SCRIPT, where PLM is either Bepler & Berger (BB), ESM, or ProtBert. As inputs for our DTI model, we explore both concatenating these embeddings to the raw PLM embeddings (e.g., [ESM | ESM-D-SCRIPT]), so that each amino acid has d + 100 features, and using the D-SCRIPT projections on their own, as sketched below.
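A minimal sketch of this per-residue concatenation, assuming ESM embeddings (d = 1280) and the same mean pooling as before:

import torch

plm = torch.randn(350, 1280)   # n x d per-residue ESM embedding for a length-350 protein
proj = torch.randn(350, 100)   # n x 100 output of the frozen D-SCRIPT projection module
combined = torch.cat([plm, proj], dim=1)  # n x (d + 100) per-residue features
pooled = combined.mean(dim=0)             # pooled vector fed to the DTI model (mean pooling assumed)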

Table A3: AUPR and AUROC for each protein featurization (the concatenated [PLM | PLM-D-SCRIPT] embedding, the PLM-D-SCRIPT projection alone, and the raw PLM embedding, for each of Bepler & Berger (BB), ESM, and ProtBert) on the BIOSNAP (19,238 training interactions) and DAVIS (2,086 training interactions) benchmarks, and on the BIOSNAP unseen-proteins and unseen-drugs splits.

References

[1] E. Anderson, G. D. Veith, and D. Weininger. SMILES, a line notation and computerized interpreter for chemical structures. US Environmental Protection Agency, Environmental Research Laboratory, 1987.
[2] M. Bagherian, E. Sabeti, K. Wang, M. A. Sartor, Z. Nikolovska-Coleska, and K. Najarian. Machine learning approaches and databases for prediction of drug-target interaction: a survey paper. Briefings in Bioinformatics, 2021.
[3] T. Bepler and B. Berger. Learning protein sequence embeddings using information from structure. In 7th International Conference on Learning Representations (ICLR), 2019.
[4] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[5] M. I. Davis, J. P. Hunt, S. Herrgard, P. Ciceri, L. M. Wodicka, G. Pallares, M. Hocker, D. K. Treiber, and P. P. Zarrinkar. Comprehensive analysis of kinase inhibitor selectivity. Nature Biotechnology, 29(11):1046-1051, 2011.
[6] A. Elnaggar, M. Heinzinger, C. Dallago, G. Rihawi, Y. Wang, L. Jones, T. Gibbs, T. Feher, C. Angerer, M. Steinegger, et al. ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. arXiv preprint arXiv:2007.06225, 2020.
[7] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith. Don't stop pretraining: adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964, 2020.
[8] B. L. Hie, K. K. Yang, and P. S. Kim. Evolutionary velocity with protein language models. bioRxiv, 2021.
[9] C. Hsu, H. Nisonoff, C. Fannjiang, and J. Listgarten. Combining evolutionary and assay-labelled data for protein fitness prediction. bioRxiv, 2021.
[10] K. Huang, T. Fu, L. M. Glass, M. Zitnik, C. Xiao, and J. Sun. DeepPurpose: a deep learning library for drug-target interaction prediction. Bioinformatics, 36(22-23):5545-5547, 2021.
[11] K. Huang, C. Xiao, L. M. Glass, and J. Sun. MolTrans: Molecular Interaction Transformer for drug-target interaction prediction. Bioinformatics, 37(6):830-836, 2021.
[12] W. Jin, R. Barzilay, and T. Jaakkola. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning, pages 2323-2332. PMLR, 2018.
[13] W. Jin, R. Barzilay, and T. Jaakkola. Hierarchical generation of molecular graphs using structural motifs. In International Conference on Machine Learning, pages 4839-4848. PMLR, 2020.
[14] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583-589, 2021.
[15] I. Lee, J. Keum, and H. Nam. DeepConv-DTI: Prediction of drug-target interactions via deep learning with convolution on protein sequences. PLoS Computational Biology, 15(6):e1007129, 2019.
[16] T. Liu, Y. Lin, X. Wen, R. N. Jorissen, and M. K. Gilson. BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Research, 35(suppl_1):D198-D201, 2007.
[17] H. L. Morgan. The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service. Journal of Chemical Documentation, 5(2):107-113, 1965.
[18] B. Ramsundar. Molecular machine learning with DeepChem. PhD thesis, Stanford University, 2018.
[19] A. Rives, J. Meier, T. Sercu, S. Goyal, Z. Lin, J. Liu, D. Guo, M. Ott, C. L. Zitnick, J. Ma, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15), 2021.
[20] D. Rogers and M. Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742-754, 2010.
[21] A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. In International Conference on Machine Learning, pages 3145-3153. PMLR, 2017.
[22] S. Sledzieski, R. Singh, L. Cowen, and B. Berger. D-SCRIPT translates genome to phenome with sequence-based, structure-aware, genome-scale predictions of protein-protein interactions. Cell Systems, 12:1-14, 2021.
[23] M. Tsubaki, K. Tomii, and J. Sese. Compound-protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics, 35(2):309-318, 2019.
[24] M. Zitnik, R. Sosič, S. Maheshwari, and J. Leskovec. BioSNAP Datasets: Stanford biomedical network dataset collection. http://snap.stanford.edu/biodata, 2018.