Efficient and scalable de novo protein design using a relaxed sequence space

Christopher Frank, Ali Khoshouei, Yosta de Stigter, Dominik Schiewitz, Shihao Feng, Sergey Ovchinnikov, Hendrik Dietz (dietz@tum.de)

Laboratory for Biomolecular Nanotechnology, Department of Biosciences, School of Natural Sciences, Technical University of Munich, Am Coulombwall 4a, 85748 Garching, Germany
Munich Institute of Biomedical Engineering, Technical University of Munich, Boltzmannstraße 11, 85748 Garching, Germany
Faculty of Applied Sciences, Harvard University, Cambridge, MA, USA
John Harvard Distinguished Science Fellowship, Harvard University, Cambridge, MA, USA
Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai

doi: 10.1101/2022.12.13.520346 (February 25, 2023)

Deep learning techniques are being used to design new proteins by creating target backbone geometries and finding sequences that can fold into those shapes. While methods like ProteinMPNN provide an efficient algorithm for generating sequences for a given protein backbone, there is still room for improving the scope and computational efficiency of backbone generation. Here, we report a backbone hallucination protocol that uses a relaxed sequence representation. Our method enables protein backbone generation using a gradient-descent-driven hallucination approach and offers orders-of-magnitude efficiency gains over previous hallucination approaches. We designed and experimentally produced over 50 proteins, most of which expressed well in E. coli, were soluble, and adopted the desired oligomeric state along with the correct composition of secondary structure as measured by CD. As an example, we determined 3D electron density maps using single-particle cryo-EM analysis for three single-chain de novo proteins, each comprising 600 amino acids, which closely matched the designed shapes. These proteins have no structural analogues in the Protein Data Bank (PDB) and thus represent potentially novel folds or arrangements of domains. Our approach broadens the scope of de novo protein design and makes it more accessible to a wider community.

Introduction

Deep-learning (DL) based protein structure prediction methods such as AlphaFold2 (AF2)1 and RoseTTAFold2 can generalize beyond the sequence and structure space they have been trained upon to correctly predict the structures of de novo designed proteins3–6. Additionally, AlphaFold2 can distinguish between favorable and unfavorable structure templates7, suggesting that it has learned the physical properties of energetically stable protein backbones. DL-based structure prediction methods can thus connect sequence space and protein structure space, and enable searching for protein structures that satisfy a given design target. A DL-based design method called ‘deep network hallucination’8 leverages this connection for a variety of protein design problems by iteratively updating a protein sequence until a desired property encoded in a mathematical loss function is obtained. Typically, when performing deep network hallucination, random mutations are applied in a Markov chain Monte Carlo (MCMC) fashion4,6,9–11. However, this method can be computationally inefficient because it requires a large number of iterations. It may also fail to identify a solution at all, much like random Monte Carlo searches for numerical solutions to mathematical problems. To make the search more targeted, researchers have proposed using sequence gradients obtained by inverting structure prediction networks12. However, when updating the sequence with these gradients, a major issue arises from the discrete, one-hot encoded representation of sequences. One-hot encoding (Fig.: 1, A) represents each amino acid or nucleotide as a binary vector with all zeros, except for a single ‘1’ indicating the position of the corresponding residue. While the gradient obtained from backpropagation through the structure prediction network is non-discrete, previous approaches tried to regain discreteness by using a ‘straight-through estimator’13 in the sequence update loop5,12,14.
Though this approach worked well for models predicting distributions of distances for every pair of positions, such as TrRosetta, it did not work well in more recent models such as AlphaFold, where single amino acid changes could radically alter the predicted structure, resulting in unstable and inefficient optimization. Therefore, alternative methods of representing sequences in a continuous format are being explored to overcome this problem.

The AF2 network can predict sequences generated through the hallucination process with high confidence. However, despite this high confidence, the experimental success rate of the resulting proteins is often low, resulting in the production of insoluble proteins6,15. To resolve this issue, it has become standard practice to generate a new sequence for the given backbone using the ProteinMPNN sequence design software15. ProteinMPNN has demonstrated significant success in designing soluble sequences for specific protein backbones, making it a valuable tool for improving the experimental success rate of the resulting proteins. Since the sequence design process for a given backbone geometry can be reliably handled with ProteinMPNN, we hypothesized that using a continuous (relaxed) sequence representation would allow for a more targeted and efficient gradient-descent search through the structure space for desired backbone properties while avoiding the issues associated with the need for a discrete, one-hot encoded sequence. A relaxed sequence representation approach may thus allow for more precise and effective exploration of the structure space for backbones, resulting in improved outcomes for protein design.

To test our hypothesis, we utilized a representation where each amino acid position in the input sequence is a vector of 20 values representing the contributions from each of the 20 natural amino acids.
These values can be fractional, greater than one, and either positive or negative (Fig.: 1, B). Though the input sequence, termed "logits", may represent an unnormalized probability distribution of amino acids, in this work the logits serve as a means to more efficiently explore the AF2 network for backbone geometries. Specifically, the relaxed sequence representation enables a gradient-descent based hallucination protocol that optimizes a loss function with rapid convergence towards high-confidence protein structures (Fig.: 1, C). To implement this relaxed sequence representation, we used the ColabDesign framework, which offers various deep network hallucination protocols with a modified, differentiable version of AF2.

Using this approach for backbone generation, we set up a protein design pipeline that uses relaxed sequence hallucination to generate target backbone geometries. After obtaining high-confidence backbone coordinates with a pLDDT score greater than 90, we use the ProteinMPNN network to generate actual physical candidate sequences represented as a one-hot encoding (Fig.: 1, D). We then repredict the candidate sequences with AlphaFold2 and ESMFold16 to validate that the one-hot candidate sequences obtained from ProteinMPNN form the input backbone geometry. This step is necessary since AF2 models P(structure | sequence) whereas MPNN models P(sequence | structure). It is thus possible that P(sequence | wrong_structure) >= P(sequence | desired_structure). Therefore, one needs to confirm that P(desired_structure | sequence) is high using a new pass through a structure prediction model.
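The difference between the one-hot and the relaxed representation can be illustrated with a minimal sketch. The helper names, the amino acid ordering and the example values below are assumptions chosen for illustration; this is not code from the ColabDesign framework.

```python
import math

# Assumed ordering of the 20 natural amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Discrete encoding: all zeros except a single 1 at the residue's index."""
    return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]


def softmax(logits):
    """Map unnormalized logits to a probability distribution over amino acids."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax_residue(logits):
    """Collapse a relaxed position to its single most favored amino acid."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return AMINO_ACIDS[best]


# One relaxed position: arbitrary real values (illustrative numbers only).
relaxed = [0.0] * 20
relaxed[AMINO_ACIDS.index("L")] = 2.5
relaxed[AMINO_ACIDS.index("A")] = 1.0
relaxed[AMINO_ACIDS.index("P")] = -3.0

probs = softmax(relaxed)
print(one_hot("L"))
print(argmax_residue(relaxed), round(sum(probs), 6))
```

Because every entry of the relaxed vector can change by an arbitrarily small amount, a gradient step always yields a valid input, whereas a one-hot vector can only change by switching residue identity.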

Computational design pipeline & benchmarking

To test the efficiency of our relaxed sequence hallucination, we performed a benchmarking experiment in which we unconditionally hallucinated backbones with three different protocols: gradient descent with relaxed sequence (GD relaxed); GD with argmax() and one-hot encoded sequences (GD hard); and Markov chain Monte Carlo search (MCMC) within the ColabDesign framework (Fig.: 2, A). GD relaxed robustly outperformed the other design approaches in all cases tested: the average convergence “half-life” was 20 iterations across all design test cases, whereas GD hard and MCMC failed to converge to satisfactory results (pLDDT >90) within the tested number of sequence update steps (Fig.: 2, A & Supp. Fig.: 1, A). Notably, the GD relaxed method reliably produced high-confidence backbones with pLDDT scores larger than 95 for all unconditional hallucination trajectories tested (Fig.: 2, A). We note that GD relaxed uses a backpropagation step to implement the gradient descent, whereas MCMC performs only forward passes through AF2. Hence, the computational effort for one sequence update step is greater in GD relaxed, but because of the orders-of-magnitude reduction in the number of steps needed for convergence, relaxed sequence hallucination strongly outperforms MCMC in terms of design speed (Supp. Fig.: 1, B).

The output of GD relaxed backbone hallucination is structurally diverse, ranging from all-helical to all-beta folds and mixtures of helices and sheets (Fig.: 2, B), and the method can also be used to hallucinate backbones with multiple subunits. To further benchmark our hallucination approach, we examined the backbone recovery17 over multiple sizes of proteins ranging from 100 to 1000 amino acids (AAs). To this end, we generated eight candidate sequences for each GD relaxed generated target backbone using ProteinMPNN, and then re-predicted these candidate sequences with AlphaFold2 in single sequence mode. We calculated the root-mean-square deviation (RMSD) and TM-score18 of the predicted structure relative to the initial backbone input and chose the best hit. We observed high agreement of predictions of MPNN-redesigned sequences with the target backbones over all protein sizes tested (Fig.: 2, C). Judging by the median RMSD relative to the target backbone, relaxed sequence hallucination delivered median sub-4 Å backbone recovery for target proteins up to 600 AAs in size. The backbone recovery efficiency drops for backbones larger than 600 AAs (Fig.: 2, C).
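The recovery metric described above can be sketched as follows. The sketch assumes that predicted and target coordinates have already been superposed, that the best hit is the candidate with the lowest RMSD (the text does not state whether RMSD or TM-score decided), and it uses made-up RMSD values. All function names are hypothetical.

```python
import math
from statistics import median


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) points.

    Assumes the two structures are already optimally superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


def best_hit(candidate_rmsds):
    """Best of the candidate sequences for one backbone (lowest RMSD)."""
    return min(candidate_rmsds)


def median_recovery(per_backbone_rmsds):
    """Median over backbones of the best-hit RMSD."""
    return median(best_hit(r) for r in per_backbone_rmsds)


# Illustrative values only: three backbones with three candidates each.
toy = [[3.1, 2.4, 5.0], [1.2, 1.9, 1.5], [6.3, 4.8, 7.7]]
print(median_recovery(toy))
```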

To address the potential concern that the generated backbones are adversarial and specific to AlphaFold, we used ESMFold, an independently trained model that was recently used to generate experimentally validated de novo designs9, as an orthogonal test. ESMFold recapitulates the ProteinMPNN-redesigned backbones in a similar fashion to AF2 with single sequence input (Fig.: 2, D).

Experimental validation

To test whether our GD relaxed hallucination protocol could produce real-world expressible proteins, we generated a test panel of monomeric protein backbones covering sizes ranging from 100 to 600 amino acids. The geometries were chosen to include a mixture of alpha-helical and beta-sheet structural elements. We then fed the resulting backbones to ProteinMPNN for candidate sequence generation and verified the candidates by AF2 single sequence prediction. The candidate sequences were gene-synthesized, expressed in Escherichia coli (BL21), and purified through immobilized metal affinity chromatography (IMAC). Thirteen of the 14 tested protein designs, regardless of size, expressed well, were soluble and had the correct molecular weight as determined by SDS-PAGE analysis (Supp. Fig.: 2, A). Size exclusion chromatography elution profiles obtained under native conditions (Fig.: 3 (SEC)) were consistent with the expected sizes of the tested proteins and predominantly showed a single elution peak. All of the SEC-tested proteins produced circular dichroism (CD) spectra indicating that the proteins are well folded and feature the secondary structure elements expected from their design. All of the tested proteins were also markedly thermostable, as indicated by the CD spectra remaining largely unaltered up to temperatures of 95 °C (Fig.: 3, CD). We attribute these satisfactory results to favorable backbone and sequence generation by GD relaxed and ProteinMPNN working in concert, together with the AF2 single sequence filter. Indeed, one-hot encoding the relaxed sequence representation without passing backbones through ProteinMPNN resulted in largely insoluble proteins in our hands (Supp. Fig.: 2, B), consistent with previous findings9. With a pass through ProteinMPNN, the designed sequences expressed and folded according to expectation (Supp. Fig.: 2, B), in support of previous findings6,15. This shows the high value of ProteinMPNN in the design pipeline. Complementarily, the GD relaxed approach now efficiently generates high-quality backbones.

We adapted our pipeline to design multimeric proteins (see Materials & Methods) and tested homodimeric and homotrimeric candidate designs. All oligomeric designs produced soluble proteins with the expected size of the monomeric chain as determined by SDS-PAGE (Supp. Fig.: 3, A). The oligomeric proteins also had CD spectra that were in good agreement with the expected secondary structure (Supp. Fig.: 3, B). We used FPLC-SEC to evaluate whether the proteins actually formed oligomers as designed. This analysis indicated that, of our design candidates, two homodimers and two homotrimers formed (Fig.: 4, A & B). Next, we modified our pipeline to produce heterodimeric proteins. To this end, we introduced a gap of 50 amino acids19 in the residue index of the relaxed sequence input to AF2. Because the relative positional encoding is clipped at a sequence separation of 32, this gap sets the encoding between the two chains to its maximum constant value. We hallucinated 1000 backbones and generated three candidate sequences for each backbone design with ProteinMPNN. We then used AF2 single sequence prediction to compute separately the structures of the two single chains in each heterodimer. We also predicted all possible homodimer and heterodimer combinations (Fig.: 4, C) to rank the sequences that form heterodimers with high confidence and have a low tendency to form homodimers. We scored the designs based on pLDDT and an interface predicted aligned error (iPAE) (Fig.: 4, D) that is obtained by summing only over the PAE between the two chains. We experimentally tested 18 design pairs, out of which we selected three candidate design pairs because of superior expression yields. The design pair C5-C6 (Fig.: 4, E) showed two distinct monomer peaks in the SEC when expressed and purified individually. A new peak at larger molecular weight emerged in the elution profile once the purified proteins were mixed. The design pair D9-D10 also showed the designed heterodimer formation when we mixed both chains (Fig.: 4, F). Interestingly, one of the monomer chains (D9) formed a homo-oligomer that disassembled and was replaced by the heterodimer when D9 was mixed with D10. AF2 predicts that D9 by itself forms a homotetramer. These tests of homo- and hetero-oligomer designs, while not exhaustive, suggest that our relaxed sequence hallucination method is also capable of designing functional protein-protein interfaces that specifically interact with each other.
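The interface PAE score used for ranking can be sketched as below: only PAE entries whose two residues lie on different chains are counted. The function name, the additional mean (a length normalization that the text does not specify) and the toy matrix are assumptions for illustration.

```python
def interface_pae(pae, len_a):
    """Sum and mean of PAE entries between two chains.

    pae   : square matrix (list of lists) over the concatenated chains A + B
    len_a : number of residues in chain A; the remainder belongs to chain B
    """
    n = len(pae)
    total = 0.0
    count = 0
    for i in range(n):
        for j in range(n):
            # Count the pair only if residues i and j sit on different chains.
            if (i < len_a) != (j < len_a):
                total += pae[i][j]
                count += 1
    if count == 0:
        raise ValueError("both chains must contain at least one residue")
    return total, total / count


# Toy example: chain A has 2 residues, chain B has 1 (illustrative values).
toy_pae = [
    [0.0, 1.0, 10.0],
    [1.0, 0.0, 12.0],
    [11.0, 13.0, 0.0],
]
print(interface_pae(toy_pae, len_a=2))
```

A low inter-chain value for the heterodimer prediction combined with high values for the two homodimer predictions would then correspond to the desired ranking criterion.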

Finally, we experimentally validated the structures of a set of candidate designs using single-particle cryo-electron microscopy (cryo-EM) analysis. To this end, we randomly chose three out of the five proteins comprising 600 amino acids described in Figure 3, each with a molecular weight of around 60 kDa. We determined 3D cryo-EM maps for all three candidate designs using image data acquired with a Titan Krios cryo-electron microscope followed by image processing with RELION20. The resulting maps had resolutions between 5.7 and 5.9 Å according to the gold-standard FSC criterion21 (Supp. Fig.: 4).

At this resolution, the global shape and secondary structure elements can be clearly discerned (Fig.: 5, A, B, C). Rigid-body docking of the AF2 predictions into the measured cryo-EM density maps using ChimeraX22 showed good agreement with the designed models. While the resolution was not high enough to build atomic models from the densities, using PHENIX real space refinement23 we could fit the AF2 predictions to the cryo-EM densities and obtain RMSDs of 0.36 Å (P005), 0.44 Å (P008) and 0.544 Å (P009), respectively, between the PHENIX-fitted atomic models and the AF2 predictions. While these fits may not be sufficiently accurate to evaluate details such as side chain geometries, they do indicate that the relaxed sequence pipeline delivers correct output backbone geometries.

Conclusion

In this work we introduced and experimentally tested a novel backbone generation method using relaxed sequence hallucination. By enabling the unconstrained optimization of sequence gradients using this relaxed sequence representation, our method can efficiently and accurately generate protein backbones up to 600 amino acids long and possibly beyond. We demonstrated experimentally that the designs translate to real proteins that fold well and have the designed properties. Our approach can generate monomeric proteins as well as homo-oligomers and heterodimers. The relaxed sequence hallucination method provides substantial efficiency advantages relative to the more commonly used MCMC methods. For example, recently described de novo designed luciferase enzymes were produced by MCMC hallucination with 30,000 iterations4, and large language model based designs required up to 170,000 iterations9. By contrast, our gradient-descent approach typically converged within fewer than 100 iterations. We thus expect our method to drastically increase the throughput and scope of protein design tasks. Finally, our relaxed sequence gradient descent provides an alternative to recently introduced denoising diffusion models17,24–26, providing researchers with even more tools to enhance the capabilities of de novo protein design.

Author contribution:

C.F. and H.D. designed the research. H.D. supervised the research. C.F. designed proteins and performed computational studies. D.S. and C.F. designed heterodimers. Y.d.S. designed AF-only proteins. S.O. and S.F. developed the ColabDesign framework. S.O. helped conduct computational studies. C.F., Y.d.S. and D.S. performed wet lab experiments. A.K. prepared cryo-EM grids, collected data and performed data processing. C.F., A.K. and H.D. wrote the manuscript. Y.d.S. and S.O. proofread the manuscript.

Data & Code availability

All experimental data, sequences, AF2 predictions and cryo-EM maps will be made available after publication. The exact code used in this manuscript will also be available after publication. The code for the ColabDesign framework is freely available at: https://github.com/sokrypton/ColabDesign

Acknowledgement

We thank Google Cloud Services for providing computational resources. We thank Massimo Kube for help with compute cluster maintenance and useful discussions and Maximilian Honemann for guidance with cloning. We thank Justas Dauparas for help debugging the ColabDesign code.

This work was supported by a European Research Council Advanced Grant to H.D. (grant agreement 101018465) and by the Deutsche Forschungsgemeinschaft through grants provided within the Gottfried Wilhelm Leibniz Program (to H.D.). We (H.D. & C.F.) acknowledge additional support via the Excellence Strategy of the Federal Government and the Länder through the TUM Innovation Network RISE grant. S.O. is supported by NIH DP5OD026389, NSF MCB2032259 and Amgen.

FIGURES & CAPTIONS

[...] individual protomers and assembled heterodimer pair D9 & D10. In the combined plot A is added in 1:1.2 excess.

Supplementary Figures

Supp. Figure 1 | Computational benchmarking. A) Comparison of design trajectories between relaxed-sequence hallucination and MCMC-based design. The x-axis is linearly scaled up to 100 steps and logarithmic thereafter. Both relaxed sequence hallucination and MCMC were performed using the ColabDesign framework. A mutation rate of 3 was used for MCMC. B) Time measured to perform one design trajectory using the ColabDesign framework on an Nvidia A100. For logits hallucination 100 iterations were used, while for MCMC 2000 iterations with a mutation rate of 3 were used. C) Three design trajectories of 200 steps of hallucination performed with the ‘hard’ protocol using ColabDesign.

Supp. Figure 2 | Additional experimental verification. A) Laser-scanned images of SDS-PAGE gels analyzing unconditional monomers with 100, 200 and 300 AAs. L indicates the ladder, while the sample bands are elution fractions after IMAC. B) Laser-scanned images of SDS-PAGE gels analyzing unconditional monomers of 600 AAs in size. E1-E4 indicate the elution fractions after IMAC, while FT shows the flow through. C) Experimental testing of AF-only generated sequences. TOP: Comparison of expression yield. The yield of protein expression was determined as the amount of soluble protein measured with a NanoDrop after expression and IMAC purification. MIDDLE: CD spectra of MPNN-redesigned backbones. BOTTOM: HPLC-SEC traces of proteins indicating their monomeric status.

Supp. Figure 3 | SDS-PAGE gels of homo-oligomers and CD spectra. A) Laser-scanned images of SDS-PAGE gels showing experimental success of homo-oligomer expression and IMAC purification. P = Pellet; S = Supernatant; FT = IMAC flow through; E = IMAC elution fraction; L = Ladder. B) CD spectra of homo-oligomers. C) CD spectra of heterodimer protomers, measured only at room temperature.

Supp. Figure 4 | FSC curves of CryoEM densities FSC Curves created using RELION 4.1 for cryoEM reconstructions shown in Fig.: 5.

Materials & Methods

Relaxed sequence hallucination

Hallucinations were performed using the ColabDesign framework either on Google Colab, Google Cloud or on a local installation. Relaxed sequence hallucination was done using the “design_logits” protocol. Sequences were initialized with the softmax of values drawn from a Gumbel distribution. Normalized gradient descent was used28. The standard loss function consisted of pLDDT, PAE, contact and radius of gyration losses. Hallucination trajectories were run for 100 iterations. Relaxed sequence hallucination outputs were saved as PDB files with a placeholder sequence generated by taking the argmax of the sequence logits.
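The initialization and update rule described above can be sketched on a toy problem. The sketch below is not the ColabDesign implementation: the quadratic loss is a stand-in for the AF2-derived loss, and the learning rate, seed, sequence length and function names are assumptions. It only illustrates Gumbel-softmax initialization followed by gradient descent with a norm-scaled gradient.

```python
import math
import random


def gumbel_softmax_init(length, n_aa=20, seed=0):
    """Initialize each position as the softmax of Gumbel(0, 1) samples."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(length):
        # Gumbel(0, 1) sample: -log(-log(u)) with u uniform in (0, 1).
        g = [-math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9))) for _ in range(n_aa)]
        m = max(g)
        exps = [math.exp(x - m) for x in g]
        total = sum(exps)
        sequence.append([e / total for e in exps])
    return sequence


def normalized_gd_step(x, grad, lr):
    """One descent step along the gradient scaled to unit Euclidean norm."""
    norm = math.sqrt(sum(g * g for row in grad for g in row))
    if norm == 0.0:
        return x
    return [[xi - lr * gi / norm for xi, gi in zip(xr, gr)] for xr, gr in zip(x, grad)]


def toy_loss_and_grad(x, target):
    """Stand-in for the structure-based loss: squared distance to a target."""
    loss = sum((xi - ti) ** 2 for xr, tr in zip(x, target) for xi, ti in zip(xr, tr))
    grad = [[2.0 * (xi - ti) for xi, ti in zip(xr, tr)] for xr, tr in zip(x, target)]
    return loss, grad


x = gumbel_softmax_init(length=5)
target = [[0.0] * 20 for _ in range(5)]
for row in target:
    row[3] = 1.0  # arbitrary target used only for this toy loss

start_loss, _ = toy_loss_and_grad(x, target)
for _ in range(100):  # 100 iterations, matching the trajectory length above
    _, grad = toy_loss_and_grad(x, target)
    x = normalized_gd_step(x, grad, lr=0.05)
end_loss, _ = toy_loss_and_grad(x, target)
print(start_loss, end_loss)
```

Scaling the gradient to unit norm makes the step size independent of the gradient magnitude, which is one common motivation for normalized gradient descent.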

Multimer design

Homo-oligomers were designed using the ‘copies’ option in ColabDesign. This option copies the input logits and introduces an index shift of 50 at the breakpoint between the copies. Additionally, the loss function was extended to include interface PAE and contact losses.
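The effect of the index shift can be illustrated with a short sketch. The main text notes that the relative positional encoding of AF2 is clipped at a sequence separation of 32, so any pair of residues on different chains ends up at the clipped maximum. The function names and the exact offset convention (the next chain starts 50 positions after the end of the previous one) are assumptions and may differ from the ColabDesign implementation.

```python
def residue_index(chain_lengths, offset=50):
    """Residue indices for concatenated chains with a gap at each breakpoint."""
    indices = []
    start = 0
    for n in chain_lengths:
        indices.extend(range(start, start + n))
        start += n + offset  # skip `offset` positions before the next chain
    return indices


def clipped_relative_position(i, j, indices, clip=32):
    """Relative position between residues i and j, clipped to [-clip, clip]."""
    d = indices[j] - indices[i]
    return max(-clip, min(clip, d))


idx = residue_index([3, 3])
print(idx)
print(clipped_relative_position(0, 1, idx))  # neighbors within one chain
print(clipped_relative_position(2, 3, idx))  # across the chain break
```

With a gap larger than the clipping distance, the network receives no information that the two chains are covalently connected, which is the idea behind the residue index modification19.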

Protein MPNN sequence design

To redesign sequences used for experimental testing, a local installation of ProteinMPNN15 was used. Batch_size was set to 1 and a sampling temperature of 0.1 was used. Cysteines were omitted to reduce problems in expression. For computational benchmarking, a custom JAX implementation of ProteinMPNN was used (part of the ColabDesign framework). Again, the sampling temperature was 0.1, and the batch_size was set to 8.
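The role of the sampling temperature and of the omitted cysteines can be sketched for a single position. This is not ProteinMPNN code: ProteinMPNN produces its per-position logits autoregressively from the backbone, whereas the logits below are made-up numbers, and the function name and amino acid ordering are assumptions.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering


def sample_residue(logits, temperature=0.1, omit="C", rng=random):
    """Sample one residue from temperature-scaled logits, skipping omitted ones."""
    scaled = [
        float("-inf") if aa in omit else logit / temperature
        for aa, logit in zip(AMINO_ACIDS, logits)
    ]
    m = max(scaled)
    # exp(-inf) evaluates to 0.0, so omitted residues get zero weight.
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(AMINO_ACIDS, weights=weights, k=1)[0]


# Illustrative logits: cysteine scores highest but is excluded from sampling.
logits = [0.0] * 20
logits[AMINO_ACIDS.index("C")] = 5.0
logits[AMINO_ACIDS.index("L")] = 1.0

rng = random.Random(0)
draws = [sample_residue(logits, rng=rng) for _ in range(200)]
print(draws.count("L"), "C" in draws)
```

Dividing the logits by a temperature of 0.1 sharpens the distribution, so sampling stays close to the most favored residue while still allowing occasional variation between candidate sequences.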

Gene construction and protein expression

Synthetic DNA sequences encoding the proteins were ordered from IDT as eBlocks or gBlocks and cloned into a pET28b(+) vector using Gibson Assembly (NEBuilder, New England Biolabs). All constructs included an N-terminal His tag; heterodimers and 600-mers additionally had a C-terminal Strep tag. Constructs were cloned into E. coli (DH5α), and positive clones were isolated and miniprepped, and the plasmids were sequence verified. Verified plasmids were transformed into E. coli BL21 and expressed from a single colony overnight using a homemade autoinduction medium (for 1 L: 1x TB media (Carl Roth), 2 g lactose, 0.5 g glucose). Expression was done in a shaking incubator at 37 °C in Thomson Ultra Yield flasks.

Protein purification

Overnight cultures were transferred to 50 ml Falcon tubes and centrifuged for 15 min at 4600 g to pellet bacteria. The supernatant was removed and the pellet was resuspended in B-PER bacterial lysis reagent (Thermo Fisher Scientific) supplemented with Pierce Protease Inhibitor (Thermo Fisher Scientific). The lysis reaction was carried out for 15 min at room temperature with gentle shaking. Lysed cells were pelleted for 10 min at 20,000 g. The supernatant was collected and applied to an equilibrated immobilized metal affinity gravity flow column. The column was equilibrated with two column volumes (CV) of water and 2 CV of wash buffer (1x PBS supplemented with 20 mM imidazole). The flow through was applied a second time to increase yield. The column was washed with 10 CV wash buffer and then eluted using four 0.5 CV elution steps with elution buffers 1-4 (1x PBS with 100, 200, 300 and 500 mM imidazole, respectively). All fractions were analyzed using SDS-PAGE. Eluted proteins were desalted using either Amicon Ultra centrifugal filters (Merck Millipore) or Zeba Spin desalting columns (Thermo Fisher Scientific), and concentrations were obtained using absorbance at 280 nm.

CD-spectra measurements

CD spectra were obtained using a Jasco-815 spectrophotometer. A cuvette with 1 mm path length was used. Proteins were buffer exchanged into ddH2O and measured at a concentration of 0.3 mg/ml. For measurements at 95 °C, the instrument was heated and samples were incubated at 95 °C for 5 minutes before measurement.

Size exclusion chromatography (SEC)

HPLC-SEC was performed on an HPLC system equipped with a Yarra 3000 column at a flow rate of 1 ml/min and 90-94 bar pressure. Phosphate-buffered saline (PBS) was used as liquid phase. FPLC-SEC was performed on an Äkta Pure 25 using a Superdex 75 Increase 100/300 column with a flow rate of 0.5 ml/min at 4 °C. Again, PBS was used as liquid phase.

Preparation of Vitrified Specimens

Cryo-electron microscopy (cryo-EM) grids, specifically Quantifoil 200-mesh copper grids with R1.2/1.3 holey carbon support films, were first glow discharged for 90 seconds using high-pressure air. The sample was subsequently applied onto the grid within the Vitrobot Mark IV chamber (FEI). Prior to sample application, the chamber was adjusted to 100% humidity at 4 °C. After the sample was applied, excess solution was blotted for 3 seconds using a blot force of 20, following which the grid was immediately plunged into liquid ethane for vitrification.

Data acquisition

Cryo-electron microscopy data were acquired from three different samples, P005 (2800 movies), P008 (3356 movies), and P009 (4211 movies), using a Titan Krios microscope operated at 300 kV (ThermoFisher Scientific) equipped with a Falcon 3 direct electron detector and a Cs corrector. Movies were taken in nanoprobe mode with a 50 µm C2 aperture and a 100 µm objective aperture, at a magnified pixel size of 0.67 Å. Each movie consisted of 50 frames with a total dose of 50 e−/Å2, and an exposure time of 35 seconds with a dose rate of 0.65 e−/pix/s on the detector. Data collection was automated using EPU (ThermoFisher Scientific).

Data processing

All collected movies were subjected to motion correction using MotionCor2. CTF estimation was performed using CTFFIND-4.1 on the non-dose-weighted micrographs. Subsequently, particles were picked using a Laplacian-of-Gaussian automated picking routine on the dose-weighted micrographs. The particles were extracted in RELION 4, using a box size of 220 pixels. The initial model was generated by low-pass filtering the predicted atomic model to remove high-frequency details and smooth out its shape. For samples P005, P008, and P009, 1.6 million, 1.8 million, and 2.4 million picked particles, respectively, were subjected to rounds of 2D and 3D classification, which resulted in a homogeneous set of projections. This process led to 1.5 million, 1.4 million, and 1.9 million particles being used for the final refinement job in RELION for proteins P005, P008, and P009, respectively. Additional rounds of classification and refinement could further enhance the resolution of the final refinements. The final consensus refinements produced maps with resolutions between 5.7 and 5.9 Å.

References

1. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
2. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
3. Gerben, S. R. et al. Design of Diverse Asymmetric Pockets in De Novo Homo-oligomeric Proteins. Biochemistry 62, 358–368 (2023).
4. Yeh, A. H.-W. et al. De novo design of luciferases using deep learning. Nature 614, 774–780 (2023).
5. Wang, J. et al. Scaffolding protein functional sites using deep learning. Science 377, 387–394 (2022).
6. Wicky, B. I. M. et al. Hallucinating symmetric protein assemblies. Science 378, 56–61 (2022).
7. Roney, J. P. & Ovchinnikov, S. State-of-the-Art Estimation of Protein Model Accuracy Using AlphaFold. Phys. Rev. Lett. 129, 238101 (2022).
8. Anishchenko, I. et al. De novo protein design by deep network hallucination. Nature 600, 547–552 (2021).
9. Verkuil, R. et al. Language models generalize beyond natural proteins. 2022.12.21.521521 Preprint at https://doi.org/10.1101/2022.12.21.521521 (2022).
10. Jendrusch, M., Korbel, J. O. & Sadiq, S. K. AlphaDesign: A de novo protein design framework based on AlphaFold.
11. Moffat, L., Kandathil, S. M. & Jones, D. T. Design in the DARK: Learning Deep Generative Models for De Novo Protein Design. 2022.01.27.478087 Preprint.
12. Goverde, C., Wolf, B., Khakzad, H., Rosset, S. & Correia, B. E. De novo protein design by inversion of the AlphaFold structure prediction network. (2022).
13. Bengio, Y., Léonard, N. & Courville, A. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. Preprint.
14. Norn, C. et al. Protein sequence design by conformational landscape optimization. Proc. Natl. Acad. Sci. 118, e2017228118 (2021).
15. Dauparas, J. et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).
16. Lin, Z. et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. 2022.07.20.500902 Preprint.
17. Watson, J. L. et al. Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. 2022.12.09.519842 Preprint at https://doi.org/10.1101/2022.12.09.519842
18. Zhang, Y. & Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710 (2004).
19. Minkyung Baek [@minkbaek]. Adding a big enough number for ‘residue_index’ feature is enough to model hetero-complex using AlphaFold (green&cyan: crystal structure / magenta: predicted model w/ residue_index modification). #AlphaFold #alphafold2 https://t.co/TX1PnRk5Wd. Twitter https://twitter.com/minkbaek/status/1417538291709071362 (2021).
20. Zivanov, J. et al. New tools for automated high-resolution cryo-EM structure determination in RELION-3. eLife 7, e42166 (2018).
21. Scheres, S. H. W. & Chen, S. Prevention of overfitting in cryo-EM structure determination. Nat. Methods 9, 853–854 (2012).
22. Pettersen, E. F. et al. UCSF ChimeraX: Structure visualization for researchers, educators, and developers. Protein Sci. Publ. Protein Soc. 30, 70–82 (2021).
23. Afonine, P. V., Headd, J. J., Terwilliger, T. C. & Adams, P. D. New tool: phenix.real_space_refine. Comput. Crystallogr. Newsl. 4, 43–44 (2013).
24. Ingraham, J. et al. Illuminating protein space with a programmable generative model.
25. Wu, K. E. et al. Protein structure generation via folding diffusion. Preprint.
26. Protein Structure and Sequence Generation with Equivariant Denoising Diffusion Probabilistic Models. https://nanand2.github.io/proteins/.
27. Kempen, M. van et al. Foldseek: fast and accurate protein structure search. 2022.02.07.479398 Preprint at https://doi.org/10.1101/2022.02.07.479398
28. Cortés, J. Finite-time convergent gradient flows with applications to network