Intrinsically disordered protein regions (IDRs) pervasively engage in essential molecular functions, yet they are often poorly conserved as assessed by sequence alignment. To understand the seeming paradox of how sequence variability is compatible with function, we examined the functional determinants for a poorly conserved but essential IDR. We show that IDR function depends on two distinct but related properties: sequence specificity and chemical specificity. While sequence specificity works via linear binding motifs, chemical specificity reflects the sequence-encoded chemistry of multivalent interactions through amino acids across an IDR. Unexpectedly, an apparently essential binding motif can be removed if compensatory changes to the sequence chemistry are made, highlighting the orthogonality and interoperability of both properties. Our results provide a general framework to understand the functional constraints on IDR sequence evolution.
One-Sentence Summary: Interactions driven by intrinsically disordered regions can be understood using a two-dimensional landscape that defines binding via motif-dependent and motif-independent contributions.
Intrinsically disordered proteins and protein regions (collectively referred to as IDRs) play important and often essential roles in many biological processes across all three kingdoms of life (1). IDRs frequently engage in molecular interactions, and their inherent structural plasticity allows them to bind through a variety of mechanisms (2, 3). As such, understanding how IDR sequences determine the modes of molecular recognition is key to mapping from sequence to function.
Despite their importance for function, IDRs are often poorly conserved as assessed by
alignment-based methods (4–7). The notable exceptions to this are short linear motifs (SLiMs),
conserved stretches of 5 to 15 amino acids that define sequence-specific recognition sites (
IDRs can evolve more rapidly than folded domains (4, 6). We therefore anticipated that S.
cerevisiae proteins may possess conserved folded regions alongside less well-conserved IDRs
(Fig. 1A). To assess this, we performed a systematic analysis of sequence conservation and
predicted disorder across the S. cerevisiae proteome (see Methods) (
We sought to identify a model protein to test our functional conservation hypothesis. Because
long IDRs are abundant in the context of chromatin-associated proteins (fig. S6) (
Abf1 is an essential S. cerevisiae GRF (
We first established which Abf1 regions were essential for function. Previous studies on Abf1
focused on the DBD, although truncation experiments identified apparently essential C-terminal
sequences (CS1/2) in IDR2 (
Our molecular dissection revealed that IDR2 but not IDR1 is essential (Fig. 1F). The maximal
C-terminal IDR2 truncation, IDR2449-662 (Fig. 1G), that was reported to be viable in the presence
of IDR1 (
Based on prior work, we anticipated that conservation in IDRs could be considered in terms of
compositional and linear sequence conservation (
We assumed that the relative compositional conservation of IDR2 would provide the molecular basis for our functional conservation hypothesis and anticipated that orthologous IDRs would support viability in S. cerevisiae. To test this, we took eighteen Abf1 orthologs, identified their IDRs corresponding to S. cerevisiae IDR2449-662 from the full-length proteins (Fig. 2C), and replaced the S. cerevisiae IDR2449-662 in our test plasmid with each of these orthologous IDR2s. We then tested the resulting chimeric constructs in our plasmid shuffle assay (Fig. 2D). Unexpectedly, outside of the sensu stricto S. cerevisiae complex (the bottom four species in Fig. 2D), only two of the fifteen orthologous IDR2s were viable (Fig. 2D), with no obvious relationship between sequence composition, sequence identity, or sequence length and function (Fig. 2D, fig. S11). To our surprise, our expectation that compositionally conserved IDRs would be functionally conserved proved incorrect.
We next wondered whether IDRs from proteins with similar functions might confer viability. We tested several candidates with similar IDR amino acid compositions, including IDRs from Abf1 (IDR1), other GRFs (Rap1, Mcm1), a yeast transactivator (Gcn4), and a human insulator (CTCF) (Fig. 2E, fig. S12). We also tested unrelated but compositionally similar low-complexity IDRs from the human RNA binding protein FUS and the yeast translation termination factor Sup35 (Fig. 2E, fig. S12). All of these IDRs failed to confer viability, suggesting that Abf1’s IDR2 provides specific molecular recognition.
To further understand how variations in IDR2 influence function, we generated a library of 48
randomly mutagenized IDR2 variants (Fig. 2F, fig. S13-S14). Surprisingly, this analysis
simultaneously revealed remarkable robustness with respect to some mutations and sensitivity
with respect to others. A variant with 56 point mutations was viable and another with 13 was not.
Statistically speaking, in the limit of our sample sizes, the mutational burden is not strongly
predictive of viability (Fig. 2G, fig. S15), and a linear conservation analysis reveals “conserved”
residues distributed approximately evenly across the sequence (Fig. 2H). Paradoxically, our
results thus far imply that IDR2 is (i) simultaneously robust and sensitive to mutations, and (ii)
compositionally relatively well-conserved yet cannot be replaced by most orthologs.
Although most orthologous IDR2s could not rescue viability, we unexpectedly identified several
IDRs from other S. cerevisiae proteins that conferred viability in place of IDR2 (Fig. 3A). These
included compositionally similar IDRs from the yeast transactivators Gal4 and Pho4, and from
the GRF Reb1 (fig. S16). Gal4, Pho4, and Reb1 are DNA binding proteins that can trigger
chromatin opening in vivo (
Considering that Gal4, Pho4, and Reb1 can mediate chromatin remodeling (one of Abf1’s functions), we wondered if these IDRs share a common SLiM for recruiting the requisite machinery. Given that SLiMs depend on their specific linear sequence (8), we reasoned that shuffling a SLiM would disrupt its function. As an initial test, we designed three globally shuffled variants with the conserved positions in Fig. 2H held fixed (Fig. 3B). All global shuffles were inviable, demonstrating that IDR2-like composition is insufficient for viability and implicating essential regions that depend on linear sequence. In support of this inference, almost all of the composition-dependent sequence features calculated for the inviable vs. viable sequences in Fig. 2F were statistically indistinguishable from one another (fig. S13). Taken together, our evolutionary comparisons and random mutagenesis paint a picture in which composition may matter but is not sufficient for function.
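The global shuffle with fixed positions described above can be sketched as follows. This is a minimal illustration and not the procedure used to build the constructs: the function name `shuffle_with_fixed_positions`, the toy sequence, and the choice of fixed indices are all hypothetical.

```python
import random


def shuffle_with_fixed_positions(seq, fixed, seed=None):
    """Shuffle the residues of seq while leaving the 0-based indices in
    `fixed` untouched. Amino acid composition is preserved exactly."""
    rng = random.Random(seed)
    fixed = set(fixed)
    # Positions that are allowed to move, and the residues found there.
    free_idx = [i for i in range(len(seq)) if i not in fixed]
    free_res = [seq[i] for i in free_idx]
    rng.shuffle(free_res)
    out = list(seq)
    for i, aa in zip(free_idx, free_res):
        out[i] = aa
    return "".join(out)


# Toy sequence and fixed positions (hypothetical, for illustration only).
toy_seq = "MDEYSNNDGSYQEDLTFNSA"
toy_fixed = [0, 5, 10]
toy_variant = shuffle_with_fixed_positions(toy_seq, toy_fixed, seed=1)
```

Because only the free positions are permuted, any composition-based sequence feature is identical between the original and the shuffled variant, which is what makes such a variant a test of linear sequence dependence.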
These analyses – and especially the global shuffle variants – implied the presence of a
non-conserved motif. To identify this putative motif we developed an unbiased approach termed
sequential sequence shuffling (Fig. 3C). IDR2 was subdivided into non-overlapping 30-residue
windows, and the sequence in each window was locally shuffled. This revealed two central
windows that did not tolerate shuffling, which we confirmed by shuffling everything except the
central 60-residue subsequence (Fig. 3C). We then repeated the procedure using 10-residue
windows within the 60-residue subsequence. We identified a 20-residue subsequence (the
essential motif, EM) that could not be shuffled and was essential for IDR2 function (Fig. 3C).
Gratifyingly, the essential motif overlaps with the most conserved region in our random
mutagenesis (Fig. 3D). While this region is unremarkable with respect to other sequence
properties and not conserved across orthologs, it is predicted to form a transient helix, a feature
frequently associated with IDR-mediated interactions (fig. S17) (
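As a minimal sketch of the sequential sequence shuffling idea, the snippet below builds one variant per non-overlapping window, with only that window locally shuffled. It is an illustration under stated assumptions and does not reproduce the interface of our computational tool: the function name `sequential_shuffle_library` and the toy sequence are hypothetical.

```python
import random


def sequential_shuffle_library(seq, window, seed=None):
    """Return a list of (start, end, variant) tuples. Each variant has the
    residues in seq[start:end] shuffled and every other position unchanged.
    Windows are non-overlapping; the last window may be shorter."""
    rng = random.Random(seed)
    library = []
    for start in range(0, len(seq), window):
        end = min(start + window, len(seq))
        chunk = list(seq[start:end])
        rng.shuffle(chunk)
        variant = seq[:start] + "".join(chunk) + seq[end:]
        library.append((start, end, variant))
    return library


# Toy 70-residue sequence (hypothetical), scanned with 30-residue windows.
toy_idr = "MDSEQ" * 14
toy_library = sequential_shuffle_library(toy_idr, window=30, seed=0)
```

A second, finer pass could call the same function with `window=10` on the subsequence that did not tolerate shuffling in the first pass, mirroring the two-step procedure described in the text.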
Given that modular SLiMs should confer function when inserted into a neutral context, we tested if the essential motif met this definition. As our neutral context, we selected the phosphomimetic variant of the low-complexity IDR from the human RNA binding protein FUS (FUS12E) (63) (fig. S18). FUS12E is a compositionally uniform low-complexity disordered region that lacks secondary structure or known binding motifs (63, 64). However, FUS12E contains uniformly spaced hydrophobic (aromatic) and acidic residues, making it an ideal neutral IDR with sequence properties similar to IDR2. While FUS12E alone was inviable (Fig. 2E), insertion of the essential motif into the FUS12E context “transformed” the FUS12E sequence to make it viable. This result confirmed the validity of the essential motif as a true modular SLiM: it cannot be shuffled and can confer function when placed in another context (Fig. 3E).
Might regions homologous to the essential motif exist in the other functional IDRs identified in Fig. 3A? A global sequence alignment between IDR2 and Gal4768-881 was relatively poor, with only one sub-region showing alignment (fig. S19). Despite its low identity, this alignment revealed two remotely homologous regions, which we named Abf1G4-like and Gal4G4 (Fig. 3F). We wondered if the Abf1G4-like or Gal4G4 subsequence contained bona fide modular SLiMs. To test this, as with the essential motif, we inserted Abf1G4-like or Gal4G4 into the FUS12E context where both conferred viability (Fig. 3G). To confirm this result we tested the Gal4G4 in three more otherwise inviable IDR contexts (Fig. 3G). Convincingly, in each case the 17-residue Gal4G4 subsequence conferred viability.
How could all three of these quite different subsequences confer viability? If the essential motif
includes a hydrophobic-faced transient helix (fig. S17), we speculated that an alternative
hydrophobic-faced transient helix might also work. Accordingly, we designed an IDR2
alternative by inserting the 24-residue transient helix from the human RNA binding protein
TDP-43 (fig. S20) into FUS12E (
Given the importance of context for Gal4G4 (Fig. 3I), we wondered if context mattered for other motifs. While context shuffling was tolerated in wildtype IDR2 (Fig. 3C), a rationally designed IDR2 variant with reduced hydrophobicity outside of the essential motif was inviable, as was a serendipitous mutant generated in our random mutagenesis where the essential motif was unaltered (Fig. 3J). Similarly, for our rationally designed FUS12E+TDP-43 variant, the aromatic residues in the context were essential for viability (Fig. 3J). Collectively, our results confirm that appropriate context chemistry is essential, revealing a second determinant of Abf1 IDR2 function.
Fig 3: Sequence motifs play a key role in IDR2 function. (A) While most orthologs are inviable, several entirely unrelated IDRs can confer viability. (B) Sequence shuffles with ‘conserved’ positions identified by random mutagenesis (Fig. 2H) fixed are inviable, demonstrating that even with the protection of potentially conserved residues, IDR2 composition alone is insufficient for viability. (C) Sequential sequence shuffling pinpoints an essential motif (EM) in the center of IDR2. (D) The essential motif is not conserved across orthologs and is indistinct in terms of sequence features, yet it emerges as the most conserved subsequence in our random mutagenesis (solid bars in mutagenesis subfigure). (E) Insertion of the essential motif into a non-functional IDR (FUS12E1-163) transforms that IDR to be functional. (F) Three putative motifs are shown in the context of their IDRs. Abf1G4-like and Gal4G4 were identified by sequence alignment of IDR2 and Gal4768-881 (fig. S21). (G) Abf1G4-like and Gal4G4 confer viability when introduced into FUS12E. (H) The transient, hydrophobic helix from TDP-43 also confers viability when inserted into FUS12E. (I) The FUS12E IDR context with Gal4G4 present is a key determinant of function. (J) The IDR2 sequence context around the essential motif is also a key determinant of function. (K) Compositionally matched subsequences taken from a range of transcription factors also provide viability when inserted into FUS12E.
To confirm that Abf1G4-like and Gal4G4 contained bona fide SLiMs, we reasoned that an essential control experiment would be to take unrelated but length-matched subsequences with Abf1G4-like/Gal4G4-like amino acid compositions and demonstrate that these were inviable. We identified five subsequences that were compositionally similar to Gal4G4 from both yeast and non-yeast transcription factor IDRs (fig. S21). If IDR2 function is SLiM-dependent, then these non-alignable subsequences from another species should be inviable, given that, other than composition, they were randomly selected. To our surprise, all six of these 17-residue subsequences were viable in the FUS12E context (Fig. 3K).
This result prompted us to step back and reconsider our data. Conventionally speaking, the ability to insert a short (<20 residue) sequence into a non-functional IDR context and confer function is interpreted as a simple and unambiguous demonstration of a bona fide modular motif (e.g., a SLiM). Given sequence-specific motifs are often defined by two characteristics (an inability to tolerate shuffling and autonomous modular activity), the essential motif is clearly a true motif (Fig. 3C, Fig. 3E). However, the finding that subsequences that were compositionally similar but unrelated in terms of the absolute sequence were also functional implied that we either had a remarkable ability to identify de novo motifs, or that something more general was at play.
Our results thus far identified two determinants of function: (i) the presence of a motif, and (ii)
the presence of a sequence context that we interpret to mediate distributed multivalent
interactions (
Given motifs are, by definition, sequence-specific, i.e., they depend on their contiguous linear amino acid sequence, it should not be possible to distribute a motif across the IDR context and maintain function. In keeping with this, a variant with the essential motif residues distributed across the FUS12E context was inviable (Fig. 4C). This variant can be interpreted as simultaneously disrupting the motif but also (modestly) enhancing the context through, for example, the hydrophobic residues that are redistributed (Fig. 4D, 1-to-2-to-3).
Fig 4: IDR2 mediated interactions can be understood via a two-dimensional binding landscape. (A) IDR-mediated interactions can be understood in terms of motif binding and context binding. (B) The combination of motif and context binding can be projected onto a simple two-dimensional binding surface. (C) The essential motif is a true motif, in that it cannot be distributed across the IDR sequence. (D) Previous designs can be interpreted through the two-dimensional binding landscape. (E) Variants with distributed motifs identified by composition are functional in both FUS12E and Sup35 backgrounds. (F) Rational design of a motif-free FUS12E variant. (G) Sufficient acidity in the IDR context is essential for viability. (H) Viable and inviable sequences can be classified based on charge and binding scores, parameters based on the weighted sequence composition. Circles are sequences that lack an essential motif (or TDP-43 motif), squares are sequences that have an essential motif (or TDP-43 motif), and stars are completely synthetic sequences designed to titrate the space (fig. S25). Analogous plot for sequences generated by random mutagenesis shown in fig. S23. (I) Rational designs that should disrupt phase separation are viable, suggesting liquid-liquid phase separation may not play a role in Abf1 function. (J) Summary schematic (to be updated).
Our binding landscape model raised an intriguing possibility: What if Gal4G4 and the other sequences identified in Fig. 3K were not bona fide motifs, but instead altered the IDR context, albeit very locally? To test this, we asked if variants in which the residues of these sequences were distributed across the context were viable. In all cases and over multiple distinct contexts we discovered that these distribution variants were viable, confirming our hypothesis (Fig. 4E). The functional yet motif-free sequences identified in Fig. 4E prompted a rational de novo design based on basic chemical principles. Given that removing hydrophobic residues from contexts abolished viability (Fig. 3I, J) and given that Gal4G4 and the other sequences in Fig. 4E must function by modulating the context, we reasoned that this likely occurs through an increased number of hydrophobic residues. Accordingly, we asked if a FUS12E variant with additional evenly distributed hydrophobic residues (+4 tyrosine (aromatic), +3 methionine (aliphatic), as in Gal4G4) would be viable. Indeed, even though this design was wholly artificial and even though wild-type IDR2 requires a bona fide motif, this design was viable (Fig. 4F). This result strikingly confirms that SLiM-dependent function and function conferred by sequence chemistry can coexist on the same set of axes.
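One way to picture a distribution variant is to insert each residue of a candidate subsequence at roughly evenly spaced points in the context, so that the residues are no longer contiguous. The sketch below assumes even spacing by insertion; the actual placement used for the constructs is not specified here, and the function name `distribute_residues` and the toy sequences are hypothetical.

```python
def distribute_residues(context, motif):
    """Insert the residues of motif, in order, at roughly evenly spaced
    points within context. The result has len(context) + len(motif)
    residues, and no two motif residues are adjacent as long as the
    context is long enough to separate them."""
    n = len(motif)
    # Cut points that split the context into n + 1 nearly equal pieces.
    cuts = [round(k * len(context) / (n + 1)) for k in range(n + 2)]
    pieces = [context[cuts[k]:cuts[k + 1]] for k in range(n + 1)]
    out = [pieces[0]]
    for aa, piece in zip(motif, pieces[1:]):
        out.append(aa)
        out.append(piece)
    return "".join(out)


# Toy context and subsequence (hypothetical, for illustration only).
toy_context = "SG" * 20
toy_motif = "WFYLM"
toy_distributed = distribute_residues(toy_context, toy_motif)
```

The distributed variant adds the same residues to the context as a contiguous insertion would, so the two differ only in whether the linear arrangement of the inserted residues is preserved.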
Finally, we asked if hydrophobicity was the only chemical feature in the IDR context that mattered. Our evolutionary analysis indicated that both hydrophobicity and acidity were similarly conserved (Fig. 2B). Accordingly, we tested if acidic residue depletion would compromise viability, which it did (Fig. 4G). Based on these principles, we developed a simple composition-based metric that quantifies IDR sequences in terms of charge and binding score (see Methods). These two parameters effectively delineated 88 sequences, where almost the only examples of functional sequences that fall inside the inviable region possess bona fide motifs (Fig. 4H squares, fig. S24). To test the predicted boundaries, we designed a series of completely synthetic IDRs that straddle charge and binding scores (Fig. 4H stars, fig. S25) and confirmed the predictive power of our composition-based metric for context viability. In summary, our results support a model in which two orthogonal axes (sequence-specific motifs and distributed multivalent sites) define a two-dimensional landscape in which sequence-to-function mapping in IDR2 can be interpreted.
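To make the idea of a composition-based metric concrete, the sketch below computes two per-residue quantities from sequence alone. It is not the metric defined in the Methods: the net charge per residue uses the standard assignment of K/R as +1 and D/E as -1, while the residue weights in `BINDING_WEIGHTS` are placeholder values assumed purely for illustration.

```python
# Placeholder weights (assumed for illustration; not the values used in
# the Methods). Aromatic residues are weighted above aliphatic ones.
BINDING_WEIGHTS = {
    "W": 1.0, "F": 1.0, "Y": 1.0,
    "L": 0.5, "I": 0.5, "M": 0.5, "V": 0.5,
}


def net_charge_per_residue(seq):
    """(count of K and R) minus (count of D and E), divided by length."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return (positive - negative) / len(seq)


def binding_score(seq, weights=BINDING_WEIGHTS):
    """Weighted composition: sum of per-residue weights divided by length.
    Residues without an entry in weights contribute zero."""
    return sum(weights.get(aa, 0.0) for aa in seq) / len(seq)


# Toy sequences (hypothetical) to show the two coordinates.
toy_charge = net_charge_per_residue("DDEEKR")
toy_binding = binding_score("YYLLSS")
```

Each sequence then maps to a point (charge, binding score), and a viability boundary in that plane could be drawn from experimentally tested variants.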
Recent work has invoked intracellular phase transitions (and specifically liquid-liquid phase
separation, LLPS) to explain molecular principles underlying chromatin organization (
IDR-mediated interactions have generally been viewed through the lens of either
sequence-specific binding motifs (e.g., SLiMs) or distributed multivalent interactions. These
interaction modes are determined by sequence-specificity and chemical specificity, respectively,
and elegant work from many groups established the functional importance of both (
Our results can be rationally interpreted via a two-dimensional binding landscape (Fig. 4B). In
this model, the determinants of function reflect how IDR2 engages in intermolecular interactions
via some interoperable combination of motif-dependent and context-dependent binding. Based
on our molecular understanding, we were able to design de novo synthetic IDRs that were
functional, although dramatically different and wholly unrelated to the wildtype sequence. These
include variants with an established helical binding motif and multiple variants without motifs.
The latter multiplicity makes it highly unlikely that we involuntarily generated a new motif each
time and is all-the-more surprising as wild-type IDR2 depends on a specific motif. We note that
biophysical hints for this model are found in numerous in vitro reconstituted systems (
Our divergent de novo designs demonstrate the depth of our understanding of the functional determinants for Abf1 IDR2 (Fig. 3, 4). They also highlight the power of interpreting IDR-mediated interactions via our two-dimensional landscape. We suggest that many IDRs studied previously can be placed somewhere on the landscape, although the boundaries for function will naturally vary in a system-specific way. Finally, to help other groups identify essential subregions, we developed a computational tool for constructing sequential sequence shuffle libraries (https://pipit.readthedocs.io/).
Functions based on molecular interactions generally involve some degree of specificity, both
towards a partner of interest as well as against off-target inhibitory interactions. Specificity is
typically considered in terms of shape complementarity and chemical compatibility, two features
afforded by folded domains and, to a lesser extent, SLiMs (
In light of this evolutionary leeway, the limited functional conservation in IDR2 across orthologs
despite the conservation of amino acid composition and sequence features may seem surprising
(Fig. 2, fig. S29). To verify that full-length Abf1 performs analogous functions in other species,
we confirmed that full-length Abf1 from K. lactis is viable in S. cerevisiae (fig. S1) (
Our random mutagenesis reveals that IDR2 is remarkably sensitive to small perturbations (Fig. 2G), even though much larger sequence changes offer alternative and functional variants. We interpret the fact that IDR2 is on the edge of viability as a signature of sensitivity. If IDR2 were deep inside the ‘bound’ regime, tuning molecular function via PTMs or additional partners may be challenging. Instead, by sitting close to the conceptual midplane on our landscape between bound and unbound, the wildtype sequence is maximally sensitive (Fig. 4D).
We focused on function measured by viability in S. cerevisiae growing under low-challenge
laboratory conditions. This specific growth niche likely does not assess all facets of Abf1
function. The importance of other Abf1 regions and features in alternative growth conditions
remains unassessed so far. How might alternative motifs or sequence features contribute to IDR
function in Abf1? Ongoing work implies large-scale IDR-dependent remodeling of transcription
during glucose starvation, and IDRs have been proposed to function as intrinsic sensors of
intracellular state (
Competing interests: ASH is a scientific consultant for Dewpoint Therapeutics. All other authors declare no competing interests.
supplementary material, or online at
https://github.com/holehouse-lab/supportingdata/tree/master/2021/Langstein-Skora_2021.
The Python package PIPIT (https://pipit.readthedocs.io/) allows sequence libraries for
sequential sequence shuffling to be generated automatically.
Materials and Methods
Supplementary Text
Figs. S1 to S31
References (
14. A. N. Amin, Y.-H. Lin, S. Das, H. S. Chan, Analytical Theory for Sequence-Specific Binary Fuzzy Complexes of Charged Intrinsically Disordered Proteins. J. Phys. Chem. B. 124, 6709–6720 (2020).
15. K. Bugge, I. Brakti, C. B. Fernandes, J. E. Dreier, J. E. Lundsgaard, J. G. Olsen, K. Skriver, B. B. Kragelund, Interactions by Disorder - A Matter of Context. Front Mol Biosci. 7, 110 (2020).
in yeast. EMBO J. 37, e97490 (2018).
63. Z. Monahan, V. H. Ryan, A. M. Janke, K. A. Burke, S. N. Rhoads, G. H. Zerze, R. O’Meally, G. L. Dignon, A. E. Conicella, W. Zheng, R. B. Best, R. N. Cole, J. Mittal, F. Shewmaker, N. L. Fawzi, Phosphorylation of the FUS low-complexity domain disrupts phase separation, aggregation, and toxicity. EMBO J. 36, 2951–2967 (2017).
64. K. A. Burke, A. M. Janke, C. L. Rhine, N. L. Fawzi, Residue-by-residue view of in vitro FUS granules that bind the C-terminal domain of RNA polymerase II. Mol. Cell. 60, 231–241 (2015).
Natl. Acad. Sci. U. S. A. 117, 27346–27353 (2020).