PEDS Advance Access originally published online on August 25, 2004
Protein Engineering Design and Selection 2004 17(7):589-594; doi:10.1093/protein/gzh067
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Site-directed protein recombination as a shortest-path problem
1Bioengineering Option and 3Division of Chemistry and Chemical Engineering, California Institute of Technology, Mail Code 210-41, Pasadena, CA 91125-4100, USA
2 To whom correspondence should be addressed. E-mail: endelman{at}caltech.edu; zgw{at}cheme.caltech.edu; frances{at}cheme.caltech.edu
| Abstract |
|---|
|
|
|---|
Protein function can be tuned using laboratory evolution, in which one rapidly searches through a library of proteins for the properties of interest. In site-directed recombination, n crossovers are chosen in an alignment of p parents to define a set of p(n + 1) peptide fragments. These fragments are then assembled combinatorially to create a library of pn+1 proteins. We have developed a computational algorithm to enrich these libraries in folded proteins while maintaining an appropriate level of diversity for evolution. For a given set of parents, our algorithm selects crossovers that minimize the average energy of the library, subject to constraints on the length of each fragment. This problem is equivalent to finding the shortest path between nodes in a network, for which the global minimum can be found efficiently. Our algorithm has a running time of O(N3p2 + N2n) for a protein of length N. Adjusting the constraints on fragment length generates a set of optimized libraries with varying degrees of diversity. By comparing these optima for different sets of parents, we rapidly determine which parents yield the lowest energy libraries.
Keywords: dynamic programming/laboratory evolution/optimization/protein design/recombination
| Introduction |
|---|
|
|
|---|
Protein design seeks the amino acid sequence that encodes a protein with a desired set of properties (DeGrado, 2001
Sequence design is sometimes called rational design, but many aspects of laboratory evolution are also rationally designed (Kamtekar et al., 1993
; Kono and Saven, 2001
; Voigt et al., 2001
; Moore and Maranas, 2002
), including the library diversity. By diversity we mean how proteins in the library differ from the parents and from each other. The design goals and library creation method should dictate library diversity, but our understanding of this subject is still very limited. In several studies the number of mutations m has been correlated with functional change (Zaccolo and Gherardi, 1999
; Daugherty et al., 2000
; Ostermeier, 2003
; Otey et al., 2004
). We use the corresponding library average
m
as an example of an empirically useful diversity measure.
Although diversity is needed to effect changes in protein function, it is at odds with the need for stably folded proteins (a prerequisite for most functions). Since most mutations are neutral or disruptive to protein structure, the fraction of stably folded proteins in a library tends to decrease with diversity (Daugherty et al., 2000
; Guo et al., 2004
). Equivalently, for an energy function that scores protein stability (Gordon et al., 1999
; Saraf et al., 2004
), the average energy of all proteins in a library
E
tends to increase with diversity. This type of tradeoff is common in design problems with conflicting performance objectives. Our design strategy involves finding libraries on the optimal energydiversity tradeoff surface.
We apply this strategy to site-directed recombination (SDR), in which n crossovers are chosen in an alignment of p structurally-related parents to define a set of p(n + 1) peptide fragments. These fragments are then assembled combinatorially to create a library of pn+1 chimeric proteins (Figure 1). SDR is a relatively new approach to recombination in laboratory evolution (Richardson et al., 2002
; Hiraga and Arnold, 2003
; Meyer et al., 2003
). Other strategies do not involve a specific choice of crossovers (Stevenson and Benkovic, 2002
). Instead, they attempt to generate all possible crossovers using techniques from molecular biology. In practice, however, these methods are often limited in either the number or locations of crossovers (Joern et al., 2002
).
|
SDR benefits from the unique properties of recombination as an evolutionary search strategy. Recombination of homologs is highly effective at neutral evolution because the mutations it introduces have been selected by nature to be compatible with the protein structure, albeit in a different genetic background. A recent SDR experiment identified several functional ß-lactamases with over 50 mutations (Meyer et al., 2003
To optimize SDR libraries for a given set of parents and fixed number of crossovers n, we minimize the average energy of all chimeras
E
, subject to constraints on the length L of each peptide fragment:
![]() | (1) |
| Materials and methods |
|---|
|
|
|---|
Theoretical energies
As with other global optimization algorithms for protein design (Dahiyat and Mayo, 1996
; Gordon and Mayo, 1999
), RASPP can use any energy function with single and pairwise interactions between residues:
![]() | (2) |
k(Sk), is determined by the parent Sk inherited at that position. For a fixed set of crossovers, the average energy of all chimeras in the library can be written as a sum over inheritance patterns:
![]() | (3) |
Consider two libraries, one with k1 crossovers and the other with k crossovers, whose first k1 crossover locations are identical. The difference in average energy between these libraries is
![]() | (4) |
![]() | (5) |
![]() | (6) |
Computational energies
To generate computational results we used SCHEMA disruption, an instance of Equation 2 that counts the number of pairwise interactions broken by recombination (Voigt et al., 2002
; Silberg et al., 2004
):
![]() | (7) |
The contact matrix Cij depends solely on structural information, while
ij uses only the parental sequence alignment. Specifically, Cij = 1 if residues i and j are within 4.5 Å in the parental structure; otherwise Cij = 0. The delta function
ij = 0 if the amino acids
i(Si) and
j(Sj) that are found in the chimera are also present at homologous positions in any single parent. Otherwise, the i j interaction is considered broken and
ij = 1.
SCHEMA disruption has proven useful in guiding SDR of ß-lactamases (Hiraga and Arnold, 2003
; Meyer et al., 2003
) and cytochromes P450 (Otey et al., 2004
), the two systems explored in this study. For SDR of ß-lactamases, we used the crystal structure of TEM-1 (1BTL.pdb) (Jelsch et al., 1993
) and generated a structural alignment of 263 residues with its homolog PSE-4 (1G68.pdb) (Lim et al., 2001
) using Swiss-Pdb Viewer (Guex and Peitsch, 1997
). TEM-1 and PSE-4 have 43% sequence identity. For SDR of cytochromes P450, we used the crystal structure of CYP102A1 (1JPZ.pdb) (Haines et al., 2001
) from Bacillus megaterium, commonly known as P450 BM-3, and generated sequence alignments with Bacillus subtilis homologs CYP102A2 (ORF BG12871) and CYP102A3 (ORF BG12299) using CLUSTALW (Thompson et al., 1994
). Pairwise sequence identities for these P450 homologs are around 65% based on 456 residues. Cytochrome P450 residues are numbered as in 1JPZ.pdb.
Measuring length
Since library composition is invariant with respect to crossover location within a contiguous region of conserved residues, we do not consider conserved residues as potential crossover locations. This effectively reduces the length of the protein (N) by the number of conserved residues. To be consistent we measure fragment length L by the number of residues not conserved across all parents.
Measuring diversity
The mutation level m of each chimera is the minimum number of amino acid changes needed to convert the protein into one of the parents. The average number of mutations
m
in a library with pn+1 chimeras is (
imi)/pn+1.
| Results |
|---|
|
|
|---|
Equation 1 is a shortest-path problem
Every feasible n-crossover library, i.e. one that satisfies the length constraints in Equation 1, can be represented as an n-path from node 0 to column n in the directed graph of Figure 2. The node Xk visited in column k
n corresponds to the position of the kth crossover. To create a one-to-one correspondence between the set of all n-paths and the set of all feasible n-crossover libraries, nodes in adjacent columns are selectively connected. A path that visits node X1 in the first column defines the first peptide fragment as [1, X1] (amino acid residues 1 to X1, inclusive), which has length X1. Thus node 0 is connected to all nodes in the first column that satisfy Lmin
X1
Lmax. Similarly, an arc from node X1 in the first column to node X2 in the second column defines the second peptide fragment as [X1 + 1, X2], which has length X2 X1. Thus node X1 is connected to node X2 if and only if Lmin
X2 X1
Lmax. This process is continued until the last column, where an arc from Xn 1 to Xn defines two peptide fragments: [Xn 1 + 1, Xn] and [Xn + 1, N] for a protein of length N. Thus node Xn 1 is connected to node Xn if and only if Lmin
Xn Xn1
Lmax and Lmin
N Xn
Lmax.
|
Once the arc connections are specified, arc lengths are assigned so that the total length of each n-path equals the average energy of the corresponding library:
![]() | (8) |
![]() | (9) |
![]() | (10) |
In summary, there is a one-to-one correspondence between feasible libraries and n-paths, and the total length of each n-path equals the average energy of the corresponding library. Therefore, finding the shortest n-path is equivalent to solving Equation 1.
Complexity of finding the shortest path
Shortest-path problems can be solved efficiently because of their recursive structure (Lawler, 1976
; Korte and Vygen, 2002
). In the case of Figure 2, the length of the shortest path
from node 0 to node j in column k can be computed using the shortest paths from node 0 to all nodes in column k 1:
![]() | (11) |
No information from other columns is needed. This property is the basis for dynamic programming. Using forward induction, RASPP finds the shortest path to every node in the first column, then the shortest path to every node in the second column, etc. Each evaluation of Equation 11 requires O(N) operations. This is repeated for all O(N) nodes in a column and for each of the n columns, yielding a running time of O(N2n).
In the process of finding the shortest n-path, RASPP also finds the shortest path to every column k
n. These path lengths do not exactly correspond to the solution of Equation 1 with k crossovers because the set of arc connections to the last column must satisfy a different set of constraints, as discussed above. To find optimal libraries with any fixed number of crossovers k
n, the arc connections between column k 1 and column k are updated and Equation 11 is solved O(N2) times as before. This can be repeated for all values k
n with a total running time of O(N2n), the same as a single iteration of RASPP.
Although RASPP libraries are provably optimal when diversity is measured by fragment length, often we are interested in optimality with respect to other diversity measures, such as the average level of mutation
m
. To use fragment length as a surrogate for
m
, we vary the length constraints over the entire range of feasible libraries: Lmin = 1 to N/(n + 1) and Lmax = N/(n + 1) to N nLmin, solving Equation 1 O(N2/n) times. This runs quickly since the arc lengths are not recalculated for each iteration. The ranges for Lmin and Lmax can easily be modified to accommodate experimental constraints arising from the library creation method.
RASPP libraries are optimized with respect to
m
To illustrate RASPP, consider using three cytochrome P450 homologs (CYP102A1/A2/A3, N = 197 non-conserved residues) for the laboratory evolution of novel catalytic properties (Otey et al., 2004
). By varying the length constraints for n = 7 crossovers, we generated 2052 libraries, of which 391 are distinct (Figure 3). As Lmin increases and Lmax decreases, the crossovers become more evenly spaced, resulting in libraries with higher
E
and higher
m
. Designing libraries with more crossovers increases the levels of diversity accessible by SDR, but adding fragments also complicates construction of the library. In this example, the choice of n = 7 provides enough mutants for screening (38 = 6561 chimeras) and sufficiently high levels of mutation for laboratory evolution (nearly
m
= 75) based on data from previous experiments (Otey et al., 2004
).
|
The lowest-energy RASPP libraries at increasing values of
m
define a RASPP curve (Figure 3). To determine how well RASPP curves approximate the optimal energydiversity tradeoff surface, we enumerated all four-crossover libraries for cytochromes P450 CYP102A1/A2 and ß-lactamases TEM-1/PSE-4 (25 x 106 and 20 x 106 libraries, respectively; Figure 4). At most levels of mutation, the RASPP curve provides a good estimate of the lowest energy possible. Exceptions occur in mutation ranges where RASPP does not produce any libraries, e.g. around
m
= 30 for the cytochromes P450. A similar mutation gap can be seen in Figure 3 at 40 <
m
< 50. Such gaps are to be expected when using constraints on fragment length as a surrogate for
m
. Changing the parents or the number of crossovers can shift the location of a gap, as seen by comparing Figures 3 and 4 (which differ in both respects).
|
The pattern of optimal crossovers varies dramatically along a RASPP curve. Figure 5 shows the elements of secondary structure for CYP102A1 (Ravichandran et al., 1993
m
, RASPP favors the ends of the protein to minimize structural disruption. The resulting chimeras inherit a single, large fragment from one parent and most of the remaining fragments contain only a few residues. To create libraries with higher
m
, RASPP must spread out the crossovers and penetrate the middle of the polypeptide chain.
|
Many of the fragments chosen are not intuitive RASPP frequently cuts through secondary structure motifs. For example, the most commonly chosen crossover region (residues 214217, which shows up as a long horizontal black line in Figure 5) lies in the middle of a long
-helix covering the substrate binding pocket. Two other consistently good regions for recombination (residues 248255 and 256276) are also helical. Previous computation-guided experiments have verified that site-directed recombination within secondary structure elements often yields folded proteins (Voigt et al., 2002RASPP curves for parental design
Before choosing optimal crossover locations, one must decide upon a set of parents for recombination. RASPP curves provide a rapid and reliable way of determining which parents yield the lowest energy libraries in a desired diversity range. To illustrate, consider choosing which combination of cytochrome P450 homologs (A1/A2, A1/A3 or A1/A2/A3) is best for laboratory evolution. Even though a library with three parents has more chimeras than one with two parents, the comparison is fair because any random, experimental sample will on average have the same
E
as the entire library.
The RASPP curves for these alternative designs reveal significant differences at mutation levels
m
> 40 (Figure 6). For 40 <
m
< 60, the combination A1/A2 is better than A1/A3 because the former has lower energy. This would be difficult to ascertain by other means, since A1/A2 and A1/A3 both have 65% sequence identity and their non-conserved residues have the same surface accessibility on average (1.5 contacts per non-conserved residue). For 40 <
m
< 50, A1/A2 also has lower energy than A1/A2/A3. For 50 <
m
< 60, A1/A2 and A1/A2/A3 have comparable energy, but A1/A2 is still preferable because adding a third parent increases the cost and complexity of library construction. All three parents are needed to build libraries with
m
> 60.
|
| Discussion |
|---|
|
|
|---|
We have presented computational results using a residue-based energy function called SCHEMA disruption because it is simple yet effective at identifying folded chimeras (Meyer et al., 2003
Equation 6 is our key theoretical result which shows that dynamic programming can be used for SDR library design. In this respect, Equation 6 is analogous to dead-end elimination (DEE) theorems (Desmet et al., 1992
; Goldstein, 1994
; Looger and Hellinga, 2001
), which have led to many successes in protein sequence design (Dahiyat and Mayo, 1996
; Looger et al., 2003
). However, Equation 6 and the DEE theorem have very different consequences for computational protein design. RASPP finds the global energy minimum (for Equation 1) in O(N3p2 + N2n) operations for a protein of length N, making it efficient in theory and practice (Papadimitriou and Steiglitz, 1998
). In contrast, DEE requires an exponential number of operations O(aN) in the worst case. This is unavoidable (unless P = NP) because finding the amino acid sequence with minimum energy is NP-hard (Pierce and Winfree, 2002
). By averaging over the library, we transform protein design from a hard problem to an easy one.
Our tractable formulation of SDR library design also depends on constraining fragment diversity to effect changes in library diversity. We have described RASPP using fragment length, but RASPP can use any measure of fragment diversity compatible with the arc connections in Figure 2. We have shown that constraints on fragment length are effective when library diversity is measured by the average number of mutations
m
. As we learn more about the effects of library diversity on protein evolution, other measures of fragment diversity will be needed.
| Acknowledgments |
|---|
We thank Costas Maranas, Matt DeLisa, Jeff Saven and Niles Pierce for their comments on the manuscript. This work was supported by National Institutes of Health Grant R01 GM068664-01, Army Research Office Contract DAAD19-03-D-0004, the W. M. Keck Foundation, a National Defense Science and Engineering Graduate Fellowship (J.B.E.) and NIH Fellowship F32 GM64949-01 (J.J.S.).
| References |
|---|
|
|
|---|
Arnold,F.H. (ed.) (2000) Evolutionary Protein Design. Academic Press, San Diego.
Dahiyat,B.I. and Mayo,S.L. (1996) Protein Sci., 5, 895903.[Abstract]
Dahiyat,B.I. and Mayo,S.L. (1997) Science, 278, 8287.
Daugherty,P.S., Chen,G., Iverson,B.L. and Georgiou,G. (2000) Proc. Natl Acad. Sci. USA, 97, 20292034.
Deem,M.W. and Bogarad,L.D. (1999) Proc. Natl Acad. Sci. USA, 96, 25912595.
DeGrado,W.F. (2001) Chem. Rev., 101, 30253026.[CrossRef][ISI][Medline]
DeGrado,W.F., Summa,C.M., Pavone,V., Nastri,F. and Lombardi,A. (1999) Annu. Rev. Biochem., 68, 779819.[CrossRef][ISI][Medline]
Desmet,J., De Maeyer,M., Hazes,B. and Lasters,I. (1992) Nature, 356, 539542.[CrossRef]
Goldstein,R.F. (1994) Biophys. J., 66, 13351340.
Gordon,D.B. and Mayo,S.L. (1999) Structure, 7, 10891098.[Medline]
Gordon,D.B., Marshall,S.A. and Mayo,S.L. (1999) Curr. Opin. Struct. Biol., 9, 509513.[CrossRef][ISI][Medline]
Guex,N. and Peitsch,M.C. (1997) Electrophoresis, 18, 27142723.[CrossRef][ISI][Medline]
Guo,H.H., Choe,J. and Loeb,L.A. (2004) Proc. Natl Acad. Sci. USA, 101, 92059210.
Haines,D.C., Tomchick,D.R., Machius,M. and Peterson,J.A. (2001) Biochemistry, 40, 1345613465.[CrossRef][Medline]
Hiraga,K. and Arnold,F.H. (2003) J. Mol. Biol., 330, 287296.[CrossRef][ISI][Medline]
Holland,J.H. (1992) Adaptation in Natural and Artificial Systems. MIT Press, Cambridge, MA.
Jelsch,C., Mourey,L., Masson,J.M. and Samama,J.P. (1993) Proteins, 16, 364383.[CrossRef][ISI][Medline]
Joern,J.M., Meinhold,P. and Arnold,F.H. (2002) J. Mol. Biol., 316, 643656.[CrossRef][ISI][Medline]
Kamtekar,S., Schiffer,J.M., Xiong,H., Babik,J.M. and Hecht,M.H. (1993) Science, 262, 16801685.
Kono,H. and Saven,J.G. (2001) J. Mol. Biol., 306, 607628.[CrossRef][ISI][Medline]
Korte,B. and Vygen,J. (2002) Combinatorial Optimization: Theory and Algorithms. Springer, Berlin.
Kuhlman,B., Dantas,G., Ireton,G.C., Varani,G., Stoddard,B.L. and Baker,D. (2003) Science, 302, 13641368.
Lawler,E. (1976) Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, New York.
Lim,D., Sanschagrin,F., Passmore,L., De Castro,L., Levesque,R.C. and Strynadka,N.C. (2001) Biochemistry, 40, 395402.[CrossRef][Medline]
Looger,L.L. and Hellinga,H.W. (2001) J. Mol. Biol., 307, 429445.[CrossRef][ISI][Medline]
Looger,L.L., Dwyer,M.A., Smith,J.J. and Hellinga,H.W. (2003) Nature, 423, 185190.[CrossRef][Medline]
Meyer,M.M., Silberg,J.J., Voigt,C.A., Endelman,J.B., Mayo,S.L., Wang,Z.G. and Arnold,F.H. (2003) Protein Sci., 12, 16861693.
Mitchell,M. (1996) An Introduction to Genetic Algorithms. MIT Press, Cambridge, MA.
Moore,G.L. and Maranas,C.D. (2002) Nucleic Acids Res., 30, 24072416.
Neylon,C. (2004) Nucleic Acids Res., 32, 14481459.
Ostermeier,M. (2003) Trends Biotechnol., 21, 244247.[CrossRef][ISI][Medline]
Otey,C.R., Silberg,J.J., Voigt,C.A., Endelman,J.B., Bandara,G. and Arnold,F.H. (2004) Chem. Biol., 11, 309318.[CrossRef][ISI][Medline]
Papadimitriou,C.H. and Steiglitz,K. (1998) Combinatorial Optimization: Algorithms and Complexity. Dover, Mineola.
Pierce,N.A. and Winfree,E. (2002) Protein Eng., 15, 779782.
Ravichandran,K.G., Boddupalli,S.S., Hasemann,C.A., Peterson,J.A. and Deisenhofer,J. (1993) Science, 261, 731736.
Richardson,T.H., Tan,X., Frey,G., Callen,W., Cabell,M., Lam,D., Macomber,J., Short,J.M., Robertson,D.E. and Miller,C. (2002) J. Biol. Chem., 277, 2650126507.
Saraf,M.C., Horswill,A.R., Benkovic,S.J. and Maranas,C.D. (2004) Proc. Natl Acad. Sci. USA, 101, 41424147.
Silberg,J.J., Endelman,J.B. and Arnold,F.H. (2004) Methods Enzymol., 388, 3542.[ISI][Medline]
Stevenson,J.D. and Benkovic,S.J. (2002) J. Chem. Soc., Perkin Trans. 2, 14831493.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 46734680.
Voigt,C.A., Kauffman,S. and Wang,Z.G. (2001) Adv. Protein Chem., 55, 79160.
Voigt,C.A., Martinez,C., Wang,Z.G., Mayo,S.L. and Arnold,F.H. (2002) Nat. Struct. Biol., 9, 553558.[ISI][Medline]
Zaccolo,M. and Gherardi,E. (1999) J. Mol. Biol., 285, 775783.[CrossRef][ISI][Medline]
Received August 12, 2004; accepted August 16, 2004.
Edited by Stephen Mayo
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
A. Suemori and M. Iwakura A Systematic and Comprehensive Combinatorial Approach to Simultaneously Improve the Activity, Reaction Specificity, and Thermal Stability of p-Hydroxybenzoate Hydroxylase J. Biol. Chem., July 6, 2007; 282(27): 19969 - 19978. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. P. Treynor, C. L. Vizcarra, D. Nedelcu, and S. L. Mayo Computationally designed libraries of fluorescent proteins evaluated by preservation and diversity of function PNAS, January 2, 2007; 104(1): 48 - 53. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. M. Meyer, L. Hochrein, and F. H. Arnold Structure-guided SCHEMA recombination of distantly related {beta}-lactamases Protein Eng. Des. Sel., December 1, 2006; 19(12): 563 - 570. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Yuan, I. Kurek, J. English, and R. Keenan Laboratory-Directed Protein Evolution Microbiol. Mol. Biol. Rev., September 1, 2005; 69(3): 373 - 392. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. A. Drummond, J. J. Silberg, M. M. Meyer, C. O. Wilke, and F. H. Arnold On the conservative nature of intragenic recombination PNAS, April 12, 2005; 102(15): 5380 - 5385. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




















