PEDS Advance Access originally published online on March 28, 2008
Protein Engineering Design and Selection 2008 21(5):311-317; doi:10.1093/protein/gzn007

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oxfordjournals.org

Amino acid alphabet size in protein evolution experiments: better to search a small library thoroughly or a large library sparsely?

Enrique Muñoz1,2 and Michael W. Deem1,2,3

1Departments of Bioengineering and 2Physics and Astronomy, Rice University, Houston, TX 77005–1892, USA

3 To whom correspondence should be addressed. E-mail: mwdeem{at}rice.edu


    Abstract
We compare the results obtained from searching a smaller library thoroughly versus searching a more diverse, larger library sparsely. We study protein evolution with reduced amino acid alphabets, by simulating directed evolution experiments at three different alphabet sizes: 20, 5 and 2. We employ a physical model for evolution, the generalized NK model, that has proved successful in modeling protein evolution, antibody evolution and T-cell selection. We find that antibodies with higher affinity are found by searching a library with a larger alphabet sparsely than by searching a smaller library thoroughly, even with well-designed reduced libraries. We also find ranked amino acid usage frequencies in agreement with observations of the CDR-H3 variable region of human antibodies.

Keywords: antibody engineering/directed evolution/generalized NK model/protein engineering/reduced amino acid code


    Introduction
Antibodies are not only a primary actor in the vertebrate immune system; they also play an increasing role in medical and biotechnology applications. Their most attractive feature is the ability to recognize and bind molecules with high affinity and potentially high specificity, which makes them excellent agents for clinical in vitro diagnosis (Schweitzer et al., 2000; Yang et al., 2005; Kozel et al., 2004), affinity chromatography and sensing (Chemla et al., 2000; Saleh and Sohn, 2003; Grossman et al., 2004). There are over 350 monoclonal antibody-based medicines in clinical trials, three times the number of a decade ago (Marshall et al., 2005).

A variety of techniques have been developed for the creation of combinatorial mutant libraries of synthetic antibodies (Barbas III et al., 1991; Winter et al., 1994; Hanes and Plückthun, 1997; Boder et al., 2000; Gold, 2001; Wilson et al., 2001; Kawarasaki et al., 2003; Hoogenboom, 2005). Phage display (Barbas III et al., 1991; Winter et al., 1994; Hoogenboom, 2005) involves the generation of libraries of mutant cDNA, which are fused with the DNA of surface proteins of bacteriophages M13 or fd. Recombinant proteins are displayed on the phage surface, and the substrate is bound to a spatially addressed surface or beads so that the most strongly binding antibodies can be selected. Ribosome and mRNA display (Hanes and Plückthun, 1997; Gold, 2001; Wilson et al., 2001; Hoogenboom, 2005) constitute a relatively recent, essentially in vitro display technique. The method relies on the stable formation of a complex of an antibody fragment and its encoding mRNA and ribosome. The antibody library constructed in this way can be screened, and the selected members are amplified. Microbial cell display (Stahl and Uhlén, 1997; Link et al., 2007; Jung et al., 2007), in particular via the yeast Saccharomyces cerevisiae, has successfully yielded very high affinity antibodies (Boder et al., 2000). Alternative protocols in this method include random mutagenesis in both the VL and VH variable regions, and direct selection of cell repertoires by flow cytometry (Link et al., 2007).

All the experimental methods entail multiple rounds of screening and amplification to identify the strongly binding antibodies in the sequence space of all possible antibodies. The costs and time of the experiments increase in direct proportion to the number of rounds of selection and amplification that must be performed. Since the experimental budget is necessarily limited, one might wonder if it is more efficient to create and search thoroughly a smaller library or to create and search sparsely a larger library.

One strategy to reduce the available antibody sequence space is to create the library from a restricted alphabet of amino acids. If we define the length of the variable region of interest to be $L$, the size of a potentially complete library synthesized from an amino acid alphabet of size $Q$ is $Q^L$. The typical human immunoglobulin variable region has a length of ~100 amino acids (Zemlin et al., 2003a), and therefore by employing, for example, a reduced alphabet of $Q = 5$ amino acids, the available sequence space is reduced by a factor of $(5/20)^{100}$. Indeed, experiments have been performed with reduced alphabets of $Q = 4$ (Fellouse et al., 2004, 2006) and $Q = 2$ (Fellouse et al., 2005). In those studies, the choice of the ‘optimal’ amino acid subset was based on a preliminary statistical analysis of the amino acid usage frequencies in the hypervariable complementarity-determining regions (CDRs) of human antibodies (Kabat et al., 1977; Collis et al., 2003; Zemlin et al., 2003a) and of engineered antibodies (Fellouse et al., 2004). Based on their natural abundances, the four amino acids Tyr, Ser, Ala and Asp, belonging to different chemical groups (Tan et al., 2004), were chosen as a potential optimal tetrameric code among the entire alphabet. In particular, Tyr, which seems to have a dominant functional role in antigen recognition at the contact sites, was included (Fellouse et al., 2004, 2006). Parenthetically, it is worth noting that in those studies (Fellouse et al., 2004, 2006), substitution of Tyr by Phe yielded similar or even slightly improved affinities. Both amino acids belong to the same chemical group (Tan et al., 2004), possessing an aromatic ring; however, owing to its hydroxyl group, Tyr is hydrophilic whereas Phe is hydrophobic. For the minimal binary alphabet (Fellouse et al., 2005), based on similar criteria, the combination Tyr and Ser was chosen as optimal.
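The size of the reduction quoted above can be checked with a few lines of arithmetic. The sketch below works in log10 units, since $Q^L$ for $L = 100$ is awkward to print directly; the function name is illustrative only.

```python
import math

def log10_library_size(q: int, length: int) -> float:
    """Return log10 of Q**L, the number of sequences in a complete library."""
    return length * math.log10(q)

L = 100  # variable-region length quoted in the text
full = log10_library_size(20, L)     # complete 20-letter alphabet
reduced = log10_library_size(5, L)   # reduced 5-letter alphabet
reduction = reduced - full           # log10 of (5/20)**100

print(f"Q = 20: about 10^{full:.1f} sequences")
print(f"Q = 5:  about 10^{reduced:.1f} sequences")
print(f"reduction factor: about 10^{reduction:.1f}")
```

Even the reduced library remains far larger than any population that can be screened, which is why the question of thorough versus sparse search is one of degree.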

On the other hand, there are observations that suggest a reduced alphabet may lead to suboptimal results. It has been observed (Zemlin et al., 2003a) that, even though there are biases toward usage of some of the amino acids, particularly Tyr, highly variable regions such as the CDR-H3 of antibodies present an almost uniformly random usage of the amino acids in both humans and mice. In the experimental studies with restricted alphabets (Fellouse et al., 2004, 2005, 2006), the minimal dissociation constants obtained were $K_d = 1.6 \pm 0.4$ nM for $Q = 4$ and $K_d = 60 \pm 20$ nM for $Q = 2$, which, despite being comparable with those achieved by the natural immune system, are still orders of magnitude higher than the dissociation constants obtained through yeast display methods that employ the entire $Q = 20$ amino acid alphabet (Boder et al., 2000). There are also theoretical arguments concerning the minimal size of the amino acid alphabet needed to design a protein (Shakhnovich, 2006): the propensity to create the large energy gap that allows the existence of a unique folded state tends to increase with the diversity of the alphabet. Also, the energetic and entropic stability of the folded state of a real protein is enhanced by the availability of a larger alphabet (Shakhnovich, 2006). While the stability of the antibody is ensured by the framework and not necessarily susceptible to this argument, the stability of the antibody–substrate complex is governed by the diversity available in the library.

In this work, we present quantitative arguments to discuss the optimal amino acid alphabet size for directed evolution in protein engineering. For this purpose, we simulated evolution experiments with three different amino acid alphabet sizes: 2, 5 and 20. We develop a theory based on statistical mechanics, the generalized NK model. This theory has proved successful in modeling protein evolution (Bogarad and Deem, 1999; Earl and Deem, 2004), antibody evolution (Deem and Lee, 2003; Gupta et al., 2006) and T-cell selection (Park and Deem, 2004; Zhou and Deem, 2006). In this context, the fitness of a given protein in the evolving population is given by a random energy functional that represents its combined ability to fold and bind to a ligand. The statistical average of this energy is proportional to the Gibbs free energy of association between the protein and its ligand, and is thus proportional to the logarithm of the binding constant. Evolutionary dynamics is driven by successive generations of subdomain swapping, point mutations, screening and selection. Our results suggest that a larger amino acid alphabet leads, in the long term, to lower evolved energies, and therefore to higher binding constants. However, we also show that proteins designed with simplified amino acid alphabets evolve faster, in the sense that they reach their energy minima in a smaller number of generations. From our simulations, we also obtain ranked frequency distributions for the amino acid usage in the complete 20 amino acid alphabet, which show good agreement with the corresponding distribution experimentally observed in the CDR-H3 loop of human immunoglobulins. Finally, a comparison between the Shannon entropies calculated from our simulation and the corresponding values obtained from the observed frequency distributions in the human hypervariable CDR-H3 loop (Zemlin et al., 2003a, 2003b) reveals that, on average, our sequences evolved by simulated directed evolution encode a similar amount of information to the human CDR-H3 loop.


    Methods
Generalized NK model

We developed a theory from statistical mechanics (Bogarad and Deem, 1999; Park and Deem, 2004), in which we represent the fitness of a given protein sequence within the population by the generalized NK model. The energy function represents the combined ability of the protein to fold and bind to a ligand, and is represented by the expression

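$$U = \sum_{\alpha=1}^{M} U^{sd}_{\alpha} + \sum_{\alpha>\gamma=1}^{M} U^{sd\text{–}sd}_{\alpha\gamma} + \sum_{i=1}^{P} U^{c}_{i} \qquad (1)$$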
and is composed of three parts: secondary structural subdomain energies (Usd), subdomain–subdomain interaction energies (Usd–sd) and chemical binding energies (Uc).

Simulated proteins are sequences of length 100 amino acids. They are composed of M = 10 secondary structural subdomains, of length N = 10 amino acids each. Secondary structural subdomains can be of one of S = 5 different types (helices, strands, loops, turns and others), and the S different subdomain energy terms are represented by the NK model.


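$$U^{sd}_{\alpha} = \frac{1}{\sqrt{M(N-K)}} \sum_{j=1}^{N-K+1} \sigma_{\alpha}(a_j, a_{j+1}, \ldots, a_{j+K-1}) \qquad (2)$$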

The range of interactions within a single subdomain extends over $K = 4$ amino acids. The quenched unit-normal random number $\sigma_{\alpha}$ in Eq. (2) is different for each value of its argument, for each of the $1 \le \alpha \le S$ subdomain types.
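The idea of a quenched random energy over sliding windows of $K$ residues can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the subdomain energy is a sum of $N - K + 1$ window terms scaled by $1/\sqrt{M(N-K)}$, and it realizes the quenched numbers with a lazily filled lookup table. The class and method names are hypothetical.

```python
import math
import random

class SubdomainEnergy:
    """Toy NK-style subdomain energy with quenched unit-normal parameters."""

    def __init__(self, n=10, k=4, m=10, seed=0):
        self.n, self.k, self.m = n, k, m
        self._rng = random.Random(seed)
        self._table = {}  # one value per distinct (type, window), drawn once

    def _sigma(self, sd_type, window):
        key = (sd_type, window)
        if key not in self._table:
            # Quenched: drawn on first use, then fixed for the whole run.
            self._table[key] = self._rng.gauss(0.0, 1.0)
        return self._table[key]

    def energy(self, sd_type, residues):
        n_windows = self.n - self.k + 1
        total = sum(
            self._sigma(sd_type, tuple(residues[j:j + self.k]))
            for j in range(n_windows)
        )
        # Assumed normalization; see the lead-in above.
        return total / math.sqrt(self.m * (self.n - self.k))

model = SubdomainEnergy()
helix = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]  # residues coded as integers 0..19
e1 = model.energy(0, helix)
e2 = model.energy(0, helix)
print(e1 == e2)
```

Because the table is filled once and then reused, evaluating the same subdomain twice returns the same energy, which is what makes the landscape fixed rather than noisy.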

We considered D = 6 different interactions between secondary structures, and the energy of interaction between secondary structures is given by


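$$U^{sd\text{–}sd}_{\alpha\gamma} = \sqrt{\frac{2}{DM(M-1)}} \sum_{i=1}^{D} \sigma^{(i)}_{\alpha\gamma}\left(a^{\alpha}_{j_1}, \ldots, a^{\alpha}_{j_{K/2}}; a^{\gamma}_{j_{K/2+1}}, \ldots, a^{\gamma}_{j_K}\right) \qquad (3)$$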

Here, the unit-normal number $\sigma^{(i)}_{\alpha\gamma}$ and the set of $K$ interacting amino acids $j_1, \ldots, j_K$ are selected at random for each interaction $(i, \alpha, \gamma)$.

We assume that P = 5 amino acids contribute to the binding interaction with the substrate, such that the chemical binding energy of each amino acid is given by


$$U^{c}_{i} = \frac{1}{\sqrt{P}}\, \sigma_{i}(a_i) \qquad (4)$$

The contributing amino acid $i$ and the unit-normal number $\sigma_i$ are chosen at random.

We considered five chemically different groups of amino acids (Tan et al., 2004): neutral and polar plus cysteine, negative and polar, positive and polar, nonpolar without ring, and nonpolar with ring. Every amino acid in the full 20 amino acid alphabet belongs to exactly one group. We consider eight amino acids in the neutral and polar plus cysteine group, two in the negative and polar group, three in the positive and polar group, four in the nonpolar without ring group and three in the nonpolar with ring group. To distinguish between conservative (within the same group) and nonconservative (between different groups) mutations, the energy parameter $\sigma$ involved in the interaction terms of our model is defined as a quenched Gaussian random number, and is a function both of the amino acid itself and of its chemical class (Park and Deem, 2004). More precisely, we set the random parameter $\sigma$ for amino acid $i$ belonging to group $j$ as $\sigma = w_j + w_i/2$, where each $w$ is a Gaussian random number with zero mean and unit variance. Therefore, a marked variability is defined among interaction parameters of amino acids belonging to different groups, whereas smaller variations represent individual differences among amino acids within the same chemical class.
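The group-plus-residue construction of $\sigma$ can be illustrated with a short sketch. It draws a single set of 20 values for one interaction term; in the model a fresh set would be needed for every distinct argument of $\sigma$. The function name and the ordering of residues within groups are assumptions made for illustration.

```python
import random

# Group sizes as stated in the text; they sum to 20.
GROUP_SIZES = [8, 2, 3, 4, 3]

def draw_sigmas(rng: random.Random) -> list[float]:
    """Draw one sigma per amino acid as w_group + w_residue / 2."""
    sigmas = []
    for size in GROUP_SIZES:
        w_group = rng.gauss(0.0, 1.0)        # shared by the whole group
        for _ in range(size):
            w_residue = rng.gauss(0.0, 1.0)  # individual correction
            sigmas.append(w_group + w_residue / 2.0)
    return sigmas

sigmas = draw_sigmas(random.Random(0))
print(len(sigmas))
```

With this weighting, two residues in the same group differ only through the halved individual term, so a conservative mutation changes the energy less, on average, than a nonconservative one.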

Directed evolution simulations

Our simulations represent the evolutionary dynamics of a population of virtual proteins, constituted by a constant number of 1000 sequences. Each protein sequence consists of $M = 10$ secondary structures, of length $N = 10$ amino acids each. With these parameters, we represent the typical length of the variable region of human antibodies (Zemlin et al., 2003a).

As a starting point, we generate five different subdomain pools, constituted by 250 partially optimized secondary structures each. The degree of optimization of these secondary structures is controlled by a Monte Carlo algorithm that minimizes their energy, as defined by Eq. (1). The role of this first stage in our simulations is to provide physically reasonable secondary structures from which to start the evolutionary dynamics. The initial population of 1000 protein sequences is made from combinations of partially optimized secondary structures, chosen at random from the subdomain pools. The greater the number of Monte Carlo steps used in the creation of these pools, the better designed the fragments of protein structure encoded by these sequences will be.

The evolutionary dynamics, as schematically depicted in Fig. 1, includes two types of moves. The large moves correspond to subdomain ‘swappings’, or exchanges, between the protein sequences in the evolving population and the subdomain pools. With probability $p_{\mathrm{swap}} = 0.001$ per sequence, a secondary structure in a given protein among the population is chosen at random, and replaced by another one from the pool. The short-range moves correspond to single point mutations. We set the number $n_{\mathrm{mut}}$ of point mutations per sequence as a Poisson-distributed random variable, with unit average $\langle n_{\mathrm{mut}} \rangle = 1$. These swapping and mutation rates have been characterized as optimal for the generalized NK model (Bogarad and Deem, 1999).


Fig. 1. A schematic representation of the evolutionary algorithm implemented in our simulations is presented. A Monte Carlo algorithm allows us to design five partially optimized, low-energy subdomain pools. An initial population of 1000 protein sequences is assembled from random combinations of subdomains. Evolutionary moves are performed by subdomain swappings with the low energy pools, point mutations and screening to select the 10% lowest energy sequences in the population. After amplification to restore the population of 1000 sequences, the process is repeated for several rounds.

 
After performing subdomain swapping and point mutations, we simulated a screening process, by selecting the 10% lowest energy sequences among the population. The chosen 100 protein sequences were amplified back to 1000, to restore the original size of the population. The entire procedure of subdomain swapping, point mutations, screening and amplification can be repeated for an arbitrary number $N_{\mathrm{gen}}$ of generations, to mimic the experimental protocols of directed evolution methods.
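One round of swapping, mutation, screening and amplification can be sketched as a simple loop. This is a toy illustration under stated assumptions: a single subdomain pool stands in for the five type-specific pools, the energy is a placeholder (the sum of residue codes) rather than the generalized NK energy, and the demo uses a population of 200 instead of 1000. All function names are hypothetical.

```python
import math
import random

def poisson(rng, mean):
    """Sample a Poisson variate with Knuth's multiplication method."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def evolve(population, pool, energy, q, rng, n_gen,
           n_sub=10, sub_len=10, p_swap=0.001, keep=0.1):
    size = len(population)
    for _ in range(n_gen):
        mutated = []
        for seq in population:
            seq = list(seq)
            # Large move: replace one subdomain with a pool member.
            if rng.random() < p_swap:
                s = rng.randrange(n_sub)
                seq[s * sub_len:(s + 1) * sub_len] = rng.choice(pool)
            # Small moves: Poisson-distributed number of point mutations.
            for _ in range(poisson(rng, 1.0)):
                seq[rng.randrange(len(seq))] = rng.randrange(q)
            mutated.append(tuple(seq))
        # Screening: keep the lowest-energy fraction.
        mutated.sort(key=energy)
        survivors = mutated[:max(1, int(keep * size))]
        # Amplification: copy survivors back up to the original size.
        population = [survivors[i % len(survivors)] for i in range(size)]
    return population

def mean_energy(population, energy):
    return sum(energy(s) for s in population) / len(population)

rng = random.Random(1)
Q, N_SUB, SUB_LEN = 20, 10, 10
toy_energy = sum  # placeholder only, NOT the generalized NK energy
pool = [tuple(rng.randrange(Q) for _ in range(SUB_LEN)) for _ in range(50)]
start = [tuple(rng.randrange(Q) for _ in range(N_SUB * SUB_LEN))
         for _ in range(200)]
final = evolve(start, pool, toy_energy, Q, rng, n_gen=20)
print(mean_energy(start, toy_energy), mean_energy(final, toy_energy))
```

Swapping the placeholder for a rugged, quenched energy function is what turns this loop into a meaningful comparison between alphabet sizes; with a smooth toy energy the population simply slides downhill.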

By applying the previous algorithm, we evolved in parallel three different populations of proteins, constituted by sequences made from amino acid alphabets of three different sizes: $Q = 2$, 5 or 20. For the reduced amino acid alphabets, we chose the ‘best’ subset of $Q$ amino acids among the entire 20 amino acid alphabet. To define ‘best’, we tested every combination of $Q$ amino acids, each one from a different chemical group, and selected the one that yielded the lowest average energy after 100 rounds of evolution. As we require that each of the ‘best’ amino acids is from a distinct group, this procedure requires trying 8 × 2 × 3 × 4 × 3 = 576 different combinations for $Q = 5$ and 8 × (2 + 3 + 4 + 3) + 2 × (3 + 4 + 3) + 3 × (4 + 3) + 4 × 3 = 149 combinations for $Q = 2$. In the experiments with library size of two or four amino acids (Fellouse et al., 2004, 2005, 2006), the amino acids were taken from distinct groups. Moreover, choosing the amino acids from distinct groups is a natural approach to designing a reduced-size library.
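The two combination counts follow directly from the group sizes and can be reproduced by enumerating which groups are used. The helper name below is illustrative.

```python
from itertools import combinations
from math import prod

# Number of amino acids in each of the five chemical groups.
GROUP_SIZES = [8, 2, 3, 4, 3]

def count_alphabets(q: int) -> int:
    """Count alphabets of q amino acids taken from q distinct groups."""
    # For each choice of q groups, multiply the number of options per group.
    return sum(prod(chosen) for chosen in combinations(GROUP_SIZES, q))

print(count_alphabets(5))  # 576
print(count_alphabets(2))  # 149
```

The same function gives the number of candidate alphabets for any intermediate size, for example `count_alphabets(4)` for a tetrameric code with one residue per group.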

Ranked usage frequency distributions

We processed the data on 4751 sequences corresponding to CDR-H3 loops of human antibodies, as reported by Zemlin et al. (2003a, 2003b). Considering groups of sequences of identical length, we calculated the relative abundance of each of the 20 amino acids in each sequence, and we ranked them in ascending order (1 being the least frequent and 20 the most frequent). By averaging these ranked usage frequencies over each set of sequences of identical length in the database (Zemlin et al., 2003b), we obtained the observed order statistics represented as a histogram in Figs. 2–4. The examples shown correspond to lengths of 8, 14 and 18, represented in the database (Zemlin et al., 2003b) by groups of 74, 534 and 303 sequences, respectively. Sequences of length 14 were the most abundant in the database (Zemlin et al., 2003b).
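The ranking procedure described above can be sketched as follows: each sequence's 20 usage frequencies are sorted in ascending order, and the sorted vectors are then averaged rank by rank. The two demo strings are made up for illustration and are not taken from the database; the function name is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def ranked_usage(sequences):
    """Average the ascending-sorted per-sequence usage frequencies."""
    totals = [0.0] * len(AMINO_ACIDS)
    for seq in sequences:
        counts = Counter(seq)
        # Rank 1 is the least used amino acid, rank 20 the most used.
        freqs = sorted(counts.get(a, 0) / len(seq) for a in AMINO_ACIDS)
        for rank, f in enumerate(freqs):
            totals[rank] += f
    return [t / len(sequences) for t in totals]

demo = ["ARDYYGSS", "AKGGYFDY"]  # invented 8-residue strings, not real data
profile = ranked_usage(demo)
print(profile[-3:])
```

Note that the ranking is done per sequence before averaging, so rank 20 does not correspond to any particular amino acid; it is whichever residue happens to be most used in each sequence.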


Fig. 2. The ranked amino acid usage frequency distributions, as obtained from our generalized NK model, are compared with the observed distribution in the human CDR-H3 loop (Zemlin et al., 2003b), for sequences of eight residues in length. In the second case, the ranked histogram was obtained from the 74 sequences of length 8 reported by Zemlin et al. (2003b). Also displayed (dashed line) is the distribution arising from the generation of completely random sequences of the same length.

 

Fig. 3. The ranked amino acid usage frequency distributions, as obtained from our generalized NK model, are compared with the observed distribution in the human CDR-H3 loop (Zemlin et al., 2003b), for sequences of 14 residues in length. In the second case, the ranked histogram was obtained from the 534 sequences of length 14, the most frequent length in the data reported by Zemlin et al. (2003b). Also displayed (dashed line) is the distribution arising from the generation of completely random sequences of the same length.

 

Fig. 4. The ranked amino acid usage frequency distributions, as obtained from our generalized NK model, are compared with the observed distribution in the human CDR-H3 loop (Zemlin et al., 2003b), for sequences of 18 residues in length. In the second case, the ranked histogram was obtained from the 303 sequences of length 18 reported by Zemlin et al. (2003b). Also displayed (dashed line) is the distribution arising from the generation of completely random sequences of the same length.

 

    Results
Directed evolution experiments at different sizes of the amino acid alphabet

We represent the evolutionary dynamics of a population of 1000 proteins. To represent the variable region of human antibodies (Zemlin et al., 2003a), we consider the sequence to be composed of 10 secondary structures, with a total length of 100 amino acids. These proteins undergo point mutation, swapping of subdomains from pools of optimized structures and selection (see Methods). The experiment is followed for 100 rounds of this procedure.

A key parameter in the protein evolution protocol is how optimized the secondary structural pieces are. Typically, we might imagine that these would be quite optimal, as a result of evolution and natural selection. Indeed, the fact that secondary structures such as alpha helices and beta strands are ubiquitous in nature and across different species suggests that in the picture of the sequence-space energy landscape provided by our theory, those structures should be represented by deep and stable energy minima. Moreover, a high degree of optimization of the secondary structural pieces in our simulations seems to accurately reproduce the amino acid usage frequency distributions observed in natural human antibodies (see Amino acid usage frequency distribution: a bridge between theory and experiment).

The other parameter we investigate is the size of the library, i.e. how many different amino acids are included in the makeup of the library. Figure 5 shows the average evolved energies among our population of 1000 protein sequences, displayed as a function of the number of rounds of subdomain swappings and point mutations, screening and amplification. The subdomain pools for this figure are such that they are optimal for the alphabet size of 2 or 5, and suboptimal for alphabet size of 20 (Bogarad and Deem, 1999). We see that under these realistic conditions, proteins of lower energy are produced through the use of a larger alphabet. That is, it is better to search a large library space sparsely than a smaller library space thoroughly.


Fig. 5. The evolved energies for the three systems (Q = 2, 5 and 20) are compared. The subdomain pools were designed with 10 000 Monte Carlo steps. These starting sequences are relatively poorly designed. The curves corresponding to the systems with 5 and 20 amino acid alphabets cross after approximately the first 30 generations, comparable with the 10 rounds of a typical short protein evolution experiment.

 
If, for some reason, the initial subdomain pools cannot be designed as optimally, i.e. the initial sequences are not particularly protein-like, then the evolved energies may be different. In Fig. 6, we show results for subdomain pools that are suboptimal for all alphabet sizes. We see that at a low number of rounds (<30), searching the five-alphabet library more thoroughly than the 20-alphabet library leads to proteins with better energies. Searching the two-alphabet library thoroughly, however, still leads to proteins with worse energies.


Fig. 6. The evolved energies for the three systems (Q = 2, 5 and 20) are compared. The subdomain pools were designed with 100 000 Monte Carlo steps. The larger alphabet systems display the lowest energies.

 
The library produced with an alphabet size of 2 is thoroughly searched in all these examples, no matter what the optimality of the initial subdomain pools. The libraries with larger alphabet sizes, however, are only partially searched in the size of experiments presented here. While less thoroughly searched, these libraries produce proteins with lower energies. Only for initial pools of random, non-protein-like sequence do we see proteins of lower energy found in the five-alphabet library than in the 20-alphabet library.

Amino acid usage frequency distribution: a bridge between theory and experiment

We compare our results to experiment by calculating the frequencies of amino acid usage. Shown in Figs. 2–4 are the amino acid usage frequencies found in the CDR-H3 region of human antibodies versus those found in our simulations. We calculated the relative abundance of each of the 20 amino acids in the sequences from the CDR-H3 loop region of human antibodies, as reported in Zemlin et al. (2003a, 2003b). We present the usage in ranked order from the least (1) to the most (20) frequently used amino acid.

We calculated the same quantity from the proteins in our simulation. The experimental histograms are compared with the ranked frequency distribution as predicted by our simulations, when the whole 20 amino acid alphabet is used to generate and evolve the proteins for 100 rounds. In this case, within each protein sequence of length 100, we chose a smaller subsequence of a fixed length (8, 14 or 18), and obtained the corresponding order statistics. The results of this analysis are displayed as the blue dashed line in Figs. 2–4.

We compare how the distributions of amino acid usage in human CDR-H3 and in the model differ from a purely random distribution of amino acids. For this purpose, we performed an independent numerical experiment by generating completely random sequences of the specified length, where each digit in the sequence is a random variable $\Psi_i$, $1 \le i \le L$, which may take any value in the set $\{1, 2, \ldots, 20\}$, with probabilities


$$P(\Psi_i = k) = \frac{1}{20}, \qquad k = 1, 2, \ldots, 20 \qquad (5)$$

The resulting order statistics for this random process are displayed as the red line in Figs. 2–4.
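The random reference curve can be reproduced, at least qualitatively, by a short simulation: draw many sequences with every letter equally likely, sort each sequence's usage frequencies and average rank by rank. The number of random sequences (5000) and the function name are arbitrary choices for this sketch.

```python
import random
from collections import Counter

def random_ranked_usage(length, n_seq, rng, q=20):
    """Ranked usage profile of uniformly random sequences."""
    totals = [0.0] * q
    for _ in range(n_seq):
        counts = Counter(rng.randrange(q) for _ in range(length))
        freqs = sorted(counts.get(a, 0) / length for a in range(q))
        for rank, f in enumerate(freqs):
            totals[rank] += f
    return [t / n_seq for t in totals]

rng = random.Random(0)
random_profile = random_ranked_usage(length=8, n_seq=5000, rng=rng)
print([round(f, 3) for f in random_profile])
```

Even though each letter is equally likely, the ranked profile is far from flat: a sequence of eight residues can use at most eight distinct letters, so at least the twelve lowest ranks are always zero. This is why the random curve, rather than a uniform distribution, is the appropriate baseline for short sequences.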

Comparing Figs. 2–4, which correspond to sequences of length 8, 14 and 18, respectively, shows a remarkable agreement, in all three cases, between the experimental data (Zemlin et al., 2003a, 2003b) and either the random process (red dashed curve) or the generalized NK model (blue dashed curve). Indeed, the trend observed is that for sequences shorter than 10 amino acids, the experimental order statistics are closer to the random process, whereas for longer sequences, they closely approach the generalized NK model results.

Shannon entropy and information: a quantitative measure for protein design

Within the context of information theory (Shannon, 1948), the Shannon entropy is defined as a quantitative measure of the degree of disorder or randomness in a system. Applied in the context of the present study, if $Q$ is the alphabet size and $p_i$ are the amino acid usage frequencies, the corresponding formula for the entropy is (Shannon, 1948):


$$S_Q = -\sum_{i=1}^{Q} p_i \log p_i \qquad (6)$$

A concept related to the Shannon entropy is information (Shannon, 1948; Layzer, 1990), defined as the difference between the maximum possible entropy for a particular system and the value of its entropy at its actual configuration

$$I_Q = S_Q^{\max} - S_Q \qquad (7)$$

According to this definition, the information $I_Q$ contained in a given population of evolved sequences is a direct measure of the strength of the correlations established by the evolutionary process itself. We apply this concept to compare the degree of design imposed by our theoretical model on the simulated population of virtual protein sequences with the degree of design achieved by the human immune system in the variable CDR-H3 region (Zemlin et al., 2003a, 2003b). As explained in the preceding section, such a comparison can be done by considering the ranked usage distribution of amino acids in both cases, as presented in Figs. 2–4 for different sequence lengths, and calculating the corresponding information values.

To calculate the information according to Eq. (7), we considered, as a reference system for the maximum entropy, the ranked usage distribution obtained from the totally random process. The results of this numerical analysis are displayed in Table I, where the superscript GNK refers to the simulation according to the generalized NK model, whereas CDR-H3 refers to the data from Zemlin et al. (2003b).
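The entropy and information calculation can be sketched in a few lines. The sketch assumes natural logarithms (the base only sets the unit) and takes the reference entropy from a supplied reference distribution, as described above. The example distributions are invented for illustration and are not the values behind Table I.

```python
import math

def shannon_entropy(freqs):
    """Entropy -sum(p * ln p), skipping zero-frequency entries."""
    return -sum(p * math.log(p) for p in freqs if p > 0)

def information(freqs, reference):
    """Entropy of the reference distribution minus entropy of freqs."""
    return shannon_entropy(reference) - shannon_entropy(freqs)

uniform = [1 / 20] * 20
# Invented example: four amino acids carry most of the usage.
peaked = [0.01] * 16 + [0.21] * 4

print(shannon_entropy(uniform))      # equals ln(20)
print(information(peaked, uniform))  # positive: less random than uniform
```

Dividing the information by the reference entropy gives the fractional reduction in randomness, which is the form in which such values are most easily compared across sequence lengths.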


Table I. Entropy and information values

 

    Discussion
We simulated directed evolution experiments, for a finite population of protein sequences, at three different sizes of the amino acid alphabet: 20, 5 and 2. For the smaller alphabet systems, we chose the restricted alphabet as the best set of two and five amino acids, respectively, each of them belonging to a different chemical group, based on an energy minimization criterion. By choosing the subset of amino acids from different chemical groups, we imitate the criteria employed in the experimental protocols of directed evolution (Fellouse et al., 2004, 2005). This also minimizes the diversity in the reduced library to be searched thoroughly.

Our simulations represent the average evolved energies as a function of the number of generations of mutation and swapping, screening and amplification performed over the three different alphabet sizes. We compared the evolved energies achieved by the three systems, and also the effect of the degree of initial optimization in the secondary subdomain pools. Our results indicate that in the long term, the use of a complete 20 amino acid alphabet to build the protein sequences leads to the lowest evolved energies, and therefore higher binding constants when engineering antibodies by directed evolution techniques. However, if the amount of time and experimental resources is limited to a small number of rounds of screening, it may be advantageous to employ a restricted alphabet chosen from the optimal combination of 5 amino acids among the 20, with each amino acid chosen from a different chemical group. This result, however, is found only for initial subdomain pool designs that are severely suboptimal. We note that, according to our results, the use of a binary alphabet is never a good choice, as compared with larger alphabets. In general, we find that a larger available sequence space will yield, after a long enough search, lower average energy minima.

These theoretical results seem to be in agreement with the experimental data reported in Fellouse et al. (2004, 2005, 2006). These experiments found superior binding constants, i.e. lower dissociation constants, with a tetrameric alphabet than with a binary alphabet when searching combinatorial libraries using the phage display method. In particular, the minimal dissociation constants obtained were $K_d = 1.6 \pm 0.4$ nM for $Q = 4$ and $K_d = 60 \pm 20$ nM for $Q = 2$, which are, in turn, orders of magnitude higher than the dissociation constants obtained through yeast display methods ($K_d = 48$ fM) which employ the entire $Q = 20$ amino acid alphabet (Boder et al., 2000).

We propose that the generalized NK model, whose conceptual basis arises from the statistical mechanics of disordered systems, is representative of the sequence level correlations in the highly variable regions of antibodies, which are relevant during natural and in vitro evolutionary processes. This hypothesis is supported by a Shannon entropy and information analysis of the ranked usage frequency distributions, by comparing the values generated by our simulations with data reported in the literature for the human CDR-H3 loop (Zemlin et al., 2003b). It is remarkable that the distributions obtained in both cases are very close to the corresponding ranked distribution for a system of completely random sequences, which we obtained from a numerical random process. The shorter sequences (eight amino acids) are random, reflecting the essentially random requirements to make subdomain structure. The longer sequences (18 amino acids) display the correlations induced by the subdomains and captured by the generalized NK model.

In accordance with the graphical analysis presented in Figs. 2–4, the data in Table I show that, for all three cases, the simulated sequences contain a higher amount of information than protein sequences of the human CDR-H3 loop (Zemlin et al., 2003b). Equivalently, we may say that our simulations introduce a slightly higher degree of design over the evolved population of virtual sequences than is observed in the hypervariable regions of human antibodies. However, it is also important to notice that, in accordance with the qualitative graphical analysis of Figs. 2–4, the information in the simulated sequences systematically approaches the experimental data as the length increases, with <6% discrepancy for sequences of 18 residues. To understand this behavior, one may first notice that at the sequence level, proteins must possess a minimal correlation length. It is natural to assume that this is physically imposed by the size of the smallest secondary structures. In our model, those are represented by the energy-optimized structural subdomains, whose length is 10 amino acids. Based on this argument, for protein sequences shorter than the minimal correlation length, it is expected that the amino acid distribution is random, whereas for longer sequences correlations exist and are captured by the generalized NK model. Following this argument, the minimal correlation length in human antibody sequences seems to be close to 10 residues, approximately the same as proposed in our generalized NK model. Finally, the information values corresponding to the three different sequence lengths, as obtained from our simulations, constitute <10% of the corresponding maximal entropy. This suggests that the evolutionary dynamics, as performed in our model and in the human immune system, introduces a moderate degree of design in proteins, according to the conceptual picture of ‘edited’ random sequences (Ptitsyn and Volkenstein, 1986).
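The entropy and information comparison discussed above can be sketched as follows. This is a minimal illustration, assuming that the information is taken as the difference between the maximal entropy, log2 Q, and the Shannon entropy of the amino acid usage frequencies; the exact estimator used for Table I may differ. The function names and the toy sequences are hypothetical and are not data from our simulations or from the CDR-H3 data set.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the Q = 20 alphabet


def ranked_frequencies(sequences, alphabet=AMINO_ACIDS):
    """Usage frequency of each letter of the alphabet, sorted from most to least used."""
    counts = Counter(ch for seq in sequences for ch in seq)
    total = sum(counts[a] for a in alphabet)  # letters outside the alphabet are ignored
    return sorted((counts[a] / total for a in alphabet), reverse=True)


def shannon_entropy(freqs):
    """Shannon entropy in bits; terms with zero frequency contribute nothing."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)


def information(freqs, alphabet_size):
    """Assumed definition: maximal entropy log2(Q) minus the observed entropy."""
    return math.log2(alphabet_size) - shannon_entropy(freqs)


if __name__ == "__main__":
    # Toy sequences for illustration only.
    toy = ["ARDYYGSS", "ARGGYSSW", "AKDLGYFD"]
    freqs = ranked_frequencies(toy)
    s_max = math.log2(len(AMINO_ACIDS))
    info = information(freqs, len(AMINO_ACIDS))
    print(f"S = {shannon_entropy(freqs):.3f} bits, S_max = {s_max:.3f} bits")
    print(f"I = {info:.3f} bits ({100 * info / s_max:.1f}% of S_max)")
```

Under this definition, a perfectly uniform usage of the 20 amino acids gives zero information, so a small ratio of information to maximal entropy corresponds to sequences that are close to random.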

In summary, it seems that construction and exploration of a larger library, even if not searched thoroughly, is the most effective strategy for efficient protein evolution. Interestingly, this result suggests a general motivation for the search to discover and incorporate unnatural amino acids in the library (van Hest et al., 2000; Anderson et al., 2004; Tian et al., 2004).


    Funding
 
This work was partially supported by DARPA and by the Keck Center Nanobiology Training Program of the Gulf Coast Consortia (N.I.H. R90DK071504-01).


    Footnotes
 
Edited by David Thirumalai


    Acknowledgments
 
It is a pleasure to acknowledge stimulating discussions with K. Dane Wittrup.


    References
 
Anderson J.C., Wu N., Santoro S.W., Lakshman V., King D.S., Schultz P.G. Proc. Natl Acad. Sci. USA (2004) 101:7566–7571.

Barbas C.F. III, Kang A.S., Lerner R.A., Benkovic S.J. Proc. Natl Acad. Sci. USA (1991) 88:7978–7982.

Boder E.T., Midelfort K.S., Wittrup K.D. Proc. Natl Acad. Sci. USA (2000) 97:10701–10705.

Bogarad L.D., Deem M.W. Proc. Natl Acad. Sci. USA (1999) 96:2591–2595.

Chemla Y.R., Grossman H.L., Poon Y., McDermott R., Stevens R., Alper M.D., Clarke J. Proc. Natl Acad. Sci. USA (2000) 97:14268–14272.

Collis A.V.J., Brouwer A.P., Martin A.C.R. J. Mol. Biol. (2003) 325:337–354.

Deem M.W., Lee H.Y. Phys. Rev. Lett. (2003) 91:068101.

Earl D.J., Deem M.W. Proc. Natl Acad. Sci. USA (2004) 101:11531–11536.

Fellouse F.A., Wiesmann C., Sidhu S.S. Proc. Natl Acad. Sci. USA (2004) 101:12467–12472.

Fellouse F.A., Li B., Compaan D.M., Peden A.A., Hymowitz S.G., Sidhu S.S. J. Mol. Biol. (2005) 348:1153–1162.

Fellouse F.A., Barthelemy P.A., Kelley R.F., Sidhu S.S. J. Mol. Biol. (2006) 357:100–114.

Gold L. Proc. Natl Acad. Sci. USA (2001) 98:4825–4826.

Grossman H.L., Myers W.R., Vreeland V.J., Bruehl R., Alper M.D., Bertozzi C.R., Clarke J. Proc. Natl Acad. Sci. USA (2004) 101:129–134.

Gupta V., Earl D.J., Deem M.W. Vaccine (2006) 24:3881–3888.

Hanes J., Plückthun A. Proc. Natl Acad. Sci. USA (1997) 94:4937–4942.

Hoogenboom H.R. Nat. Biotechnol. (2005) 23:1105–1116.

Jung S.T., Jeong K.J., Iverson B.L., Georgiou G. Biotechnol. Bioeng. (2007) 98:39–47.

Kabat E.A., Wu T.T., Bilofsky H. J. Biol. Chem. (1977) 252:6609–6616.

Kawarasaki Y., Griswold K., Stevenson J.D., Selzer T., Benkovic S.J., Iverson B.L., Georgiou G. Nucleic Acids Res. (2003) 31.

Kozel T.R., Murphy W.J., Brandt S., Blazar B.R., Lovchik J.A., Thorkildson P., Percival A., Lyons C.R. Proc. Natl Acad. Sci. USA (2004) 101:5042–5047.

Layzer D. Cosmogenesis: the Growth of Order in the Universe (1990) New York: Oxford University Press.

Link A.J., Jeong J.K., Georgiou G. Nat. Rev. Microbiol. (2007) 5:680–688.

Marshall A., DeFrancesco L., Aschheim K., Taroncher-Oldenburg G., Francisco M., Hare P., Louët S., Theunissen J.W. Nat. Biotechnol. (2005) 23:1025.

Park J.M., Deem M.W. Phys. A (2004) 341:455–470.

Ptitsyn O.B., Volkenstein M.V. J. Biomol. Struct. Dynamics (1986) 4:137–156.

Saleh O.A., Sohn L.L. Proc. Natl Acad. Sci. USA (2003) 100:820–824.

Schweitzer B., Wiltshire S., Lambert J., O'Malley S., Kukanskis K., Zhu Z., Kingsmore S.F., Lizardi P.M., Ward D.C. Proc. Natl Acad. Sci. USA (2000) 97:10113–10119.

Shakhnovich E. Chem. Rev. (2006) 106:1559–1588.

Shannon C.E. Bell Syst. Tech. J. (1948) 27:623–656.

Stahl S., Uhlén M. Trends Biotechnol. (1997) 15:185–192.

Tan T., Bogarad L.D., Deem M.W. J. Mol. Evol. (2004) 59:385–399.

Tian F., Tsao M.L., Schultz P.G. J. Am. Chem. Soc. (2004) 126:15962–15963.

van Hest J.C.M., Kiick K.L., Tirrell D.A. J. Am. Chem. Soc. (2000) 122:1282–1288.

Wilson D.S., Keefe A.D., Szostak J.W. Proc. Natl Acad. Sci. USA (2001) 98:3750–3755.

Winter G., Griffiths A.D., Hawkins R.E., Hoogenboom H.R. Ann. Rev. Immunol. (1994) 12:433–455.

Yang C.Y., Brooks E., Li Y., Denny P., Ho C.M., Qi F., Shi W., Wolinsky L., Wu B., Wong D.T.W., Montemagno C.D. Lab Chip (2005) 5:1017–1023.

Zemlin M., Klinger M., Link J., Zemlin C., Bauer K., Engler J.A., Schroeder H.W.J., Kirkham P.M. J. Mol. Biol. (2003a) 334:733–749.

Zemlin M., Klinger M., Link J., Zemlin C., Bauer K., Engler J.A., Schroeder H.W.J., Kirkham P.M. J. Mol. Biol. (2003b) 334:733–749. Data available at doi: 10.1016/j.jmb.2003.10.007.

Zhou H., Deem M.W. Vaccine (2006) 24:2451–2459.

Received January 25, 2008; revised January 25, 2008; accepted February 11, 2008.

