PEDS Advance Access originally published online on January 10, 2006
Protein Engineering Design and Selection 2006 19(2):67-75; doi:10.1093/protein/gzj002
© The Author (2006). Published by Oxford University Press. All rights reserved.
An empirical approach for detecting nucleotide-binding sites on proteins
1Division of Biological Science, Graduate School of Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8603, Japan
2Ochanomizu University, 211 Otsuka, Bunkyo-ku, Tokyo 112-8601, Japan
3Department of Bioscience, Nagahama Institute of Bioscience and Technology, 1266 Tamura, Nagahama 526-0829, Japan
4Department of Computational Biology, Biomolecular Engineering Research Institute, 623 Furuedai, Suita, Osaka 565-0874, Japan
5Japan Science and Technology Corporation-Bioinformatics Research and Development
6 To whom correspondence should be addressed at Nagahama Institute of Bioscience and Technology. E-mail: t_shirai{at}nagahama-i-bio.ac.jp
Abstract
Protein structure data in the PDB (Protein Data Bank) were used to construct empirical scores of nucleotide-protein interactions. A simple strategy to evaluate the spatial distribution of protein atoms around the base moieties of nucleotides was applied to categorize adenine, guanine, nicotinamide and flavin nucleotide-binding sites. In addition to the known nucleotide-binding motifs, the empirical scores detected several other features that were shared among proteins with different folds. The empirical scores were also used to predict the binding sites on protein molecules and a comprehensive test of the prediction system was performed. As a result, adenine, guanine, nicotinamide and flavin sites were detected with efficiencies of 31, 29, 32 and 40%, respectively. The predictions were judged to be successful if the predicted base with the best score was located within a 3.0 Å r.m.s.d. from the known ligand positions.
Introduction
The number of known protein structures has increased rapidly owing to the development of structural genomics projects (Berman et al., 2000).
Nucleotides are the most important ligands among the molecules that interact with proteins. Many proteins use nucleotides as an energy source and as signaling molecules in cells (adenosine/guanine nucleotides). The flavin and nicotinamide nucleotides work as electron donors/acceptors in a variety of redox reactions. Finding and predicting nucleotide-binding motifs in protein structures are important subjects in protein interaction studies.
Early searches for the conserved features in nucleotide-binding sites were performed with adenine nucleotide-binding proteins. Comparative structural analyses of adenylate-protein complexes have revealed the large variety of protein families that bind nucleotides (Schulz, 1992). The analyses led to the identification of amino acid sequence motifs, such as the Walker A and B motifs (Walker et al., 1982) and the P-loop (Saraste et al., 1990), which bind the phosphate groups of nucleotides. Later, an extensive structural comparison of these motifs revealed a conserved loop structure that interacts with the phosphate group through main-chain amide-N atoms (Swindells, 1993; Kinoshita et al., 1999).
As the first recognition motif of a base moiety, the fuzzy recognition motif was proposed, based on an analysis of 18 adenylate-binding proteins (Moodie et al., 1996). Kobayashi and Go identified a four-residue loop structure that formed H-bonds with the N6 and N1 atoms of the base moiety through the main-chain atoms of three different protein fold families: D-Ala:D-Ala ligase, casein kinase-1 and cyclic AMP-dependent protein kinase (Kobayashi and Go, 1997a,b). Later, Johnson's group reported that the three-residue loops that recognize the N6 and N1 atoms of adenine were conserved in 12 unrelated protein families (Denessiouk and Johnson, 2000). More than one-third of the adenine mononucleotide-binding proteins in the PDB possess the Johnson motif (Denessiouk et al., 2001). Mao et al. proposed a mono-residue motif that recognized adenine, based on their analysis of 68 non-redundant protein structures. This motif was shared by five different folds, namely the adenine nucleotide hydrolase-like, carbamate kinase-like, class II aaRS and biotin synthetase, glutamine synthetase/guanido kinase and P-loop-containing nucleotide triphosphate hydrolase folds (Mao et al., 2004).
Adenine and guanine resemble each other in terms of both shape and size. Indeed, guanine is also recognized by the fuzzy recognition motifs (Nobeli et al., 2001). In spite of the similarity between guanine and adenine recognition, Nobeli et al. described the differences in H-bonding patterns, the preference for interacting amino acids and the burial tendency in protein molecules between the two nucleotides (Nobeli et al., 2001).
Nicotinamide and flavin (isoalloxazine) nucleotides accept or donate an electron in redox reactions. The motifs for these bases are not well characterized compared with the two purine nucleotides. The binding sites of nicotinamide and flavin tend to be concomitant with those for other substrates, which makes it difficult to detect structural motifs for these nucleotides (Carugo and Argos, 1997). Conserved structural motifs around the N1 and N5 atoms of flavin moieties were characterized among 13 flavoproteins that catalyze dehydrogenation reactions (Fraaije and Mattevi, 2000). However, a comprehensive survey including other flavin-containing nucleotides, such as RBF (riboflavin) and FMN (flavin mononucleotide), has not yet been reported. Also, no particular binding motif has been proposed for nicotinamide bases (Carugo and Argos, 1997).
The classification of motifs has provided a basis for nucleotide-binding site prediction in proteins. Zhao et al. (2001) used a grid-based method that employed 16 adenine nucleotide-binding proteins. They used the consensus potential landscapes of the base atoms as a recognition template and then used the template for docking simulations of the base groups to the target proteins. They tested the efficiency of this prediction system against 31 adenine dinucleotide-binding proteins. For each of the target proteins, 10 docking simulations were performed. The system detected the known binding sites within a 2.0 Å r.m.s.d. in 129 trials and within a 3.0 Å r.m.s.d. in 230 trials, out of a total of 310 trials. Kuttner et al. (2003) developed an atomic cluster method, which was based on a training set of 14 ATP-protein complexes. Vertex systems were derived from the clusters of protein atoms around the base group and the binding sites were sought by comparing the target protein structure with the vertices. This method was tested on 22 adenine nucleotide-binding sites and it was demonstrated that at least one of the predicted binding sites, which had more than six hits to the vertex system, shared at least one atom with the experimentally determined sites in 20 test cases.
So far, attempts at the categorization and prediction of nucleotide-binding sites have been made on relatively small sets of proteins. To be applicable to the large amounts of data accumulated from structural genomics, more generalized and automated processes are required for motif detection and binding site prediction. In this work, the known protein structures in the PDB were examined to construct an empirical score system (Ishida et al., 2000; Shionyu-Mitsuyama et al., 2003) for the categorization and prediction of adenine, guanine, nicotinamide and flavin binding sites. A total of 386 known nucleotide-binding protein structures, without redundancy, were used for the construction of this empirical score system. The sample structures contained 280, 43, 62 and 66 nucleotide complexes of adenine, guanine, nicotinamide and flavin, respectively. From this newly devised score system, several additional motifs were found for adenine and guanine. Canonical motifs were also proposed for flavin and nicotinamide.
Further, the score system was applied to binding site predictions and tested against most of the proteins used for the score system construction (277, 41, 59 and 63 nucleotide complexes for adenine, guanine, nicotinamide and flavin, respectively). As a result, adenine, guanine, nicotinamide and flavin sites were detected with efficiencies of 31, 29, 32 and 40%, with the predictions judged as being correct when the predicted base with the best score was located within a 3.0 Å r.m.s.d. from the known ligands.
Materials and methods
Data set selection for empirical score construction
A high-resolution and non-redundant set of nucleotide-protein complex structures was prepared to extract empirical scores for adenine, guanine, nicotinamide and flavin nucleotides. First, a list of all of the ligand types in the PDB (as of May 2004) was made. Only the molecules that contain the four base groups, without any modification or deficit in the base atoms, were selected. Some rare stereoisomers were discarded. The numbers of retained molecule types were 195 for adenine, 57 for guanine, 12 for nicotinamide and 7 for flavin. The numbers of complexes containing these molecule types were 4877 for adenine, 714 for guanine, 1470 for nicotinamide and 951 for flavin nucleotides.
The following steps were applied to eliminate redundant or inadequate samples. (1) As the first screening, the structures determined by X-ray analysis with resolutions better than 2.5 Å and those solved by NMR were selected. At least one atom pair between the base and the protein must be <3.5 Å apart and no other type of ligand molecule should be in direct contact (within 6.0 Å) with the base atoms. (2) The subunits that passed the first screen were compared with each other. The subunits that showed more than 25% amino acid identity were clustered and the subunit with the highest resolution of X-ray analysis was selected to be representative of each cluster. The X-ray structures took priority over the NMR structures. (3) The local structures of the binding sites (amino acid residues in direct contact with the base moiety) were compared. The proteins/subunits were superposed by using the base moieties and the structures with r.m.s.d.s of the corresponding Cα atoms <3.0 Å or of the main-chain atoms <2.0 Å were clustered. The nucleotide complex with the largest number of atoms in direct contact with the base moiety was selected as the representative of each cluster. Although these processes were mostly automated by using in-house programs, visual and off-line inspections were also made for the present work.
The final data sets (which we will refer to as the learning sets) for adenine, guanine, nicotinamide and flavin contained 280 complexes (59 ligand types), 43 complexes (18 ligand types), 62 complexes (7 ligand types) and 66 complexes (3 ligand types), respectively (names and PDB codes are listed in the supplementary Table I, available at PEDS Online).
Empirical score construction
The empirical scores for the spatial distribution of the protein atoms around the base moiety were constructed from the learning set. The procedure was essentially the same as that previously employed for peptide and carbohydrate predictions and the programs developed for those predictions were used for data processing (Ishida et al., 2000; Shionyu-Mitsuyama et al., 2003).
First, all of the structures were transferred into a reference coordinate system, which consists of an evenly distributed grid with 1.0 Å intervals, by using three atoms from each base moiety (C5, N7 and C6 for adenine and guanine, NC3, NC2 and NC7 for nicotinamide and C10, N1 and C7 for flavin; see Figure 1 for definition). The first atom was placed at the origin of the reference coordinate system. The bond between the first and second atoms was aligned with the x-axis by placing the second atom on the positive region. The bond from the first to the third atom was laid on the xy plane.
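The transformation into the reference coordinate system can be illustrated with a minimal sketch. It is not the in-house program used in this work; the function names are hypothetical and the choice of placing the third atom on the positive-y side of the xy plane is an assumption, since the text only states that the bond lies on that plane.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    # Raises ZeroDivisionError for a zero vector (e.g. collinear atoms).
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def reference_frame(first, second, third):
    """Frame with `first` at the origin, first->second along +x and
    first->third in the xy plane (assumed here: on the +y side)."""
    ex = unit(sub(second, first))
    ez = unit(cross(ex, sub(third, first)))
    ey = cross(ez, ex)
    return first, ex, ey, ez

def to_reference(point, frame):
    """Express `point` in the reference coordinate system."""
    origin, ex, ey, ez = frame
    d = sub(point, origin)
    return (dot(d, ex), dot(d, ey), dot(d, ez))

# Hypothetical coordinates standing in for, e.g., C5, N7 and C6 of adenine.
frame = reference_frame((1.0, 1.0, 1.0), (3.0, 1.0, 1.0), (1.0, 4.0, 1.0))
```

Any protein atom of the same complex can then be passed through `to_reference` with this frame before it is assigned to a grid point.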
Then, the frequencies of protein atoms on the molecular surface that existed within 12.0 Å from the base atoms were counted. The protein atoms were categorized into 13 target-atom types (Table I). The counts of target atoms were accumulated on the closest grid point in the reference system. Then, the count of the target atom $i$ at position $r$ ($N_{ir}$) was replaced with the average over the 27 grid points centered on $r$. Finally, the count on each grid point was converted into the score as $S_{ir} = (N_{ir} - \langle N_{ir} \rangle)/\langle N_{ir} \rangle$, where $\langle N_{ir} \rangle$ is the average over all grid points and $S_{ir}$ is the score for atom $i$ at position $r$.
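The counting, smoothing and normalization steps for one target-atom type can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the original program: the function names are hypothetical, the grid is taken to be a finite box supplied by the caller, and grid points outside the stored counts are treated as empty.

```python
import math
from itertools import product

def nearest_grid_point(xyz, spacing=1.0):
    # floor(x + 0.5) avoids Python's round-half-to-even behaviour.
    return tuple(math.floor(c / spacing + 0.5) for c in xyz)

def count_atoms(positions, spacing=1.0):
    """Accumulate atom positions (already in the reference frame)
    on their closest grid points."""
    counts = {}
    for xyz in positions:
        g = nearest_grid_point(xyz, spacing)
        counts[g] = counts.get(g, 0) + 1
    return counts

def smooth_27(counts, grid_points):
    """Replace each count by the average over the 3 x 3 x 3 block
    of grid points centered on it."""
    offsets = list(product((-1, 0, 1), repeat=3))
    smoothed = {}
    for g in grid_points:
        total = sum(counts.get((g[0] + dx, g[1] + dy, g[2] + dz), 0)
                    for dx, dy, dz in offsets)
        smoothed[g] = total / 27.0
    return smoothed

def scores(smoothed):
    """Score = (N - <N>) / <N>, with <N> averaged over all grid points."""
    mean = sum(smoothed.values()) / len(smoothed)
    if mean == 0:
        raise ValueError("no atoms counted for this target-atom type")
    return {g: (n - mean) / mean for g, n in smoothed.items()}

# Toy example: one atom near the origin of a 5 x 5 x 5 grid.
grid = list(product(range(-2, 3), repeat=3))
s = scores(smooth_27(count_atoms([(0.2, -0.1, 0.0)]), grid))
```

With this definition an unoccupied region has a score of -1 and a score of 0 corresponds to the average occupancy, so positive peaks mark positions used more often than average.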
Binding site evaluation and prediction system
A prediction system for nucleotide-binding sites was also developed from the empirical score system. This system requires the target protein structure as input and outputs the predicted coordinates of the nucleotides. The system scans the surface of the target protein with the scoring system ($S_{ir}$) for high-score positions. The center of the score system was set on each point of the reference grid system in turn and the score system was rotated around each reference grid point. The score for a binding site was obtained as the summation of $S_{ir}$ over the corresponding target-atom types and over the protein atoms.
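The summation that yields the score of one candidate placement can be sketched as below. The data layout (one dictionary of grid scores per target-atom type) and the function name are assumptions made for illustration, as is the choice to let atoms that fall outside the tabulated grid contribute zero.

```python
import math

def site_score(atoms_in_base_frame, score_tables, spacing=1.0):
    """Sum the tabulated scores over all protein atoms near a placed base.

    atoms_in_base_frame: list of (atom_type, (x, y, z)) with coordinates
        already expressed in the coordinate system of the placed base.
    score_tables: {atom_type: {grid_point: score}} (hypothetical layout).
    """
    total = 0.0
    for atom_type, xyz in atoms_in_base_frame:
        g = tuple(math.floor(c / spacing + 0.5) for c in xyz)
        # Assumption: unknown atom types and off-grid points score zero.
        total += score_tables.get(atom_type, {}).get(g, 0.0)
    return total

# Toy tables with two target-atom types and invented score values.
tables = {
    "main-chain-N": {(1, 0, 0): 2.5},
    "carboxyl-O": {(0, 2, 0): 1.0},
}
atoms = [
    ("main-chain-N", (1.2, -0.3, 0.1)),
    ("carboxyl-O", (0.4, 2.4, 0.0)),
    ("hydroxyl-O", (5.0, 5.0, 5.0)),
]
total = site_score(atoms, tables)
```

Scanning a protein then amounts to evaluating this sum for every translation and rotation of the base and keeping the highest totals.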
The searches were done in two steps. In the first step, the reference grid consisted of points distributed at 1.2 Å intervals (coarse grid). A point was not used if it was less than 2.5 Å or more than 5.0 Å away from the nearest protein atom, in order to search the surface region only. The origin of the score system was moved to every point in the reference grid and the score system was rotated around the point in 30° steps in spherical polar angles. By scanning through all of the combinations of reference points (translation) and rotations, the best 50 positions were selected in the first step. In the second step, the reference grid interval was reduced to 0.8 Å (fine grid). The grid points were distributed in a 3.2 × 3.2 × 3.2 Å box centered at each coarse-grid point selected in the first search. The rotation step in the second search was reduced to 15°.
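The selection of coarse-grid points in the surface region can be sketched as follows. This is a brute-force illustration only: the function name and the bounding-box construction are assumptions, and a practical implementation would use a spatial index rather than comparing every grid point with every atom.

```python
import math

def surface_grid(atoms, spacing=1.2, d_min=2.5, d_max=5.0):
    """Return grid points whose nearest protein atom lies between
    d_min and d_max (in Å), i.e. points in the surface shell."""
    lo = [min(a[i] for a in atoms) - d_max for i in range(3)]
    hi = [max(a[i] for a in atoms) + d_max for i in range(3)]
    n = [int(math.floor((hi[i] - lo[i]) / spacing)) + 1 for i in range(3)]
    points = []
    for ix in range(n[0]):
        for iy in range(n[1]):
            for iz in range(n[2]):
                p = (lo[0] + ix * spacing,
                     lo[1] + iy * spacing,
                     lo[2] + iz * spacing)
                nearest = min(math.dist(p, a) for a in atoms)
                if d_min <= nearest <= d_max:
                    points.append(p)
    return points

# Toy "protein" consisting of a single atom at the origin.
coarse_points = surface_grid([(0.0, 0.0, 0.0)])
```

The same function could serve the second step by calling it with `spacing=0.8` on the atoms near each retained coarse-grid point, although the text restricts the fine grid to a small box rather than the whole surface.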
For a typical target protein with about 300 residues, 4.6 × 10⁸ configurations (combinations of translation and rotation of the base moiety) were generated in a search process. Supplementary Figure S1 shows the number of examined configurations against the size of the target proteins. The detected base positions were clustered if two base moieties showed an r.m.s.d. of <4.0 Å. Then, the representative that showed the highest score in each cluster was selected. The best 100 representatives were output as the prediction results.
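One way to reduce the detected positions to cluster representatives is a greedy pass in order of decreasing score, sketched below. The text does not specify the clustering algorithm, so the greedy scheme, the function names and the pose representation (a score plus a list of base-atom coordinates in a fixed atom order) are all assumptions.

```python
import math

def rmsd(a, b):
    """R.m.s.d. between two poses with identical atom ordering
    (no superposition is applied)."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))

def cluster_representatives(poses, cutoff=4.0, keep=100):
    """Greedy clustering: visit poses from best to worst score and keep a
    pose only if it is at least `cutoff` Å from every pose kept so far."""
    reps = []
    for score, coords in sorted(poses, key=lambda p: p[0], reverse=True):
        if all(rmsd(coords, kept) >= cutoff for _, kept in reps):
            reps.append((score, coords))
        if len(reps) == keep:
            break
    return reps

# Toy poses with two atoms each; the second lies 0.5 Å from the first.
poses = [
    (9.0, [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)]),
    (10.0, [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]),
    (5.0, [(10.0, 0.0, 0.0), (11.0, 0.0, 0.0)]),
]
representatives = cluster_representatives(poses)
```

In this toy case the 9.0 pose is absorbed by the 10.0 pose, leaving two representatives.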
Prediction test
The performance test of the prediction system was executed by using the known structures of nucleotide-protein complexes. The test targets were selected from the learning set, except for the complexes that had fewer than two close protein-ligand atom contacts (<3.5 Å). When a protein was used for the test target, it was excluded from the empirical score construction. If the r.m.s.d. between the predicted position of the base moiety and that of the known (experimental) ligand molecule was <3.0 Å, then the prediction was judged to be successful.
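The success criterion can be written down as a short check. The sketch assumes that predicted and experimental bases are given as coordinate lists with the same atom order; the function names and the `top_k` parameter, which allows more than the best-scoring candidate to be considered, are hypothetical.

```python
import math

def base_rmsd(predicted, known):
    """R.m.s.d. over corresponding base atoms, without superposition."""
    if len(predicted) != len(known):
        raise ValueError("poses must contain the same atoms")
    squared = sum(math.dist(p, k) ** 2 for p, k in zip(predicted, known))
    return math.sqrt(squared / len(known))

def is_successful(ranked_predictions, known, top_k=1, cutoff=3.0):
    """True if any of the top_k ranked predictions lies within
    `cutoff` Å r.m.s.d. of the known ligand position."""
    return any(base_rmsd(p, known) < cutoff
               for p in ranked_predictions[:top_k])

# Toy example: the best-scoring pose is 5 Å away, the second one 1 Å.
known = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ranked = [
    [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
]
best_only = is_successful(ranked, known)          # False
top_two = is_successful(ranked, known, top_k=2)   # True
```

For the leave-one-out protocol described above, the score tables would be rebuilt without the target protein before its ranked predictions are generated.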
The performance of this prediction system was compared with that of Kuttner et al. (2003). They evaluated their prediction results with the overlap between protein atoms of experimentally determined and predicted binding sites. The predictions were judged to be successful if a predicted site had at least one atom overlap with the experimentally determined site. The 20 proteins (22 adenine nucleotide-binding sites), which were used in the previous work, were applied to the prediction system of this work. The prediction results were processed with the same program as the previous work to detect atom overlap (Sobolev et al., 1999).
Results and discussion
Nucleotide-binding motifs from an empirical score system
An empirical score of the interaction between a protein and the base moiety of a nucleotide was extracted from a non-redundant learning set of complex protein structures. The score was constructed for 13 types of protein atoms for each base (Table I).
Figure 2 shows examples of the scoring system. The high-score regions of the target atoms were enclosed in network plots. The scores of the known binding sites were calculated for every protein in the learning set and 10% of the average score was used for the threshold value for each of the target-atoms in Figure 2. The presented conformations of the nucleotides are those most frequently observed in the learning set.
The empirical score system is a numerical expression of nucleotide-binding motifs of proteins, in which a higher score represents a more frequently used atom position. First, the score system was used to categorize the recognition motifs of each base, by combining the automatic extraction of residues that coincide with the score peaks from the learning set and the visual inspections.
Adenine motif
The main-chain atoms characterize the recognition motifs for adenine. The main-chain carbonyl-O and amide-N atoms from a peptide of one or three consecutive residues frequently made H-bonds with the adenine N1 (acceptor) and N6 (donor) atoms at the WC edge (Denessiouk and Johnson, 2000; Cappello et al., 2002; Mao et al., 2004). Later, Johnson's group proposed the acceptor-donor-acceptor (ADA) motif on the three edges of the adenine base. The ADA motif at the WC edge contained a C2 donor along with the previously defined acceptors of N6 and N1 atoms (Denessiouk and Johnson, 2003). In the ADAbz motif, the carboxyl-O atoms of Glu or Asp, instead of the main-chain-O atoms, were used as the acceptor for the N1 atom.
In the constructed score system, an array of polar atom peaks was observed close to the WC and RB edges (Figure 2a). The highest was the main-chain-N peak (A4a in Figure 2a), which interacted with the N1 atom of adenine. About 50% (140/280) of the learning set used this peak. The main-chain-O peaks A3a (43%, 120/280) and A3b (40%, 111/280) followed the highest peak. The combination A3a-A4a-A3b corresponds to the Johnson ADA motif and was simultaneously used by 16% of the samples (45/280). The carboxyl-O peak (A9a) as the N6 donor was used by 13% (36/280) of the learning set. A9a-A4a-A3b represents the ADAbz motif and it showed 8% (22/280) frequency. Including these known motifs, 39% (108/280) of the learning set used two or more peaks of A3a, A4a, A9a and A3b and made at least two H-bonds at the WC edge through short peptide fragments. We refer to this motif at the WC edge as the A1-motif (Figure 3a).
Another high carboxyl-O peak (A9b in Figure 2a) was found close to the ribose moiety (A2-motif in Figure 3a). In 22% (62/280) of the learning set, the carboxyl-O of Asp or Glu makes bifurcated H-bonds with the O2* and O3* atoms of ribose. This interaction was previously proposed for the conserved acidic residues at the N-terminal end of the second β-strand in the Rossmann-fold fingerprint (Bellamacina, 1996).
Aromatic-N (A13 in Figure 2a) and charged-amino-N (A12) peaks were observed on the downside of the base moiety in Figure 2. The A13 peak was closer to the six-membered ring, while the A12 peak was proximal to the five-membered ring of the base. About 23% (65/280) of the learning set used at least one of these peaks (A3-motif in Figure 3a). These protein atoms appeared to make a π-electron H-bond with the base moiety and to contribute to determining the adenine position (Mao et al., 2004).
Guanine motif
For the guanine base, an array of polar atom peaks from the side chains was observed at the WC and HS edges. The highest was the carboxyl-O peak (G9 in Figure 2b). The side chain of Asp or Glu made H-bonds with the N1 and N2 donors of the base group. This recognition motif was used by 51% (22/43) of the learning set (G1-motif in Figure 3b).
The second largest peak was the main-chain-N peak (G4), which was used by 35% (15/43) of the learning set. Eleven of them have an α-helix with the N-terminal side pointed to the β-phosphate group of the nucleotide and a loop interacting with the phosphate group through main-chain amide N (G2-motif in Figure 3b). This motif corresponded to the P-loop (Saraste et al., 1990; Kinoshita et al., 1999). Although some of the adenine nucleotide-binding proteins also use the P-loop motif, it was not clearly defined in the adenine score system. This is because the proportion of this motif in the adenine learning set was relatively small (24/280). Also, it appeared that the orientation of the helix and the loop relative to the base moiety was more variable in adenine nucleotide-binding proteins than in guanine nucleotide-binding proteins. Since the current number of guanine samples (43) is much smaller than that of adenine (280), the apparent prevalence of the P-loop (G2-motif) in guanine-binding proteins might decrease once a number of examples comparable to that for adenine has accumulated.
The carbonyl-O (G8 in Figure 2b) and amino-N (G11a) peaks close to the HS edge were used by the Asn side chain in 12% (5/43) of the samples (G3-motif). Except for the preference for a helix in the G2-motif, the folds exhibited large variations among the proteins bearing these motifs, according to the SCOP classification (Murzin et al., 1995) (examples are shown in Figure 4c and d).
Nicotinamide motif
The nicotinamide binding motif is the least understood among the four nucleotides. The existence of co-binding molecules (substrates) and the variable conformations of the nicotinamide dinucleotide molecules made the definition of a motif difficult (Carugo and Argos, 1997).
Observation of the score system suggested that the ribose group was recognized through the main-chain atoms. The highest peak was carbonyl-O (N3a in Figure 2c) with 47% (29/62) frequency. In 40% (25/62) of the cases, the peaks N1, N2, N3a and N4a came from a loop connecting two secondary structures (N1-motif in Figure 3c) and made H-bonds with the hydroxyl O2* and O3* atoms of the ribose.
The polar atom peaks, N3c, N13 and N4b, were found to be the H-bond partners of the O7 (acceptor) and N7 (donor) of the carbonyl-amide group on the same plane of the base moiety (Figure 2c). These peaks primarily came from the main-chain atoms and 21% (13/62) of the learning set had a loop of 1-4 residues that provides these atoms (N2-motif in Figure 3c). As an exception, one aromatic-N peak (N13) deviated from the base plane and was located close to the C7 atom and hence was probably involved in a π-electron H-bond.
Carbonyl-O (N9) and hydroxyl-O (N10) peaks were in the proximity of the C4 atom, which is directly involved in the redox reaction. These atoms were provided by the side chain of Thr, Tyr, Ser, Gln or Asn and might work as electron donors (N3-motif in Figure 3c). At least one of the two peaks was used by 40% (25/62) of the learning set. This region tends to be sparsely occupied by main-chain or hydrophobic side-chain atoms, probably to provide an open space for substrate access. Figure 4e and f show examples of the proteins with different folds that share the three motifs.
Flavin motif
Fraaije and Mattevi (2000) proposed that a positively charged nitrogen atom or the N-terminus of an α-helix is involved in flavin recognition. In the flavin score system, peaks F12a and F4a interacted with the N1 atom and F12b and F4b interacted with the N5 atom (Figure 2d). These peaks might represent the motif proposed by Fraaije and Mattevi (F2-motif in Figure 3d).
Three additional motifs were found in the score system. Remarkable peak distributions of the main-chain atoms were found close to the N1-O2-N3 edge (Figure 2d). The peak of the main-chain amide-N (F4a) might serve as the H-bond donor for the O2 atom and was used by 64% (42/66) of the learning set. The main-chain-O (F3a) was the H-bond acceptor of the N3 atom, with 47% (31/66) frequency. In addition, main-chain aliphatic-C (F1a) and carbonyl-C (F2) peaks were observed in this region, implying that certain main-chain structures were involved in this interaction.
A visual inspection revealed two types of main-chain structures for this interaction. The F1-motif is composed of 1-3 contiguous residues (mainly from a β-strand), which H-bonded to the O2 acceptor and the N3 donor (Figure 3d). On the other hand, the F2-motif used the N-terminus of the α-helix, close to the N1 atom of flavin (Figure 2d and 3d). The F1- and F2-motifs were mutually exclusive and were used by 20% (13/66) and 30% (20/66) of the learning set, respectively. Examples of proteins with these motifs are shown in Figure 4g and h.
Carbonyl-O (F9a in Figure 2d) and hydroxyl-O (F10) peaks were found close to the N5 atom, which is used for the redox reaction. One of these peaks was used by 41% (27/66) of the learning set and Ser or Glu was preferred for this motif (F3-motif in Figure 3d). Similarly to the case of nicotinamide, this region was sparsely occupied by main-chain or hydrophobic side-chain atoms. Figure 4g shows examples of proteins with motifs F1-3.
Another carboxyl-O peak (F9b in Figure 2d) was found on the opposite side of the flavin base. This peak represented the interaction between Asp/Glu and the ribitol O3 atom (F4-motif in Figure 3d) and was used by 27% (18/66) of the learning set. Figure 4h shows one example of an F4-motif protein, fumarate reductase (α-helical ferredoxin superfamily).
Differences between adenine and guanine recognition
Three features appeared to facilitate the discrimination between adenine and guanine from the comparison between their score systems (Figure 2a and b). First, the recognition of the ribose and phosphate groups showed remarkable differences between these two bases. The region proximal to the RB edge was occupied by high peaks of the main-chain-N (A4b in Figure 2a) and carboxyl-O of the Asp/Glu residues (A9b) for adenine, whereas these peaks were not observed for guanine. On the other hand, several polar atom peaks were found around the phosphate groups for guanine (G4, G10a and G12a in Figure 2b), but not for adenine. These peaks are used to bind the ribose and phosphate groups of the nucleotide. Although these groups are identical and do not show any significant difference in the conformational preference between adenine and guanine (Moodie and Thornton, 1993), the modes of recognition appeared to differ between the two nucleotides.
Second, the adenine base was more frequently recognized through a π-electron H-bond than the guanine base. Charged-N (A12 in Figure 2a) and aromatic-N (A13) peaks were observed for adenine (A3-motif), which might make π-electron interactions with the base, but significant peaks for the same atoms were not observed for guanine.
Third, the main-chain atoms were preferred for adenine recognition (A3a, A4a and A3b in Figure 2a), whereas the side-chain atoms were used for guanine (G8, G9, G10b, G11a, G11b and G12b in Figure 2b). The A1- and G1-motifs occupy the same positions relative to the base moiety and both motifs are employed for H-bonding with the WC edges. However, adenine was mainly recognized by the main-chain atoms, whereas the Asp/Glu side chain was used for guanine.
Differences between purine and redox base recognition
The recognition modes for the purine (adenine and guanine) and redox (flavin and nicotinamide) bases showed significant differences. In the case of the two purine bases, the distribution of the hydrophobic atom peaks was mostly confined to the regions that are suitable for stacking interactions as proposed by the fuzzy recognition model (Moodie et al., 1996) (A5a, A5b, A7a and A7b peaks in Figure 2a and G5a, G5b, G7a and G7b peaks in Figure 2b). For the flavin and nicotinamide bases, however, the localization of the hydrophobic atom peaks was not significant. Although base-stacking interactions were also observed for these bases (e.g., N5b, N7a and N7c peaks in Figure 2c and F5a, F7a and F7c peaks in Figure 2d), the peaks with comparable amplitudes were scattered around the base groups (e.g., N5b and N7b peaks in Figure 2c and F5b and F7b peaks in Figure 2d). The difference might reflect the fact that the substrate-binding sites are often located close to the nucleotide-binding sites in flavin and nicotinamide proteins and the score system might have captured them.
The carbonyl-O (N9 and F9a in Figure 2c and d, respectively) and hydroxyl-O (N10 and F10 in Figure 2c and d, respectively) peaks were also characteristic of the redox bases. These atoms are thought to serve as the electron pathway for the bases.
Overall coverage of the motifs
The motifs detected in the score system were only partly retained in most of the proteins. Among the structures in the learning set, 2% (adenine) to 9% (guanine) had a complete set of the mentioned motifs (the F1- and F2-motifs for flavin are mutually exclusive; the F1-F3-F4 and F2-F3-F4 combinations are each considered a complete set), 16% (adenine/guanine) to 23% (flavin) lacked one of the motifs and 24% (flavin) to 31% (nicotinamide) had only one of the motifs. Therefore, at least one motif was conserved in 63% (adenine), 56% (guanine), 68% (nicotinamide) and 65% (flavin) of the learning set (Figure 3; the details of motif possession for each learning set protein are summarized in the supplementary Table I). The differences in the proportions for the motif possession patterns were not large among the different bases.
Application of an empirical score for binding site prediction
The empirical score was used as the prediction system of the nucleotide-binding site and the system was tested on known complex structures. The test targets were selected from the learning set (277 for adenine, 41 for guanine, 59 for nicotinamide and 63 for flavin). The prediction system output the top 100 predictions for each target protein, as the coordinates of the base moieties placed on the target protein.
Prediction test evaluation
The prediction results are summarized in Figure 5. A prediction was considered to be successful if the predicted base with the best score was within a 3.0 Å r.m.s.d. from the known base position. Figure 6a shows an example of a correct prediction. The success rate of prediction was highest for flavin, 40% (25/63). The second best was 32% (19/59) for nicotinamide, which was followed by adenine, 31% (87/277). The rate for guanine, 29% (12/41), was the lowest. Although these rates increased as more candidates were taken into account (Figure 5a), they did not increase significantly beyond the top three predictions. If the condition for success was relaxed and the top three candidates were allowed, then the success rates of prediction increased to 47% (130/277), 42% (17/41), 42% (25/59) and 51% (32/63) for adenine, guanine, nicotinamide and flavin, respectively.
Figure 5b shows the distribution of the r.m.s.d. values between the best-score prediction and the known ligand. When the r.m.s.d. is <1.0 Å, the predicted base position might be reliable for identifying the protein atoms that make specific interactions with the base group. However, only 5-10% of the predictions achieved this precision. If the threshold was raised to 5.0 Å, then the details of interactions in the predicted structure would not be reliable, although it could still be used to locate roughly the position of the binding site. About 40% of the predictions fell into this range of prediction accuracy. The test results for each protein are listed in the supplementary Table I.
The performance of this prediction system was compared with that of Kuttner et al. (2003). The predictions for 22 adenine nucleotide-binding sites on 20 proteins (Tables VII and VIII in the paper by Kuttner et al., 2003) were compared with the results of this work under conditions that were as similar as possible (supplementary Table II). As a result, the prediction system in this work showed roughly equal performance to the previous one; both systems detected 20 out of 22 binding sites. The two failed cases were different between the two systems (supplementary Table II), which suggested that a combination of the two methods might increase the performance.
Motifs and predictability
The empirical score system is expected to work particularly well on proteins with motifs. The test results were therefore also evaluated by dividing the samples into two groups: the proteins with more than two of the mentioned motifs (indicated by M in the category column in supplementary Table I) and those with one or no motif (indicated by u in supplementary Table I). As expected, the prediction efficiency for the poor-motif group was less than half of that for the canonical-motif group (Figure 4c and d).
The TP/(TP + FP) rate (TP, true positive; FP, false positive) of the predictions was analyzed. For each of the 386 target proteins, the top 10 predictions were selected for the analysis. Figure 7 shows the TP/(TP + FP) rate of the predictions against the relative score. The relative score is the score of a predicted binding site divided by the highest score of the same base type for the same target protein. The TP/(TP + FP) rate relates the expected fraction of true predictions to the score threshold. The plot shows that the rate is highest in the relative score range 0.95–1.00. One can expect that about 28% of the predictions will be within 3.0 Å r.m.s.d. of the known ligands if a relative score of 0.95 is chosen as the threshold. The rate decayed rapidly as the relative score decreased, meaning that the risk of false prediction increased steadily. The rate was virtually zero below a relative score of 0.5, which defined the lower limit of prediction reliability. The TP/(TP + FP) rate indicated that the empirical score succeeded in highlighting the real sites against others.
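The relative score and the TP/(TP + FP) rate defined above can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the procedure used for Figure 7: the function names, the bin edges and the `(relative_score, rmsd)` input layout are hypothetical, the best score is assumed to be positive, and a prediction is counted as a true positive when its r.m.s.d. is at or below the 3.0 Å cutoff.

```python
def relative_scores(scores):
    """Divide each score by the best score for the same base type and protein.

    Assumes the best score is positive (an assumption of this sketch).
    """
    best = max(scores)
    return [s / best for s in scores]

def precision_by_bin(predictions, edges, cutoff=3.0):
    """TP/(TP + FP) per relative-score bin.

    predictions: list of (relative_score, rmsd) pairs.
    edges: ascending bin edges; bins are [lo, hi), the last one is [lo, hi].
    Returns a list of (lo, hi, rate), with rate None for empty bins.
    """
    out = []
    n_bins = len(edges) - 1
    for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
        last = i == n_bins - 1
        in_bin = [r for s, r in predictions
                  if lo <= s < hi or (last and s == hi)]
        tp = sum(1 for r in in_bin if r <= cutoff)  # true positives
        out.append((lo, hi, tp / len(in_bin) if in_bin else None))
    return out
```

For example, `precision_by_bin(preds, [0.0, 0.5, 0.95, 1.0])` reports the rate separately for the 0.95–1.00 range and for the lower ranges, which is the kind of comparison described in the text.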
However, even at the highest relative score, the predictions were contaminated with a considerable number of false positives. To understand the causes of the false-positive predictions, the test targets with poor prediction results (>3.0 Å r.m.s.d. from the known ligand) were inspected visually. Examples of some exceptional cases, i.e. a false result for a canonical motif and a correct result for a protein with no motif, are shown in Figure 6. In many of the false cases, the best prediction shared its center of mass with the known ligand, but the base was rotated, which increased the r.m.s.d. value. In other frequent cases, the base plane of the known ligand was correctly detected, but a translation within the plane increased the r.m.s.d.
It appeared that the high-score peaks of the hydrophobic atoms that are used for base-stacking interactions were responsible for these false cases. Stacking is one of the important interactions between proteins and base groups and was detected as aromatic- or aliphatic-C peaks in the score systems for all of the bases (Figure 2). This type of interaction was proposed as a fuzzy recognition motif (Moodie et al., 1996). However, this interaction is less strict than the H-bond in terms of binding geometry and can produce a best-score prediction that differs from the real binding position. Figure 6b shows a typical case. tRNA–guanine transglycosylase (PDB code 1it7) uses the canonical G1-motif and Phe229 for base stacking with the guanine base. However, the score system preferred the proximal Phe99 as the base-stacking partner and made a false judgment.
Hence the base-stacking interaction, which is the most general and important motif for nucleotide binding, was unexpectedly found to be an obstacle in rational prediction. One possible strategy to increase the prediction efficiency might be an appropriate weighting between scores for hydrophobic and hydrophilic interactions.
| Conclusion |
|---|
By using nucleotides as examples, this work was intended to demonstrate mass data processing, from known protein structures (PDB) to binding-motif categorization and prediction, by methods that should be fully automated so that they can be extended to a wider range of ligand molecules. This process addresses a major use of the large-scale compilation of structure information currently being generated by the structural genomics projects. So far, motif detection and binding-site prediction for nucleotides have been done on relatively specific sets of proteins with a certain amount of manual data handling. The results of this work provide insight into the capability of a generalized, mass-production method for the same purposes. The results suggested that about 56–68% of non-redundant proteins bear a classifiable binding motif, at least partially, and that 29–40% of the binding sites are empirically detectable.
| Notes |
|---|
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org
| Acknowledgements |
|---|
This work was supported by a research grant endorsed by the New Energy and Industrial Technology Development Organization (NEDO).
| References |
|---|
Bellamacina,C.R. (1996) FASEB J., 11, 1257–1269.
Berman,H.M., Bhat,T.N., Bourne,P.E., Feng,Z., Gilliland,G., Weissig,H. and Westbrook,J. (2000) Nat. Struct. Biol., 7, s957–s959.
Cappello,V., Tramontano,A. and Koch,U. (2002) Proteins, 47, 106–115.
Carugo,O. and Argos,P. (1997) Proteins, 28, 10–28.
Denessiouk,K.A. and Johnson,M.S. (2000) Proteins, 38, 310–326.
Denessiouk,K.A. and Johnson,M.S. (2003) J. Mol. Biol., 333, 1025–1043.
Denessiouk,K.A., Rantanen,V.V. and Johnson,M.S. (2001) Proteins, 44, 282–291.
Fraaije,M.W. and Mattevi,A. (2000) Trends Biochem. Sci., 25, 126–132.
Ishida,H., Shirai,T., Matsuda,Y., Kato,Y., Ohno,M., Isaji,T. and Yamane,T. (2000) J. Biochem., 128, 561–574.
Kinoshita,K., Sadanami,K., Kidera,A. and Go,N. (1999) Protein Eng., 12, 11–14.
Kobayashi,N. and Go,N. (1997a) Eur. Biophys. J., 26, 135–144.
Kobayashi,N. and Go,N. (1997b) Nat. Struct. Biol., 4, 6–7.
Kuttner,Y.Y., Sobolev,V., Raskind,A. and Edelman,M. (2003) Proteins, 52, 400–411.
Mao,L., Wang,Y., Liu,Y. and Hu,X. (2004) J. Mol. Biol., 336, 787–807.
Moodie,S.L. and Thornton,J.M. (1993) Nucleic Acids Res., 21, 1369–1380.
Moodie,S.L., Mitchell,J.B. and Thornton,J.M. (1996) J. Mol. Biol., 263, 486–500.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.
Nobeli,I., Laskowski,R.A., Valdar,W.S. and Thornton,J.M. (2001) Nucleic Acids Res., 29, 4294–4309.
Saraste,M., Sibbald,P.R. and Wittinghofer,A. (1990) Trends Biochem. Sci., 15, 430–434.
Schulz,G.E. (1992) Curr. Opin. Struct. Biol., 2, 61–67.
Shionyu-Mitsuyama,C., Shirai,T., Ishida,H. and Yamane,T. (2003) Protein Eng., 16, 467–478.
Sobolev,V., Sorokine,A., Prilusky,J., Abola,E.E. and Edelman,M. (1999) Bioinformatics, 4, 327–332.
Swindells,M.B. (1993) Protein Sci., 2, 2146–2153.
Walker,J.E., Saraste,M., Runswick,M.J. and Gay,N.J. (1982) EMBO J., 1, 945–951.
Westbrook,J., Feng,Z., Chen,L., Yang,H. and Berman,H.M. (2003) Nucleic Acids Res., 31, 489–491.
Yokoyama,S. et al. (2000) Nat. Struct. Biol., 7, s943–s945.
Zhang,C. and Kim,S.H. (2003) Curr. Opin. Chem. Biol., 7, 28–32.
Zhao,S., Morris,G.M., Olson,A.J. and Goodsell,D.S. (2001) J. Mol. Biol., 314, 1245–1255.
Received July 12, 2005; revised November 12, 2005; accepted November 12, 2005.
Edited by Haruki Nakamura