Automatic identification of nucleotides

Any analysis of nucleic acid structures starts with the identification of nucleotides (nts), the basic building units. As per the PDB convention, each nt (like any other ligand) is specified by a three-letter identifier. For example, the four standard RNA nts are ..A, ..C, ..G, and ..U, and the four corresponding standard DNA nts are .DA, .DC, .DG, and .DT. Note that here, for visualization purposes, each space is represented by a dot (.). In practice, the following codes for the five standard DNA/RNA nts (ADE, CYT, GUA, THY, and URA) are also commonly encountered, among other variants.

On top of the standard nts, there are numerous modified ones, each assigned a unique three-letter code. In the classic yeast phenylalanine tRNA (PDB id: 1ehz), 14 out of the 76 nts are modified, as shown in Fig. 1 below.

Modified nucleotides in yeast tRNA
Fig. 1: Modified nucleotides in yeast phenylalanine tRNA 1ehz

It is challenging to maintain a comprehensive and up-to-date list of the ever-increasing nts encountered in the PDB and in molecular dynamics (MD) simulation packages (e.g., AMBER, GROMACS, and CHARMM). Thus, as of today, some well-known DNA/RNA structural bioinformatics tools can handle only standard nts or a limited list of modified ones.

From early on in the development of 3DNA, I observed that all recognized nts have a core six-membered ring, with atoms named N1, C2, N3, C4, C5, C6 consecutively (see Fig. 2 below). Purines have three additional atoms, named N7, C8, N9. So it is feasible to automatically identify nts, and classify them as pyrimidines or purines, based on the common core skeleton shared by all of them. Moreover, the ‘skeleton’ is not affected by any possible tautomeric or protonation state.
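The naming observation alone can be sketched in a few lines of Python. This is only a name-based illustration, not the 3DNA/DSSR code: the function name `classify_ring` and the rule of requiring all six (or all nine) names are my assumptions here, and the geometric check described later is left out.

```python
# Names of the six-membered ring shared by all bases, plus the purine extras.
SIX_RING = ("N1", "C2", "N3", "C4", "C5", "C6")
PURINE_EXTRA = ("N7", "C8", "N9")


def classify_ring(atom_names):
    """Classify a residue as 'purine' or 'pyrimidine' from atom names only.

    Hypothetical helper: returns None when the six core ring names are not
    all present. No geometry is checked in this sketch.
    """
    names = {name.strip() for name in atom_names}
    if not all(name in names for name in SIX_RING):
        return None
    if all(name in names for name in PURINE_EXTRA):
        return "purine"
    return "pyrimidine"


print(classify_ring(["N9", "C8", "N7", "C5", "C6", "N1", "C2", "N3", "C4"]))
print(classify_ring(["N1", "C2", "N3", "C4", "C5", "C6"]))
# Ring atoms carrying a prime in their names do not match the scheme.
print(classify_ring(["N1'", "C2'", "N3'", "C4'", "C5'", "C6'"]))
```

Since only names are compared, tautomeric or protonation states (which change hydrogens, not ring-atom names) play no role in the outcome.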

Common names of core base atoms
Fig. 2: Identification of nts in 3DNA/DSSR based on atomic names and planar geometry

Early versions of 3DNA employed only three atoms (N1, C2, and C6) and three distances to identify a nt. Purines were further discriminated by the N9 atom and the N1–N9 distance. While developing DSSR, I revised the nt-identification algorithm to use a least-squares fitting procedure that makes use of all available base ring atoms instead of selected ones. The same new algorithm has also been adapted into find_pair, analyze, and other programs in 3DNA, as of v2.2.

As always, the idea is best illustrated with a worked example. Guanine in its standard base reference frame, with the following coordinates of its nine ring atoms, is chosen as the reference for the least-squares fitting. See file Atomic_G.pdb in the 3DNA distribution, and also Table 1 of the report A Standard Reference Frame for the Description of Nucleic Acid Base-pair Geometry.

ATOM      2  N9    G A   1      -1.289   4.551   0.000
ATOM      3  C8    G A   1       0.023   4.962   0.000
ATOM      4  N7    G A   1       0.870   3.969   0.000
ATOM      5  C5    G A   1       0.071   2.833   0.000
ATOM      6  C6    G A   1       0.424   1.460   0.000
ATOM      8  N1    G A   1      -0.700   0.641   0.000
ATOM      9  C2    G A   1      -1.999   1.087   0.000
ATOM     11  N3    G A   1      -2.342   2.364   0.001
ATOM     12  C4    G A   1      -1.265   3.177   0.000

With a least-squares fitting procedure, any three ring atoms suffice. We no longer need to select specific atoms explicitly, as we did previously (N1, C2, C6, and N9), which allows for possible modifications at these atoms.
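To make the fitting step concrete, here is a minimal pure-Python sketch that matches ring atoms by name against the guanine reference listed above and returns the rmsd of the optimal superposition. It is not the 3DNA/DSSR implementation. The name `ring_fit_rmsd` is hypothetical, and I use Horn's quaternion formulation with a simple shifted power iteration for the largest eigenvalue, which is one of several equivalent ways to do a least-squares fit.

```python
import math

# Ring atoms of guanine in the standard base reference frame,
# copied from the listing above (Atomic_G.pdb).
REF_G = {
    "N9": (-1.289, 4.551, 0.000), "C8": (0.023, 4.962, 0.000),
    "N7": (0.870, 3.969, 0.000), "C5": (0.071, 2.833, 0.000),
    "C6": (0.424, 1.460, 0.000), "N1": (-0.700, 0.641, 0.000),
    "C2": (-1.999, 1.087, 0.000), "N3": (-2.342, 2.364, 0.001),
    "C4": (-1.265, 3.177, 0.000),
}


def _centered(points):
    """Shift a list of (x, y, z) points so that their centroid is the origin."""
    n = len(points)
    center = [sum(p[k] for p in points) / n for k in range(3)]
    return [[p[k] - center[k] for k in range(3)] for p in points]


def ring_fit_rmsd(atoms, ref=REF_G, n_iter=1000):
    """Least-squares rmsd between name-matched ring atoms and the reference.

    atoms maps atom name -> (x, y, z). Returns (rmsd, n_matched), or None
    when fewer than three ring atoms can be matched by name.
    """
    names = [name for name in ref if name in atoms]
    if len(names) < 3:
        return None
    p = _centered([ref[name] for name in names])
    q = _centered([atoms[name] for name in names])
    # Correlation matrix: s[a][b] = sum_i p_i[a] * q_i[b]
    s = [[sum(pi[a] * qi[b] for pi, qi in zip(p, q)) for b in range(3)]
         for a in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    # Horn's symmetric 4x4 matrix; its largest eigenvalue gives the best fit.
    nmat = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    # Shift by the Frobenius norm so all eigenvalues are non-negative,
    # then run a plain power iteration towards the top eigenvector.
    shift = math.sqrt(sum(x * x for row in nmat for x in row))
    v = [1.0, 0.5, 0.25, 0.125]
    for _ in range(n_iter):
        w = [sum(nmat[i][j] * v[j] for j in range(4)) + shift * v[i]
             for i in range(4)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient of the (unit) vector v approximates the top eigenvalue.
    lam = sum(v[i] * nmat[i][j] * v[j] for i in range(4) for j in range(4))
    total = sum(x * x for point in p + q for x in point)
    msd = max(0.0, (total - 2.0 * lam) / len(names))
    return math.sqrt(msd), len(names)


# Usage: the six pyrimidine-ring atoms of the reference, rotated by 90 degrees
# about z and translated. A rigid motion should give an rmsd of essentially 0.
moved = {name: (-y + 10.0, x - 5.0, z + 3.0)
         for name, (x, y, z) in REF_G.items() if name not in ("N7", "C8", "N9")}
rmsd, n_matched = ring_fit_rmsd(moved)
print(n_matched, round(rmsd, 3))
```

Because atoms are paired by name, their order in the input does not matter, and a residue missing some ring atoms is still fitted on whatever subset (three or more) is present.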

Using four nts of 1ehz (G1, 2MG10, H2U16, and PSU39; see Fig. 1 above) as examples, the following list gives the atomic coordinates of the base ring atoms and the root-mean-square deviation (rmsd) of each least-squares fit. Of course, when performing least-squares fitting, the names of corresponding atoms must match (note the different ordering of atoms for H2U and PSU in the list vs. the standard G reference above).

#G1, rmsd=0.008
ATOM     14  N9    G A   1      51.628  45.992  53.798  1.00 93.67           N  
ATOM     15  C8    G A   1      51.064  46.007  52.547  1.00 92.60           C  
ATOM     16  N7    G A   1      51.379  44.966  51.831  1.00 91.19           N  
ATOM     17  C5    G A   1      52.197  44.218  52.658  1.00 91.47           C  
ATOM     18  C6    G A   1      52.848  42.992  52.425  1.00 90.68           C  
ATOM     20  N1    G A   1      53.588  42.588  53.534  1.00 90.71           N  
ATOM     21  C2    G A   1      53.685  43.282  54.716  1.00 91.21           C  
ATOM     23  N3    G A   1      53.077  44.429  54.946  1.00 91.92           N  
ATOM     24  C4    G A   1      52.356  44.836  53.879  1.00 92.62           C  
#2MG10, rmsd=0.018
HETATM  207  N9  2MG A  10      61.581  47.402  18.752  1.00 42.14           N  
HETATM  208  C8  2MG A  10      62.199  48.621  18.635  1.00 40.38           C  
HETATM  209  N7  2MG A  10      63.494  48.534  18.422  1.00 40.70           N  
HETATM  210  C5  2MG A  10      63.745  47.167  18.395  1.00 43.82           C  
HETATM  211  C6  2MG A  10      64.965  46.449  18.205  1.00 43.45           C  
HETATM  213  N1  2MG A  10      64.767  45.086  18.293  1.00 44.71           N  
HETATM  214  C2  2MG A  10      63.541  44.482  18.486  1.00 47.21           C  
HETATM  217  N3  2MG A  10      62.411  45.125  18.614  1.00 45.85           N  
HETATM  218  C4  2MG A  10      62.574  46.451  18.582  1.00 43.27           C  
#H2U16, rmsd=0.188
HETATM  336  N1  H2U A  16      77.347  53.323  34.582  1.00 91.19           N  
HETATM  337  C2  H2U A  16      76.119  52.865  34.160  1.00 92.39           C  
HETATM  339  N3  H2U A  16      75.123  52.894  35.107  1.00 93.28           N  
HETATM  340  C4  H2U A  16      75.289  52.711  36.458  1.00 93.34           C  
HETATM  342  C5  H2U A  16      76.696  52.479  36.909  1.00 93.77           C  
HETATM  343  C6  H2U A  16      77.717  53.238  36.039  1.00 93.22           C  
#PSU39, rmsd=0.004
HETATM  845  N1  PSU A  39      74.080  36.066   5.459  1.00 75.82           N  
HETATM  846  C2  PSU A  39      74.415  36.835   4.354  1.00 75.59           C  
HETATM  847  N3  PSU A  39      75.735  36.769   3.984  1.00 76.29           N  
HETATM  848  C4  PSU A  39      76.728  36.038   4.591  1.00 77.28           C  
HETATM  849  C5  PSU A  39      76.307  35.280   5.732  1.00 77.93           C  
HETATM  850  C6  PSU A  39      75.025  35.316   6.112  1.00 76.07           C  

As noted in the DSSR paper, the rmsd is normally <0.1 Å since base rings are rigid. To account for experimental error and special non-planar cases, such as H2U in 1ehz, the default rmsd cutoff is set to 0.28 Å.
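As a small illustration of how the cutoff acts on the four fits listed above, the following sketch applies it to the reported rmsd values. The helper name `passes_ring_fit`, the inclusive comparison (`<=`), and the explicit minimum of three matched atoms are my assumptions for the sketch, not details taken from the DSSR code.

```python
DEFAULT_RMSD_CUTOFF = 0.28  # in Angstrom, the default quoted above


def passes_ring_fit(rmsd, n_matched, cutoff=DEFAULT_RMSD_CUTOFF, min_atoms=3):
    """Accept a residue as a nt if enough ring atoms fit within the cutoff."""
    return n_matched >= min_atoms and rmsd <= cutoff


# (rmsd, number of matched ring atoms) as reported in the listing above.
fits = {
    "G1": (0.008, 9),
    "2MG10": (0.018, 9),
    "H2U16": (0.188, 6),
    "PSU39": (0.004, 6),
}
for label, (rmsd, n_matched) in fits.items():
    print(label, rmsd, passes_ring_fit(rmsd, n_matched))
```

All four residues pass. H2U16 would fail under a tighter cutoff of 0.1 Å, which is why the default is more generous than the typical rmsd of a rigid, planar ring.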

With the above detailed algorithm, DSSR (and the 3DNA find_pair/analyze programs) can automatically identify virtually all ‘recognizable’ nts in the PDB. A survey performed in June 2015 detected 630 different types of modified nucleotides in the PDB.

It is worth noting the following points:

  • The choice of standard G instead of A as the reference base has no impact on the results. As a matter of fact, the rmsd between G and A is only 0.04 Å. Note also the generous default cutoff of 0.28 Å.
  • The method obviously depends on proper naming of the ring atoms. Specifically, the base ring atoms must be named N1, C2, N3, C4, C5, C6 consecutively, with purines having three additional atoms named N7, C8, N9. Thus, under this scheme, TPP (thiamine diphosphate) would not be recognized as a nt by default, simply because of the extra prime (′) in the names of atoms in the six-membered ring. In nucleic acid structures, the prime symbol is normally associated with atoms of the sugar moiety (e.g., the C5′ atom).

Molecular image of TPP (thiamine diphosphate)
Fig. 3: TPP (thiamine diphosphate) would not be recognized as a nt.

  • On the other hand, nt cofactors in an otherwise ‘pure’ protein structure will also be recognized. One example is the two AMP (adenosine monophosphate) ligands in PDB entry 12as. This extra identification of nts does no harm in such cases. As shown in the analysis of the SAM-I riboswitch in the DSSR paper, taking the SAM ligand as a nt in base triplet recognition is a neat feature.
  • Once a nucleotide has been identified and classified as a purine or pyrimidine, exocyclic atoms can be used for further assignment: O6 or N2 distinguishes guanine from adenine, N4 separates cytosine from thymine and uracil, and C7 (or C5M, the methyl group) differentiates thymine from uracil. For some modified nts, the distinctions within purines or pyrimidines may not be that obvious. For example, inosine may be taken as a modified guanine or adenine. However, this ambiguity has no significant effect on the calculated base-pair parameters.
  • In DSSR and 3DNA, each identified nt is assigned a one-letter shorthand code: the standard ..A, .DA, and ADE (among a few other common variations) are shortened to upper-case A, and similarly for C, G, T, and U. Modified nts, on the other hand, are shortened to their corresponding lower-case symbol. For example, modified guanines such as 2MG and M2G in the yeast phenylalanine tRNA (see Fig. 1 above) are assigned g. So in 3DNA/DSSR output, the upper and lower cases of bases (e.g., nts=3 gCG A.2MG10,A.C25,A.G45) convey special meanings.
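The last two points can be combined into a short sketch: standard residue names map to an upper-case letter, and anything else is assigned a lower-case letter from its exocyclic atoms. The function `base_letter`, the table `STANDARD` (limited to the names mentioned in this post), and the atom-name lists in the usage lines are illustrative assumptions. The real programs handle more name variants and special cases than shown here.

```python
# Standard residue names mentioned in this post, mapped to one-letter codes.
STANDARD = {
    "A": "A", "DA": "A", "ADE": "A",
    "C": "C", "DC": "C", "CYT": "C",
    "G": "G", "DG": "G", "GUA": "G",
    "DT": "T", "THY": "T",
    "U": "U", "URA": "U",
}
PURINE_EXTRA = {"N7", "C8", "N9"}


def base_letter(res_name, atom_names):
    """One-letter code: upper case for standard nts, lower case otherwise.

    Assumes the residue has already been identified as a nt.
    """
    name = res_name.strip().upper()
    if name in STANDARD:
        return STANDARD[name]
    names = {atom.strip() for atom in atom_names}
    if PURINE_EXTRA <= names:
        # O6 or N2 distinguishes guanine from adenine.
        return "g" if ("O6" in names or "N2" in names) else "a"
    if "N4" in names:
        return "c"
    # C7 (or C5M, the methyl group) differentiates thymine from uracil.
    if "C7" in names or "C5M" in names:
        return "t"
    return "u"


ring6 = ["N1", "C2", "N3", "C4", "C5", "C6"]
ring9 = ring6 + ["N7", "C8", "N9"]
print(base_letter("2MG", ring9 + ["O6", "N2", "CM2"]))  # modified guanine
print(base_letter("  C", ring6 + ["O2", "N4"]))         # standard cytosine
print(base_letter("H2U", ring6 + ["O2", "O4"]))         # modified uracil
```

With these inputs the sketch gives g, C, and u, in line with the gCG shorthand quoted above for 2MG10, C25, and G45.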

Xiang-Jun Lu