The least-squares (LS) fitting procedures presented below make use of well known mathematics. Indeed, the methods are so well known and widely used that it is somewhat difficult to locate the original references. In our previous effort to resolve the discrepancies among nucleic acid conformational analysis programs, we came across a variety of LS fitting procedures. Here we provide a detailed description, with step-by-step examples, of our implementation in 3DNA of two LS fitting algorithms based on a covariance matrix and its eigen-system. This post is the revised version of a note first made available in the “Technical Details” section of earlier 3DNA websites.
LS fitting between standard and experimental bases
Three analysis schemes — CompDNA, Curves/Curves+, and RNA — use LS procedures to fit a standard base with an embedded reference frame to an observed base structure. CompDNA
and Curves/Curves+
take advantage of the conventional approach of McLachlan [“Least Squares Fitting of Two Structures.” J. Mol. Biol., 128, 74-79 (1979)], while the RNA
program implements a closed-form solution of absolute orientation using unit quaternions first introduced by Horn. The two algorithms are mathematically equivalent for the most general cases, since the unit quaternion can be transformed to the rotation matrix given by McLachlan. The Horn method, however, is more straightforward and generally applicable; it can be applied even when one or both of the structures are perfectly planar, whereas the McLachlan approach fails.
Here we use the ideal adenine geometry derived from the high resolution crystal structures of model nucleosides, nucleotides, and bases. The x-, y-, and z-coordinates of the standard base, taken from the NDB, are listed below in the columns labeled sx
, sy
, and sz
, respectively. s_(average)
is the geometric center of the base.
sx sy sz 1 N9 0.213 0.660 1.287 2 C4 0.250 2.016 1.509 3 N3 0.016 2.995 0.619 4 C2 0.142 4.189 1.194 5 N1 0.451 4.493 2.459 6 C6 0.681 3.485 3.329 7 N6 0.990 3.787 4.592 8 C5 0.579 2.170 2.844 9 N7 0.747 0.934 3.454 10 C8 0.520 0.074 2.491 ------------------------------------ s_(average): 0.4589 2.4803 2.3778
We similarly describe the coordinates of one of the adenine bases (the fifth nucleotide in the sequence strand) from the high resolution (1.4 Å) self-complementary d(CGCGAATTCGCG) dodecamer duplex determined by Williams and co-workers (PDB id: 355d). The experimental xyz coordinates are listed below in the columns labeled ex
, ey
, and ez
. The geometric center is e_(average)
. Note that the atomic serial numbers from the PDB (first column) have been rearranged so that the atoms are in the same order as those of the ideal base listed above.
ex ey ez 91 N9 16.461 17.015 14.676 100 C4 15.775 18.188 14.459 99 N3 14.489 18.449 14.756 98 C2 14.171 19.699 14.406 97 N1 14.933 20.644 13.839 95 C6 16.223 20.352 13.555 96 N6 16.984 21.297 12.994 94 C5 16.683 19.056 13.875 93 N7 17.918 18.439 13.718 92 C8 17.734 17.239 14.207 ------------------------------------ e_(average):16.1371 19.0378 14.0485
We collect the two sets of xyz coordinates in the 10 × 3 matrices S
and E
corresponding respectively to the standard and experimental bases. We then construct the 3 × 3 covariance matrix C
between S
and E
using the following formula:
1 1 C = ------- [S' E - --- S' i i' E] n - 1 n
= 0.2782 0.2139 -0.1601 -1.4028 1.9619 -0.2744 1.0443 0.9712 -0.6610
Here n
, the number of atoms in each base, is 10, and i
is an n x 1 column vector consisting of only ones. S'
and i'
are the transpose of matrix S
and column vector i
, respectively.
From the nine elements of the C
matrix, we subsequently generate the 4 × 4 real symmetric matrix M
using the expression:
| c11+c22+c33 c23-c32 c31-c13 c12-c21 | M = | c23-c32 c11-c22-c33 c12+c21 c31+c13 | | c31-c13 c12+c21 -c11+c22-c33 c23+c32 | | c12-c21 c31+c13 c23+c32 -c11-c22+c33 |
= 1.5792 -1.2456 1.2044 1.6167 -1.2456 -1.0228 -1.1890 0.8842 1.2044 -1.1890 2.3447 0.6968 1.6167 0.8842 0.6968 -2.9011
The largest eigenvalue of matrix M
is 4.0335, and its corresponding unit eigenvector is:
[ q0 q1 q2 q3 ] = [ 0.6135 -0.2878 0.7135 0.1780 ]
The rotation matrix R
is deduced from the above eigenvector as below:
| q0q0+q1q1-q2q2-q3q3 2(q1q2-q0q3) 2(q1q3+q0q2) | R = | 2(q2q1+q0q3) q0q0-q1q1+q2q2-q3q3 2(q2q3-q0q1) | | 2(q3q1-q0q2) 2(q3q2+q0q1) q0q0-q1q1-q2q2+q3q3 |
= -0.0817 -0.6291 0.7730 -0.1923 0.7710 0.6072 -0.9779 -0.0990 -0.1839
Following coordinate transformation with matrix R
, the origin of the standard base is found to be displaced from the experimental structure by:
o = e_(average) - s_(average) R' = [15.8969 15.7701 15.1802]
The least-squares fitted coordinates (F
) of the standard base atoms on the experimental structure are then given by:
F = S R' + i o = 16.4592 17.0194 14.6699 15.7747 18.1925 14.4586 14.4899 18.4519 14.7542 14.1729 19.6974 14.4070 14.9343 20.6404 13.8420 16.2222 20.3472 13.5569 16.9832 21.2875 12.9925 16.6829 19.0585 13.8760 17.9183 18.4437 13.7219 17.7335 17.2396 14.2062
Here S
is the (n x 3) matrix of original coordinates of the standard base, and as noted above, i
is an n x 1 column vector consisting of only ones.
The difference matrix (D
) between F
and E
, the (n x 3) matrix of original coordinates of the experimental base, and the root-mean-square (RMS) deviation between the two structures are found as:
D = E - F = 0.0018 -0.0044 0.0061 0.0003 -0.0045 0.0004 -0.0009 -0.0029 0.0018 -0.0019 0.0016 -0.0010 -0.0013 0.0036 -0.0030 0.0008 0.0048 -0.0019 0.0008 0.0095 0.0015 0.0001 -0.0025 -0.0010 -0.0003 -0.0047 -0.0039 0.0005 -0.0006 0.0008 RMS deviation = 0.0054
It should be noted that if the standard base is already defined in terms of its reference frame, as in 3DNA (e.g., $X3DNA/config/Atomic_A.pdb
), the vector o
and the matrix R
represent the best-fitted coordinate frame of the experimental base. Moreover, the three axes of the frame given by R
are guaranteed to be orthonormal. If you want to get an insight of the LS fitting algorithm and a better understanding of how 3DNA derives its base reference frame, it’d be a valuable experience to repeat the above procedure with $X3DNA/config/Atomic_A.pdb
.
Note: the algorithm does not apply to a molecule vs its inversion (an improper rotation) — thanks to Boris Averkiev for reporting this subtle point (see comments below). One possible remedy is to treat this edge case separately.
Base normal
Rather than fit a standard base to experimental coordinates, the CEHS, FREEHELIX, and NUPARM analyses perform a fitting of a LS plane to a set of atoms in order to define the base and base-pair normals. The covariance matrix based on the n x 3 matrix of experimental Cartesian coordinates E
is diagonalized to find the vector normal to the best plane. Specifically, C
is obtained using the above formula with S
substituted by E
. The normal vector then lies along the eigenvector that corresponds to the smallest eigenvalue. Note that the coefficient 1/(n-1)
in the formula for calculating C
has no effect on the direction of the eigenvectors but scales the magnitudes of the eigenvalues.
Using the above adenine base from the high resolution dodecamer duplex as an example, the covariance matrix C
is:
C = 1.6680 -0.5015 -0.3253 -0.5015 2.0670 -0.5840 -0.3253 -0.5840 0.3061
The smallest eigenvalue of C
, 8.26e-5
, indicates that the base is almost perfectly planar. The corresponding unit eigenvector corresponding to the base normal is:
Base normal: 0.2737 0.3224 0.9062