It gives me great pleasure to announce that the 3DNA/DSSR project is now funded by the NIH R24GM153869 grant, titled "X3DNA-DSSR: a resource for structural bioinformatics of nucleic acids". I am deeply grateful for the opportunity to continue working on a project that has basically defined who I am. It was a tough time during the funding gap over the past few years. Nevertheless, I have experienced and learned a lot, and witnessed miracles enabled by enthusiastic users.
Since late 2020 when I lost my R01 grant, DSSR has been licensed by the Columbia Technology Ventures (CTV). I appreciate the numerous users (including big pharma) who purchased a DSSR Pro License or a DSSR Basic paid License. Thanks to the NIH R24GM153869 grant, we are pleased to provide DSSR Basic free of charge to the academic community. Academic Users may submit a license request for DSSR Basic or DSSR Pro by clicking "Express Licensing" on the CTV landing page. Commercial users may inquire about pricing and licensing terms by emailing techtransfer@columbia.edu, copying xiangjun@x3dna.org.
The current version of DSSR is v2.4.5-2024sep24, which contains miscellaneous bug fixes (e.g., chain id with > 4 chars) and minor improvements. This release synchronizes with the new R24 funding, which will bring the project to the next level. All existing users are encouraged to upgrade their installation.
Lots of exciting things will happen for the project. The first thing is to make DSSR freely accessible to the academic community. In the past couple of weeks, CTV has already issued quite a few DSSR Basic Academic licenses to users from all over the world. So the demand is high, and it will become stronger as more academic users become aware of DSSR. I'm closely monitoring the 3DNA Forum, and am always ready to answer users' questions.
I am committed to making DSSR a brand that stands for quality and value. By virtue of its unmatched functionality, usability, and support, DSSR saves users a substantial amount of time and effort when compared to other options. My track record throughout the years has unambiguously demonstrated my dedication to this solid software product.
DSSR Basic contains all features described in the three DSSR-related papers, and includes the originally separate SNAP program (still unpublished) for analyzing DNA/RNA-protein complexes. The Pro version integrates the classic 3DNA functionality, plus advanced modeling routines, with email/Zoom/phone support.
From early on, DSSR-derived nucleic acid secondary structures have been written in the compact dot-bracket notation (.dbn) with pseudo-knot information. To better connect DSSR to the 2D world, I recently looked into the connect (.ct) format, which was first introduced by Zuker’s mfold program. Over time, the .ct format has become one of the most commonly used RNA secondary structure formats, and it is more expressive than the .dbn format (see below).
As of v1.0, for each analyzed structure, DSSR produces two secondary structure files with default names dssr-2ndstrs.dbn and dssr-2ndstrs.ct, in .dbn and .ct formats, respectively. Using the 27-nucleotide (nt) RNA fragment 1msy as an example, the DSSR-derived secondary structures in .dbn and .ct formats are shown below:
In dot-bracket notation (.dbn) [dssr-2ndstrs.dbn]
------------------------------------------------------
>1msy nts=27 DSSR-derived secondary structure
UGCUCCUAGUACGUAAGGACCGGAGUG
.(((((.....(....)....))))).
------------------------------------------------------
In connect format (.ct) [dssr-2ndstrs.ct]
------------------------------------------------------
27 DSSR-derived secondary structure in '1msy'
1 U 0 2 0 2647 # name=A.U2647
2 G 1 3 26 2648 # name=A.G2648, pairedNt=A.U2672
3 C 2 4 25 2649 # name=A.C2649, pairedNt=A.G2671
4 U 3 5 24 2650 # name=A.U2650, pairedNt=A.A2670
5 C 4 6 23 2651 # name=A.C2651, pairedNt=A.G2669
6 C 5 7 22 2652 # name=A.C2652, pairedNt=A.G2668
7 U 6 8 0 2653 # name=A.U2653
8 A 7 9 0 2654 # name=A.A2654
9 G 8 10 0 2655 # name=A.G2655
10 U 9 11 0 2656 # name=A.U2656
11 A 10 12 0 2657 # name=A.A2657
12 C 11 13 17 2658 # name=A.C2658, pairedNt=A.G2663
13 G 12 14 0 2659 # name=A.G2659
14 U 13 15 0 2660 # name=A.U2660
15 A 14 16 0 2661 # name=A.A2661
16 A 15 17 0 2662 # name=A.A2662
17 G 16 18 12 2663 # name=A.G2663, pairedNt=A.C2658
18 G 17 19 0 2664 # name=A.G2664
19 A 18 20 0 2665 # name=A.A2665
20 C 19 21 0 2666 # name=A.C2666
21 C 20 22 0 2667 # name=A.C2667
22 G 21 23 6 2668 # name=A.G2668, pairedNt=A.C2652
23 G 22 24 5 2669 # name=A.G2669, pairedNt=A.C2651
24 A 23 25 4 2670 # name=A.A2670, pairedNt=A.U2650
25 G 24 26 3 2671 # name=A.G2671, pairedNt=A.C2649
26 U 25 27 2 2672 # name=A.U2672, pairedNt=A.G2648
27 G 26 0 0 2673 # name=A.G2673
------------------------------------------------------
Presumably, the .ct format is very simple, and examining a sample file as shown above would give one a pretty good sense of what each column is about. While there exist many oversimplified descriptions of the .ct format on the web, the most detailed and accurate explanation is from the mfold manual:
The "ct" file (connect table) contains the sequence and base pair information, and is meant to be an input file for a structure drawing program. In addition to containing base pair information, it also lists the 5′ and 3′ neighbor of each base, allowing for the representation of circular RNA or multiple molecules. The ct file also lists the historical base numbering in the original sequence, as bases and base pairs are numbered according from 1 to the size of the folded segment. A portion of a ct file is displayed in Figure 12.
Figure 12: The ct file for the second and final folding of S. cerevisiae Phe-tRNA at 37°, with default parameters. The first record displays the fragment size (76), ΔG and sequence name. The ith subsequent record contains, in order, i, ri, the index of the 5′-connecting base, the index of the 3′-connecting base, the index of the paired base and the historical numbering of the ith base in the original sequence. The 5′, 3′ and base pair indices are 0 when there is no connection or base pair.
Specifically, the 3rd, 4th, and 6th columns in the .ct format convey specific information; by design, they are not redundant with the information contained in the 1st column. Note that in the above ‘1msy’ example, the 6th column gives the nt sequence numbers (as in the PDB data file) instead of the serial numbers (as in the 1st column). The DSSR-produced .ct files also contain extra information after ‘#’, in the comma-separated key=value format.
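To make the column layout concrete, here is a minimal Python sketch that reads a .ct file into per-nucleotide records and rebuilds the dot-bracket string. It is not part of DSSR: the function names `parse_ct` and `ct_to_dbn` and the dictionary keys are hypothetical, the header is assumed to start with the nucleotide count, and only nested (pseudoknot-free) pairs are handled, since a single bracket type is used.

```python
def parse_ct(text):
    """Parse .ct text into a list of records (dicts), one per nucleotide."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])  # first field of the header is the nt count
    records = []
    for ln in lines[1:n + 1]:
        # Drop the trailing '# key=value' annotation that DSSR appends.
        data = ln.split("#", 1)[0].split()
        records.append({
            "index": int(data[0]),   # serial number, 1..n
            "base": data[1],         # one-letter base code
            "prev": int(data[2]),    # 5' neighbor, 0 if none
            "next": int(data[3]),    # 3' neighbor, 0 if none
            "pair": int(data[4]),    # paired nt, 0 if unpaired
            "number": data[5],       # original (e.g. PDB) numbering
        })
    return records


def ct_to_dbn(records):
    """Build a dot-bracket string; assumes nested pairs only."""
    symbols = []
    for rec in records:
        if rec["pair"] == 0:
            symbols.append(".")
        elif rec["pair"] > rec["index"]:
            symbols.append("(")  # partner lies downstream: opening bracket
        else:
            symbols.append(")")  # partner lies upstream: closing bracket
    return "".join(symbols)
```

Applied to the 1msy listing above, `ct_to_dbn(parse_ct(text))` should give back the same bracket string as in dssr-2ndstrs.dbn, which is a quick way to check that the two files agree.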
As an example of the usefulness of the 3rd and 4th columns, have a look at the DSSR-derived .ct file for the Dickerson DNA dodecamer duplex with sequence CGCGAATTCGCG:
24 DSSR-derived secondary structure in '355d'
1 C 0 2 24 1 # name=A.DC1, pairedNt=B.DG24
2 G 1 3 23 2 # name=A.DG2, pairedNt=B.DC23
3 C 2 4 22 3 # name=A.DC3, pairedNt=B.DG22
4 G 3 5 21 4 # name=A.DG4, pairedNt=B.DC21
5 A 4 6 20 5 # name=A.DA5, pairedNt=B.DT20
6 A 5 7 19 6 # name=A.DA6, pairedNt=B.DT19
7 T 6 8 18 7 # name=A.DT7, pairedNt=B.DA18
8 T 7 9 17 8 # name=A.DT8, pairedNt=B.DA17
9 C 8 10 16 9 # name=A.DC9, pairedNt=B.DG16
10 G 9 11 15 10 # name=A.DG10, pairedNt=B.DC15
11 C 10 12 14 11 # name=A.DC11, pairedNt=B.DG14
12 G 11 0 13 12 # name=A.DG12, pairedNt=B.DC13
13 C 0 14 12 13 # name=B.DC13, pairedNt=A.DG12
14 G 13 15 11 14 # name=B.DG14, pairedNt=A.DC11
15 C 14 16 10 15 # name=B.DC15, pairedNt=A.DG10
16 G 15 17 9 16 # name=B.DG16, pairedNt=A.DC9
17 A 16 18 8 17 # name=B.DA17, pairedNt=A.DT8
18 A 17 19 7 18 # name=B.DA18, pairedNt=A.DT7
19 T 18 20 6 19 # name=B.DT19, pairedNt=A.DA6
20 T 19 21 5 20 # name=B.DT20, pairedNt=A.DA5
21 C 20 22 4 21 # name=B.DC21, pairedNt=A.DG4
22 G 21 23 3 22 # name=B.DG22, pairedNt=A.DC3
23 C 22 24 2 23 # name=B.DC23, pairedNt=A.DG2
24 G 23 0 1 24 # name=B.DG24, pairedNt=A.DC1
Note the 0 in the 4th column for A.DG12, which is at the 3′ end of chain A, and the 0 in the 3rd column for B.DC13, which is at the 5′ end of chain B.
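The chain break encoded by those zeros can be recovered programmatically. The following Python sketch (the function name `split_strands` is hypothetical, not a DSSR feature) groups the serial numbers into strands by closing a strand whenever the 4th column is 0. It assumes records are listed in 5′ to 3′ order within each strand, as in the listing above.

```python
def split_strands(ct_text):
    """Return a list of strands, each a list of 1-based nt serial numbers."""
    rows = [ln.split("#", 1)[0].split()
            for ln in ct_text.splitlines() if ln.strip()]
    strands, current = [], []
    for fields in rows[1:]:  # rows[0] is the header line
        index, next_idx = int(fields[0]), int(fields[3])
        current.append(index)
        if next_idx == 0:  # no 3' neighbor: this nt ends a strand
            strands.append(current)
            current = []
    if current:  # e.g. a circular molecule with no 0 in the 4th column
        strands.append(current)
    return strands
```

For the 355d file this would yield two strands, serial numbers 1 to 12 and 13 to 24, matching chains A and B.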
From early on, 3DNA calculates the Zp parameter to separate A- and B-DNA double helical steps. First introduced in the paper A-form conformational motifs in ligand-bound DNA structures (see figure below), Zp is the mean projection of the two phosphorus atoms onto the z-axis of the dimer ‘middle frame’. Zp is greater than 1.5 Å for A-DNA, and it is less than 0.5 Å for B-DNA. As noted in the 3DNA NAR paper, other parameters such as slide should also be examined to confirm conformational assignments based on Zp.
As of v2.1, 3DNA has introduced the single-stranded variant for the Zp parameter (ssZp) as a more robust substitute for the Richardson phosphorus-glycosidic bond distance parameter (Dp) to characterize sugar puckers. See post Sugar pucker correlates with phosphorus-base distance for more details. In 3DNA/DSSR, ssZp is defined as the z-coordinate of the 3′ phosphorus atom expressed in the standard reference frame of the preceding base; it is positive when phosphorus lies on the +z-axis side (base in anti conformation) and negative if phosphorus is on the –z-axis side (base in syn conformation). Note that by definition, Dp should always be positive.
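The definition amounts to a single projection: subtract the frame origin from the phosphorus position and take the component along the frame's z-axis. A minimal Python sketch follows, assuming the base reference frame is available as an origin and a z-axis vector in the same coordinate system as the phosphorus atom. The function and argument names are illustrative and are not taken from the 3DNA source code.

```python
def sszp(p_atom, origin, z_axis):
    """z-coordinate of the 3' phosphorus in the base reference frame.

    p_atom, origin, z_axis: (x, y, z) tuples in the same coordinate system.
    The z-axis is normalized here, so it need not be a unit vector.
    """
    norm = sum(c * c for c in z_axis) ** 0.5
    dot = sum((p - o) * z for p, o, z in zip(p_atom, origin, z_axis))
    return dot / norm
```

When the structure has already been reoriented into the base reference frame, the origin is (0, 0, 0) and the z-axis is (0, 0, 1), so ssZp is simply the z-coordinate of the phosphorus atom.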
As in the previous post, here I am using G175 and U176 of PDB entry 1jj2 (the large ribosomal subunit of Haloarcula marismortui) as examples to illustrate how the ssZp parameters are calculated. The GpU forms a dinucleotide platform, where the sugar of G175 adopts a C2′-endo conformation, and that of U176 C3′-endo. For verification, here is the PDB data file for fragment 1jj2-G175-U176-A177.pdb (note A177 is included for its phosphorus atom). Run the following 3DNA commands:
find_pair -s 1jj2-G175-U176-A177.pdb stdout
frame_mol -1 ref_frames.dat 1jj2-G175-U176-A177.pdb ref-G175.pdb
frame_mol -2 ref_frames.dat 1jj2-G175-U176-A177.pdb ref-U176.pdb
File ref-G175.pdb contains the following line:
ATOM 24 P U 0 176 -5.624 6.937 1.918 1.00 24.19 P
The z-coordinate of the phosphorus atom of U176 (which is 3′ to G175) is 1.918 Å, which is the ssZp for G175. It is less than 2.9 Å, corresponding to the C2′-endo sugar conformation of G175.
Similarly, file ref-U176.pdb contains the following line:
ATOM 44 P A 0 177 -3.841 6.592 4.377 1.00 25.91 P
So the ssZp for U176 is 4.377, which is greater than 2.9 Å, corresponding to the C3′-endo sugar conformation of U176.
To sum up, the double-stranded Zp as originally available from 3DNA can be used for discriminating A- and B-DNA double-helical steps: Zp > 1.5 Å for A-DNA, and Zp < 0.5 Å for B-DNA. The newly introduced single-stranded Zp is intended for characterizing sugar puckers: ssZp > 2.9 Å for C3′-endo, and ssZp < 2.9 Å for C2′-endo. Since A-DNA has predominantly C3′-endo sugar conformation and B-DNA has C2′-endo sugar, the ssZp parameter would be helpful in classifying a dinucleotide into A- or B-like conformation. A survey of ssZp in well-defined A- and B-DNA structures (as performed for double-stranded Zp) should prove useful.
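The two sets of thresholds can be written down as a pair of small Python functions. This is only a sketch of the cutoffs quoted above: the function names are hypothetical, the "intermediate" label for Zp values between 0.5 and 1.5 Å is my own placeholder, and a value of exactly 2.9 Å (which the text does not assign) is arbitrarily treated as C2′-endo.

```python
def classify_step(zp):
    """Classify a double-helical step from the double-stranded Zp (in Å)."""
    if zp > 1.5:
        return "A-DNA"
    if zp < 0.5:
        return "B-DNA"
    return "intermediate"  # placeholder label; check slide etc. as well


def classify_pucker(sszp):
    """Classify a sugar pucker from the single-stranded Zp (in Å)."""
    # Exactly 2.9 is not covered by the stated rule; treated as C2'-endo here.
    return "C3'-endo" if sszp > 2.9 else "C2'-endo"
```

With the values from the 1jj2 example, `classify_pucker(1.918)` gives C2′-endo for G175 and `classify_pucker(4.377)` gives C3′-endo for U176.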
Realizing the naming confusion between double-stranded Zp and single-stranded Zp, I am considering renaming single-stranded Zp as ssZp in future releases of 3DNA and DSSR. Do you have any comments or suggestions? Please let me know by leaving a comment!
Recently I was surprised by some cases of nucleotides with missing atoms in PDB entry 1pns. The story started like this: 3DNA/DSSR maps various nucleotide names to one-letter codes, based on the data file baselist.dat (see post Modified nucleotides in the PDB). In the meantime, 3DNA/DSSR internally assigns a nucleotide as either purine or pyrimidine, by virtue of coordinates of base atoms. By definition, purines should only include A/a/G/g/I/i, and pyrimidines C/c/T/t/U/u/P/p. However, no consistency check has been implemented in DSSR until just now.
I first noticed the inconsistency between residue name and atom coordinates for nucleotide A6 on chain U (hereafter referred to as U.A6) in 1pns. The nucleotide has standard name ‘ A’, obviously a purine. However, somehow DSSR classified it as a pyrimidine based on atomic coordinates. Upon further check of the PDB data file, I found the following remarks:
REMARK 470 MISSING ATOM
REMARK 470 THE FOLLOWING RESIDUES HAVE MISSING ATOMS(M=MODEL NUMBER;
REMARK 470 RES=RESIDUE NAME; C=CHAIN IDENTIFIER; SSEQ=SEQUENCE NUMBER;
REMARK 470 I=INSERTION CODE):
REMARK 470 M RES CSSEQI ATOMS
REMARK 470 A U 6 N9 C8 N7
REMARK 470 G U 8 N9 C8 N7
REMARK 470 A U 12 N9 C8 N7
REMARK 470 A U 13 N9 C8 N7
REMARK 470 A U 14 N9 C8 N7
The atomic coordinates for U.A6 are as below:
ATOM 34447 P A U 6 81.861 37.210 78.651 1.00378.87 P
ATOM 34448 OP1 A U 6 80.631 37.121 77.831 1.00378.87 O
ATOM 34449 OP2 A U 6 81.665 37.221 80.119 1.00378.87 O
ATOM 34450 O5' A U 6 82.707 38.495 78.212 1.00378.87 O
ATOM 34451 C5' A U 6 83.948 38.777 78.887 1.00378.87 C
ATOM 34452 C4' A U 6 84.600 40.000 78.276 1.00378.87 C
ATOM 34453 O4' A U 6 84.975 39.698 76.901 1.00378.87 O
ATOM 34454 C3' A U 6 83.714 41.239 78.153 1.00378.87 C
ATOM 34455 O3' A U 6 83.654 41.968 79.369 1.00378.87 O
ATOM 34456 C2' A U 6 84.403 42.015 77.020 1.00378.87 C
ATOM 34457 O2' A U 6 85.564 42.655 77.474 1.00378.87 O
ATOM 34458 C1' A U 6 84.834 40.864 76.105 1.00378.87 C
ATOM 34459 C5 A U 6 82.033 39.296 74.209 1.00378.87 C
ATOM 34460 C6 A U 6 82.941 39.553 75.166 1.00378.87 C
ATOM 34461 N6 A U 6 81.170 39.949 72.090 1.00378.87 N
ATOM 34462 N1 A U 6 83.830 40.588 75.041 1.00378.87 N
ATOM 34463 C2 A U 6 83.843 41.410 73.939 1.00378.87 C
ATOM 34464 N3 A U 6 82.899 41.124 72.974 1.00378.87 N
ATOM 34465 C4 A U 6 81.968 40.108 73.016 1.00378.87 C
No atom records for N7, C8 and N9. So far, so good. However, surprise came when I visualized U.A6 in Jmol, as shown in the following image. Note here atom N1 is connected to C1’ as in pyrimidines, and N6 is bonded to C4!
The same issue also exists for U.G8 (see figure below), U.A12, U.A13, and U.A14.
It is beyond my imagination to understand why such weird cases exist in the PDB, even given the lousy resolution (8.7 Å) of 1pns.
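For readers who want to screen their own files for similar cases, the idea of the consistency check can be sketched in Python. Note the assumption: DSSR classifies purines and pyrimidines from the coordinates of base atoms, whereas this simplified stand-in only looks at which atom names are present (N7, C8, and N9 being specific to the purine five-membered ring). All names below are hypothetical.

```python
PURINES = set("AaGgIi")
PYRIMIDINES = set("CcTtUuPp")
PURINE_ONLY_ATOMS = {"N7", "C8", "N9"}  # the imidazole-ring atoms of purines


def class_from_code(code):
    """Base class implied by the one-letter code (from the name mapping)."""
    if code in PURINES:
        return "purine"
    if code in PYRIMIDINES:
        return "pyrimidine"
    return "unknown"


def class_from_atoms(atom_names):
    """Base class implied by the atoms present (name-based approximation)."""
    if PURINE_ONLY_ATOMS <= set(atom_names):
        return "purine"
    return "pyrimidine"


def is_consistent(code, atom_names):
    """True when the residue name and the atoms present agree."""
    return class_from_code(code) == class_from_atoms(atom_names)
```

For U.A6 in 1pns, the code is A but the base atoms are only C5, C6, N6, N1, C2, N3, and C4, so `is_consistent` returns False and the residue would be flagged.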
I recently upgraded my Macs to OS X Mavericks to check if 3DNA/DSSR works in the new operating system. I am glad to report that both run without a hitch, as expected.
Since OS X Mavericks is free from the Mac App Store, it will quickly become the de facto version virtually all Mac users would use. I also noticed that Ruby on Mavericks has been upgraded to ruby 2.0.0p247 (2013-06-27 revision 41674), a major step forward from the now retiring Ruby 1.8.7 distributed in previous versions of Mac OS X.
As a rule, I’d ensure that 3DNA/DSSR executes properly in major releases of the commonly used operating systems — Mac, Windows, and Linux.
Although I have not used DOS for ages, I am glad to find that the DSSR version compiled for MinGW/MSYS on Windows works perfectly under this operating system (see screenshot below). The DSSR DOS command-line interface functions exactly the same as for Linux, Mac OS X, MinGW/MSYS, and Cygwin. Among other possible usages, it allows for batch files to take advantage of DSSR.
Implementing DSSR in strict ANSI C as a self-contained and zero-dependent command-line program pays off enormously: it simplifies code maintenance and ensures that the program is applicable wherever a C compiler exists. The easy web interface to DSSR makes the program universally accessible.
Aside from its extensive functionality for RNA structural analyses, DSSR also introduces a consistent and flexible way to process command-line options. Here, each option can be specified via a --key[=value] pair (or -key[=value] or key[=value]; i.e., two/one/zero preceding dashes are all accepted), key can be in either lower, UPPER or MiXed case, and value is optional for Boolean switches. Furthermore, options can be put in any order; if the same key is repeated more than once, the value specified last overwrites corresponding previous settings.
As always, the rules are best illustrated with concrete examples. Some typical use-cases are given below:
#1 analyze PDB entry '1msy', with default output to stdout
x3dna-dssr --input=1msy.pdb
#2 same as #1, with output directed to file '1msy.out'
x3dna-dssr --input=1msy.pdb --output=1msy.out
#3-6, same as #2
x3dna-dssr --output=1msy.out --input=1msy.pdb
x3dna-dssr --OUTPUT=1msy.out --Input=1msy.pdb
x3dna-dssr -output=1msy.out input=1msy.pdb
x3dna-dssr output=1msy.out --input=1msy.pdb
#7 the value '1ehz.pdb' overwrites '1msy.pdb'
x3dna-dssr --input=1msy.pdb input=1ehz.pdb
#8-12 with the switch --more set to true
x3dna-dssr -input=1msy.pdb --more
x3dna-dssr -input=1msy.pdb --more=true
x3dna-dssr -input=1msy.pdb --more=yes
x3dna-dssr -input=1msy.pdb --more=on
x3dna-dssr -input=1msy.pdb --more=1
#13 same as without specifying --more,
# or with values set to false/no/0
x3dna-dssr -input=1msy.pdb --more=off
#14 shorthand forms for --input and --output
x3dna-dssr -i=1msy.pdb -o=1msy.out
#15 it can also be more verbose
x3dna-dssr --input-pdb-file=1msy.pdb
#16-18 within a key, separator dash(-) and underscore (_)
# are treated the same, and can be omitted
x3dna-dssr -i=1msy.pdb -non-pair
x3dna-dssr -i=1msy.pdb -non_pair
x3dna-dssr -i=1msy.pdb -nonpair
By allowing for 2/1/0 dashes to precede each key and a dash/underscore character or none to separate words within the key, DSSR provides users with great flexibility in specifying command-line options to fit into their preferred styles. Not surprisingly, new programs to be added into 3DNA, as well as the version 3 release of the software, will all follow the same convention.
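The normalization rules are easy to mimic. Below is a minimal Python sketch of them; it is not DSSR's actual ANSI C implementation, and the names `normalize_key` and `parse_options` are hypothetical. It covers leading dashes, case folding, dash/underscore removal within a key, Boolean values, and last-one-wins semantics. It does not attempt the shorthand and verbose key forms of examples #14 and #15, and it would misread a literal value such as `1` as a Boolean.

```python
TRUE_WORDS = {"true", "yes", "on", "1"}
FALSE_WORDS = {"false", "no", "off", "0"}


def normalize_key(raw):
    """Strip leading dashes, drop '-'/'_' separators, fold to lower case."""
    return raw.lstrip("-").replace("-", "").replace("_", "").lower()


def parse_options(args):
    """Turn a list of key[=value] arguments into a dict of settings."""
    options = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        key = normalize_key(key)
        if not sep:
            options[key] = True  # bare switch, e.g. --more
        elif value.lower() in TRUE_WORDS:
            options[key] = True
        elif value.lower() in FALSE_WORDS:
            options[key] = False
        else:
            options[key] = value  # a later duplicate key overwrites this
    return options
```

Under these rules, `--OUTPUT=1msy.out`, `-output=1msy.out`, and `output=1msy.out` all end up as the same `output` entry, and `-non-pair`, `-non_pair`, and `-nonpair` all set `nonpair` to true.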
In addition to the five canonical bases (A, C, G, T, and U), nucleic acid structures in the PDB contain numerous modified variants (natural or engineered) in the nucleobase, sugar, or the phosphate. For instance, the 76-nt (nucleotide) long yeast phenylalanine tRNA (1ehz) contains 14 modified bases: 2MG10, H2U16, H2U17, M2G26, OMC32, OMG34, YYG37, PSU39, 5MC40, 7MG46, 5MC49, 5MU54, PSU55, and 1MA58. Among these, the most prevalent and best-known example is pseudouridine. Note that in the PDB, each residue (including modified nt) is named with an up to three-letter identifier, e.g., PSU for pseudouridine. For a comprehensive list (with chemical and structural information) of small molecules, including modified nts, please refer to the Ligand Expo website hosted by the RCSB PDB.
Given the widespread occurrences of modified bases in nucleic acid structures, any practical structural bioinformatics software should be able to treat them effectively, as with the canonical bases. In 3DNA, from the very beginning, modified bases are mapped to standard counterparts, e.g. 5‐iodouracil (5IU) to uracil (U) and 1‐methyladenine (1MA) to adenine (A), allowing for easy analysis of unusual DNA and RNA structures (see the NAR03 reference). Specifically, in the 3DNA distribution the file baselist.dat contains the mappings explicitly.
As of v2.1, 3DNA automatically maps a new modified base not available in the file baselist.dat. Yet, I have continuously updated the list in line with new DNA/RNA entries released by the PDB. The process is automated with a Ruby script which calls find_pair -s on each nucleic-acid-containing structure to output unknown bases. As an extreme, the baselist.dat file below comprises only canonical bases:
A A
C C
G G
T T
U U
DA A
DC C
DG G
DT T
With the above minimum mapping list, running the command find_pair -s on 1ehz.pdb identifies all the 14 modified bases. A sample case for 2MG is shown below:
Match '2MG' to 'g' for residue 2MG 10 on chain A [#10]
check it & consider to add line '2MG g' to file <baselist.dat>
By parsing the output of a batch run on all DNA/RNA-containing entries in the PDB as of October 18, 2013, I identified a total of 596 modified bases. The top portion is as below:
02I a
08Q c
08T a
0AD g
0C c
0DC c
0DG g
0DT t
0G g
0KL u
0KX c
0KZ t
An explicit list of base mapping makes the correspondence transparent, and helps avoid ambiguous cases as to which canonical base a modified nt matches to. DSSR uses the same list internally. Hopefully, the information would also be useful to other related projects.
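As an illustration of how such a list can be assembled, here is a small Python sketch that extracts name-to-code pairs from messages of the form shown above for 2MG. My own pipeline uses a Ruby script; this Python version is only an analogous sketch, the function names are hypothetical, and the regular expression assumes the message wording stays exactly as in the sample.

```python
import re

# Matches e.g.: Match '2MG' to 'g' for residue 2MG 10 on chain A [#10]
MATCH_RE = re.compile(r"Match '([^']+)' to '([^'])'")


def collect_mappings(output_text):
    """Collect {residue name: one-letter code} from find_pair messages."""
    mappings = {}
    for name, code in MATCH_RE.findall(output_text):
        mappings[name.strip()] = code  # repeated names collapse to one entry
    return mappings


def to_baselist_lines(mappings):
    """Format the mappings as 'NAME code' lines, sorted by residue name."""
    return [f"{name} {code}" for name, code in sorted(mappings.items())]
```

Running `collect_mappings` over the concatenated output of a batch run and passing the result to `to_baselist_lines` gives lines in the same two-column layout as the excerpt above.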
Recently I was a bit surprised to find that the methyl group is named differently in the PDB: C7 in DT8 (thymine) of B-DNA 355d, CM5 in 5MC40 (5-methylated C) of tRNA 1ehz, and C5M in 5MU54 (5-methylated U, i.e., T) of the same tRNA 1ehz. See the three figures below for details.
I know that the previously named C5M of thymine in DNA has been renamed C7 as a result of the 2007 remediation effort (PDB v3). However, browsing through the wwPDB Remediation website and reading carefully the article Remediation of the protein data bank archive, I failed to see explanations of the obvious inconsistency of CM5 (5MC40) vs C5M (5MU54) in the nomenclature of the 5-methyl group in the same tRNA entry 1ehz, except for the following note:
As with the Chemical Component Dictionary, names for standard amino acids and nucleotides follow IUPAC recommendations (10) with the exception of the well-established convention for C-terminal atoms OXT and HXT. These nomenclature changes have been applied to standard polymeric chemical components only.
5-methyl is named C7 in DT8 of the DNA entry 355d
5-methyl is named CM5 in 5MC40 of the RNA entry 1ehz
5-methyl is named C5M in 5MU54 of the RNA entry 1ehz
Am I missing something obvious? If you have any further information, please leave a comment. Whatever the case, it helps (at least won’t hurt) to know the naming discrepancy for those who care about the small methyl group in nucleic acid structures.
Recently, I upgraded my local ViennaRNA package installation from v2.0.7 to v2.1.3 on my Mac. Following Quickstart in the INSTALL file, I ran ./configure successfully, but make aborted with error messages. Since I previously had a working copy of the software, the problem had to be a configuration issue when compiling this new version. After a few iterations of checking the error messages and reading through the INSTALL file, I came up with the following settings:
./configure --disable-openmp --without-perl
make
sudo make install
Apart from some warning messages, the above make command ran successfully.
This post serves mainly as a note for my own reference. Hopefully, the information may prove useful to others who try to install the versatile ViennaRNA package on a Mac OS X machine.