A video overview of DSSR
DSSR (Dissecting the Spatial Structure of RNA) is an integrated software tool for the analysis/annotation, model building, and schematic visualization of 3D nucleic acid structures (see the figures below and the video overview). It is built upon the well-known, tested, and trusted 3DNA suite of programs. DSSR has been made possible by the developer’s extensive user-support experience, detail-oriented software engineering skills, and expert domain knowledge accumulated over two decades. It streamlines tasks in RNA/DNA structural bioinformatics, and outperforms its ‘competitors’ by far in terms of functionality, usability, and support.
Wide citations. DSSR has been widely cited in scientific literature, including: (i) “Selective small-molecule inhibition of an RNA structural element” (Nature, 2015; Merck Research Laboratories), (ii) “The structure of the yeast mitochondrial ribosome” (Science, 2017), (iii) “RNA force field with accuracy comparable to state-of-the-art protein force fields” (PNAS, 2018; D. E. Shaw Research), (iv) “Predicting site-binding modes of ions and water to nucleic acids using molecular solvation theory” (JACS, 2019), (v) “RIC-seq for global in situ profiling of RNA-RNA spatial interactions” (Nature, 2020), and (vi) “DNA mismatches reveal conformational penalties in protein-DNA recognition” (Nature, 2020).
Broad integrations. To make DSSR as widely accessible as possible, I have initiated collaborations with the principal developers of Jmol and PyMOL. The DSSR-Jmol and DSSR-PyMOL integrations bring unparalleled search capabilities (e.g., ‘select junctions’ for all multi-branch loops) and innovative visualization styles into 3D nucleic acid structures. DSSR has also been adopted into numerous other structural bioinformatics resources, including: (i) URS, (ii) RiboSketch, (iii) RNApdbee, (iv) forgi, (v) RNAvista, (vi) VeriNA3d, (vii) RNAMake, (viii) ElTetrado, (ix) DNAproDB, (x) LocalSTAR3D, (xi) IPANEMAP, and (xii) RNANet.
Advanced features. DSSR may be licensed from Columbia University. DSSR Pro is the commercial version. It has more functionalities than DSSR basic (the free academic version), including: (i) homology modeling via in silico base mutations, a feature employed by Merck scientists, (ii) easy generation of regular helical models, including circular or super-helical DNA (see figures below), (iii) creation of customized structures with user-specified base sequences and rigid-body parameters, (iv) efficient processing of molecular dynamics (MD) trajectories, (v) detailed characterization of DNA-protein or RNA-protein spatial interactions, and (vi) template-based modeling of DNA-protein complexes (see figures below). DSSR Pro supersedes 3DNA. It integrates the disparate analysis and modeling programs of 3DNA under one umbrella, and offers new advanced features, through a convenient interface. For example, with the mutate module of DSSR Pro, one can automatically perform the following tasks: (i) mutate all bases to Us, (ii) mutate bases in hairpin loops to Gs, and (iii) mutate G–C Watson-Crick pairs to C–G, and A–U to U–A. Moreover, DSSR Pro includes an in-depth user manual and one-year technical support from the developer.
Quality control. DSSR is a solid software product that excels in RNA structural bioinformatics. It is written in strict ANSI C, as a single command-line program. It is self-contained, with zero runtime dependencies on third-party libraries. The binary executables for macOS, Linux, and Windows are just ~2MB. DSSR has been extensively tested using all nucleic-acid-containing structures in the PDB. It is also routinely checked with Valgrind to avoid memory leaks. DSSR requires no set up or configuration: it simply works.
Theoretical models of G-quadruplexes, created using DSSR Pro.
Template-based modeling of DNA-protein complexes using DSSR Pro.
Here are two chromatin-like models using PDB entry 4xzq as the template.
Circular DNA duplexes modeled using DSSR Pro.
DNA super helices modeled using DSSR Pro.
Innovative cartoon-block schematics enabled by the DSSR-PyMOL integration for six representative PDB entries. Watson-Crick pairs are shown as long blocks with minor-groove edges in black (A, B), G-tetrads represented as square blocks and the metal ion as sphere ©, the ligand rendered as balls-and-sticks (D), and proteins depicted as purple cartoons (E, F). Color code for base blocks: A, red; C, yellow; G, green; T, blue; U, cyan; G-tetrad, green; WC-pairs, per base in the leading strand. Visit http://skmatic.x3dna.org.
Recommended in Faculty Opinions: “simple and effective”, “Good for Teaching”.
Employed by the NDB to create cover images of the RNA Journal.
Recently, a 3DNA user asked on the Forum about how to perform mutations to 3-methyladenine. The user reported that the procedure described in the FAQ entry How can I mutate cytosine to 5-methylcytosine did not work for the case of 3-methyladenine. This ‘limitation’ is easily understandable: the 3DNA mutate_bases
program must have knowledge of the target base, 3-methyladenine, to perform the mutation properly. The program works for the most common 5-methylcytosine mutations since the corresponding 5MC file (Atomic_5MC.pdb
, in the standard base-reference frame) is already included within the 3DNA distribution. By supplying a similar file for the target base, mutate_bases
runs the same for mutations to 5-methylcytosine (or other bases). This blog post outlines the procedure, using 3-methyladenine as an example.
A ligand name search for 5-methylcytosine on the RCSB PDB led to only two matched entries: 2X6F and 3MAG. The ligand id is 3MA. Since 3MAG has a better resolution (1.8 Å) than 2X6F (3.3 Å), its 3MA ligand was extracted from the corresponding PDB file (3MAG.pdb
). The atomic coordinates, excluding those for the two hydrogens, are as below. Note that the 3-methyl carbon atom is named CN3
.
HETATM 2960 N9 3MA A 600 16.587 14.258 22.170 1.00 49.87 N
HETATM 2961 C4 3MA A 600 17.123 13.100 21.622 1.00 50.46 C
HETATM 2962 N3 3MA A 600 16.877 11.811 22.009 1.00 50.37 N
HETATM 2963 CN3 3MA A 600 15.983 11.363 23.063 1.00 50.41 C
HETATM 2964 C2 3MA A 600 17.590 10.968 21.241 1.00 50.11 C
HETATM 2965 N1 3MA A 600 18.422 11.217 20.224 1.00 49.27 N
HETATM 2966 C6 3MA A 600 18.627 12.484 19.858 1.00 48.99 C
HETATM 2967 N6 3MA A 600 19.426 12.709 18.829 1.00 46.12 N
HETATM 2968 C5 3MA A 600 17.949 13.503 20.593 1.00 49.89 C
HETATM 2969 N7 3MA A 600 17.929 14.900 20.488 1.00 49.84 N
HETATM 2970 C8 3MA A 600 17.113 15.286 21.434 1.00 49.58 C
After running the 3DNA utility program std_base
with options -fit -A
, the corresponding atomic coordinates of 3MA are transformed to the standard base reference frame of adenine. The file must be named Atomic_3MA.pdb
, and it has the following contents:
HETATM 1 N9 3MA A 1 -1.287 4.521 0.006 1.00 49.87 N
HETATM 2 C4 3MA A 1 -1.262 3.133 0.004 1.00 50.46 C
HETATM 3 N3 3MA A 1 -2.337 2.286 -0.009 1.00 50.37 N
HETATM 4 CN3 3MA A 1 -3.743 2.648 -0.047 1.00 50.41 C
HETATM 5 C2 3MA A 1 -1.905 1.013 0.001 1.00 50.11 C
HETATM 6 N1 3MA A 1 -0.662 0.520 0.004 1.00 49.27 N
HETATM 7 C6 3MA A 1 0.366 1.372 -0.003 1.00 48.99 C
HETATM 8 N6 3MA A 1 1.588 0.867 -0.034 1.00 46.12 N
HETATM 9 C5 3MA A 1 0.068 2.768 0.003 1.00 49.89 C
HETATM 10 N7 3MA A 1 0.875 3.914 -0.003 1.00 49.84 N
HETATM 11 C8 3MA A 1 0.026 4.909 -0.003 1.00 49.58 C
Note that in file Atomic_3MA.pdb
, (1) the z-coordinates of the base atoms are close to zeros, (2) the ordering of atoms is as in the original ligand of 3MA shown above.
With Atomic_3MA.pdb
in place (in the current working directory, or the $X3DNA/config
folder), one can perform 3-methyladenine mutations using mutate_bases
. For illustration purpose, let’s generate a B-form DNA with base sequence GACATGATTGCC using the 3DNA fiber
program:
fiber -seq=GACATGATTGCC fiber-BDNA.pdb
To mutate A7 to 3MA, one needs to run mutate_bases
as following:
mutate_bases "chain=A s=7 m=3MA" fiber-BDNA.pdb fiber-BDNA-A7to3MA.pdb
The result of the mutation is shown in the figure below. Note that the backbone has identical geometry as that before the mutation, and the mutated 3MA-T pair has exactly the same parameters (propeller/buckle etc) as the original A-T. These are the two defining features of the 3DNA mutate_bases
program.
Please see the thread mutations to 3-methyladenine on the 3DNA Forum to download files fiber-BDNA.pdb
and fiber-BDNA-A7to3MA.pdb
.
Over the past couple of months, I’ve further enhanced the DSSR-derived structural features for Q-quadruplexes (G4). One was the implementation of the single descriptor of intramolecular canonical G4 structures with three connecting loops recently proposed by Dvorkin et al. The descriptor contains the number of guanines in the G4 stem, the type and relative direction of loops linking G-tracts of the stem, and the groove-widths associated with lateral loops. For example, PDB entry 2GKU (see the DSSR-enabled PyMOL schematic image below, Fig. 1A) has the following DSSR output.
List of 1 G4-stem
Note: a G4-stem is defined as a G4-helix with backbone connectivity.
Bulges are also allowed along each of the four strands.
stem#1[#1] layers=3 INTRA-molecular loops=3 descriptor=3(-P-Lw-Ln) note=hybrid-1(3+1) UUDU anti-parallel
1 glyco-bond=ss-s groove=-wn- mm(<>,outward) area=14.24 rise=3.58 twist=16.8 nts=4 GGGG A.DG3,A.DG9,A.DG17,A.DG21
2 glyco-bond=--s- groove=-wn- pm(>>,forward) area=13.12 rise=3.71 twist=25.9 nts=4 GGGG A.DG4,A.DG10,A.DG16,A.DG22
3 glyco-bond=--s- groove=-wn- nts=4 GGGG A.DG5,A.DG11,A.DG15,A.DG23
strand#1 U DNA glyco-bond=s-- nts=3 GGG A.DG3,A.DG4,A.DG5
strand#2 U DNA glyco-bond=s-- nts=3 GGG A.DG9,A.DG10,A.DG11
strand#3 D DNA glyco-bond=-ss nts=3 GGG A.DG17,A.DG16,A.DG15
strand#4 U DNA glyco-bond=s-- nts=3 GGG A.DG21,A.DG22,A.DG23
loop#1 type=propeller strands=[#1,#2] nts=3 TTA A.DT6,A.DT7,A.DA8
loop#2 type=lateral strands=[#2,#3] nts=3 TTA A.DT12,A.DT13,A.DA14
loop#3 type=lateral strands=[#3,#4] nts=3 TTA A.DT18,A.DT19,A.DA20
The descriptor=3(-P-Lw-Ln) means that the G4 structure has three layers of G-tetrads, connected via three loops: the first is the Propeller loop in anti-clockwise (negative) direction, then the Lateral loop passing a wide groove anti-clockwise, and finally another Lateral loop passing a narrow groove, also anti-clockwise. The DSSR symbols follow those of Dvorkin et al. but with capital letters L, P, and D for lateral, propeller, and diagonal loops instead of lower case letters (l, p, d) to avoid using subscript for groove-width info. So the 2GKU descriptor 3(-P-Lw-Ln) from DSSR corresponds to 3(-p-lw-ln) of Dvorkin et al.
The DSSR-enabled, PyMOL-rendered, block image in Fig. 1A makes the three G-tetrad layers (squared green blocks) immediately obvious. Other base identities and stacking interactions also become clear — for example, the A24 (in red) stacks on the top G-tetrad, and T1-A20 pair stacks with the bottom G-tetrad.
Two other PDB entries (2LOD and 2KOW) are illustrated in Fig. 1B and Fig. 1C. They have different topologies than 2GKU (Fig. 1A). DSSR is able to characterize all of them consistently.
Figure 1. DSSR-enabled, PyMOL-rendered, block images of five G-quadruplexes. A in red, C in yellow, G (and G-tetrad) in green, and T in blue.
Another G4-related new feature in DSSR is the detection of V-shaped loops in noncanonical G4 structures where one of the four G-G columns (strands) that link adjacent G-tetrads is broken. Two of recent PDB examples with V-loops are shown in Fig. 1D (5ZEV) and Fig. 1E (6H1K). An excerpt of DSSR output for the PDB entry 6H1K is shown below.
List of 1 G4-helix
Note: a G4-helix is defined by stacking interactions of G4-tetrads, regardless
of backbone connectivity, and may contain more than one G4-stem.
helix#1[1] stems=[#1] layers=3 INTRA-molecular
1 glyco-bond=-sss groove=w--n mm(<>,outward) area=12.76 rise=3.47 twist=18.2 nts=4 GGGG A.DG2,A.DG19,A.DG15,A.DG26
2 glyco-bond=s--- groove=w--n pm(>>,forward) area=12.84 rise=3.07 twist=33.4 nts=4 GGGG A.DG1,A.DG20,A.DG16,A.DG27
3 glyco-bond=s--- groove=w--n nts=4 GGGG A.DG25,A.DG21,A.DG17,A.DG28
strand#1 DNA glyco-bond=-ss nts=3 GGG A.DG2,A.DG1,A.DG25
strand#2 DNA glyco-bond=s-- nts=3 GGG A.DG19,A.DG20,A.DG21
strand#3 DNA glyco-bond=s-- nts=3 GGG A.DG15,A.DG16,A.DG17
strand#4 DNA glyco-bond=s-- nts=3 GGG A.DG26,A.DG27,A.DG28
****************************************************************************
List of 1 G4-stem
Note: a G4-stem is defined as a G4-helix with backbone connectivity.
Bulges are also allowed along each of the four strands.
stem#1[#1] layers=2 INTRA-molecular loops=3 descriptor=2(D+PX) note=UD3(1+3) UDDD anti-parallel
1 glyco-bond=s--- groove=w--n mm(<>,outward) area=12.76 rise=3.47 twist=18.2 nts=4 GGGG A.DG1,A.DG20,A.DG16,A.DG27
2 glyco-bond=-sss groove=w--n nts=4 GGGG A.DG2,A.DG19,A.DG15,A.DG26
strand#1 U DNA glyco-bond=s- nts=2 GG A.DG1,A.DG2
strand#2 D DNA glyco-bond=-s nts=2 GG A.DG20,A.DG19
strand#3 D DNA glyco-bond=-s nts=2 GG A.DG16,A.DG15
strand#4 D DNA glyco-bond=-s nts=2 GG A.DG27,A.DG26
loop#1 type=diagonal strands=[#1,#3] nts=12 GAGGCGTGGCCT A.DG3,A.DA4,A.DG5,A.DG6,A.DC7,A.DG8,A.DT9,A.DG10,A.DG11,A.DC12,A.DC13,A.DT14
loop#2 type=propeller strands=[#3,#2] nts=2 GC A.DG17,A.DC18
loop#3 type=diag-prop strands=[#2,#4] nts=5 GACTG A.DG21,A.DA22,A.DC23,A.DT24,A.DG25
****************************************************************************
List of 2 non-stem G4 loops (INCLUDING the two terminal nts)
1 type=lateral helix=#1 nts=5 GACTG A.DG21,A.DA22,A.DC23,A.DT24,A.DG25
2 type=V-shaped helix=#1 nts=4 GGGG A.DG25,A.DG26,A.DG27,A.DG28
Note that here a new loop type (diag-prop
) and topology description symbol (X
) are introduced. In developing DSSR in general, and G4-related features in particular, I’ve always tried to follow conventions widely used by the community. Whereas inconsistency exists, I pick up the ones that are in line with other parts of DSSR. For unique DSSR features lacking outside references, I came up my own nomenclature. When DSSR becomes more widely used, it may serve to standardize G4 nomenclatures.
From early on, the --json
and --nmr
options in DSSR have provided a convenient means to analyze an ensemble of solution NMR structures in the standard PDB/mmCIF format, as those available from the Protein Data Bank (PDB). The usage is very simple, as shown below for the PDB entry 2lod. The parameters for each model can be easily parsed from the output JSON stream.
x3dna-dssr -i=2lod.pdb --nmr --json
A practical example of the DSSR JSON/NMR usage for the analysis of RNA backbone torsion angles can be found on the 3DNA Forum.
While not a practitioner of molecular dynamics (MD) simulations, I’ve regularly followed the relevant literature. I know of the popular tools such as MDanalysis, MDTraj, and CPPTRAJ that are dedicated to analyze trajectories of MD simulations. I understand the subtleties MD may have, and I’m also sure of the unique features DSSR has to offer. By design, I made the DSSR interface to MD straightforward, by simply following commonly-used standard data formats: the MODEL/ENDMDL delineated PDB (or the PDBx/mmCIF) format for input, and JSON for output. Overall, I had expected that DSSR would complement the dedicated tools (e.g., MDanalysis, MDTraj, and CPPTRAJ) for MD analysis.
Over the years, DSSR has gradually gained recognition in the MD field. At a meeting, I once heard of a user complaining that DSSR is too slow for the analysis of millions of frames of MD simulations. Yet, without access to a large MD dataset and direct collaborations from a user, the speed issue could not be pursued further. In my experience, I knew DSSR is fast enough for the analysis of NMR ensembles from the PDB. This situation has completely changed recently, after a user reported on the 3DNA Forum on the slowness of DSSR on MD analysis.
Do you have an idea why the backbone parameter for a nucleic acids are so much faster calculated with do_x3dna
than with DSSR? Analyzing a trajectory with 100k frames take for a native structure approx. 2 hours with do_x3dna. A native RNA structure with DSSR will take approx. 10 days (10k frames/day). I need to run DSSR, because my system contains an abasic site.
With the above and follow-up information provided, I was able to fix the DSSR algorithm for parsing MD trajectories, among other things. Now DSSR reads a trajectory sequentially frame-by-frame at constant speed. The same 100K frames takes 36 minutes to finish instead of 10 days, which is an increase of 10*24*60/36=400 times. This 100x speedup was later on verified when I tested DSSR on the 1000-structure trajectory the user supplied.
So as of v1.7.8-2018sep01, DSSR is quick enough for real-world applications on MD analysis. In the releases of DSSR afterwards, I’ve further polished the code and added some new features. All things considered, DSSR is bound to become more relevant in the active MD field in the years to come.
By the way, for those who do not like the --nmr
option, --md
or --ensemble
also works. These three alternatives are equivalent to DSSR internally.
As mentioned in the blog post Integrating DSSR into Jmol and PyMOL,
“The small size, zero configuration, extensive features, and robust performance make DSSR ideal to be integrated into other bioinformatics tools.” In addition to the DSSR-Jmol and DSSR-PyMOL integrations which I initiated and got personally involved, other bioinformatics resources are increasingly taking advantage of what DSSR has to offer. Here are a few examples:
Before aligning structures, STAR3D preprocesses PDB files with base-pairing annotation using either MC-Annotate (Gendron et al., 2001; Lemieux and Major, 2002) (for PDB inputs) or DSSR (Lu et al., 2015) (for PDB and mmCIF inputs) and pseudo-knot removal using RemovePseudoknots (Smit et al., 2008).
2014, RNApdbee: In order to facilitate a more comprehensive study, the webserver integrates the functionality of RNAView, MC-Annotate and 3DNA/DSSR, being the most common tools used for automated identification and classification of RNA base pairs.
2018, RNApdbee 2.0: Base pairs can be identified by 3DNA/DSSR (default) (4), RNAView (5), MC-Annotate (3) or newly added FR3D (15).
- The Universe of RNA Structures (URS) web-interface to the URS database (URSDB) makes extensive use of DSSR. For each analyzed structure (including PDB entries), the DSSR text output file (termed “DSSR-file”) is also available. Impressively, the maintainers of URS are quick with DSSR updates. The current version used by the URS website is DSSR v1.7.4-2018jan30.
Forty years after the yeast phenylalanine tRNA structure was solved, modified nucleotides should no longer be an issue for RNA structural analysis, especially for this classic molecule. Automatic processing of modified nucleotides is just one aspect of DSSR’s substantial set of features. Based on my understanding of the field, more structural bioinformatics resources/tools could benefit from DSSR. Simply put, if one’s project is related to 3D DNA or RNA structures, DSSR may be of certain help. It’s just a timing issue that DSSR would benefit a (much) larger community.
When visiting the RCSB PDB website today, I am please to notice that the PDB now contains “10015 Nucleic Acid Containing Structures”. Based on “Macromolecule Type” in “Advanced Search” of the RCSB PDB website, I observed the following information:
- The number of DNA-containing structures is
6,384
(reported in 2,997
papers), and the corresponding number for RNA-containing structures is 3,861
(associated with 2,012
publications).
- There are
4,570
structures containing both DNA and protein (potentially forming DNA-protein complexes), and 2,478
RNA-protein complexes.
- The smallest nucleic-acid-containg structures have only two nucleotides (e.g., 3rec), and largest ones are the ribosomes (and virus particles).
- The earliest released DNA structure from the PDB is 1zna (on March 18, 1981), a Z-DNA tetramer. The earliest RNA structure released is 4tna (on April 12, 1978), a refined structure of the yeast phenylalanine transfer RNA.
This landmark achievement is made possible by the world-wide scientific community through decades of efforts solving DNA/RNA 3D structures via experimental approaches (mainly solution NMR, x-ray crystal, and cryo-EM). These over 10K nucleic acid structures present both challenges and opportunities for the field of structural bioinformatics, especially for intricate RNA molecules. DSSR is an integrated software tool for dissecting the spatial structure of RNA. It is my effort in addressing the challenging issues for the analysis/annotation and visualization of RNA structures.
It is textbook knowledge that the Watson-Crick (WC) pairs are specific, forming only between A and T/U (A–T/U or T/U–A) or G and C (G–C or C–G). Furthermore, an A only forms one WC pair with a T, so is G vs. C. The widely used dot-bracket-notation (DBN) of DNA/RNA secondary structure depends crucially on this feature of specificity and uniqueness, by using matched parentheses to represent WC pairs, such as ((....))
for a GCGA (GNRA-type) tetra-loop of sequence GCGCGAGC
.
The reality is more complicated, even for what’s presumably to be a ‘simple’ question of deriving RNA secondary structure from 3D coordinates in PDB. One subtlety is related to the ambiguity of atomic coordinates that renders one base apparently forming two WC pairs with two other complementary bases. As always, the case can be best illustrated with a concrete example. The image shown below is taken from PDB entry 1qp5 where C20 (on chain B) forms two WC pairs, each with G4 and G5 (on chain A) respectively.
Clearly, taking both as valid WC G–C pairs would make the resultant DBN illegitimate. DSSR resolves such discrepancies by taking structural context into consideration to ensure that one base can only have a WC pair with another base. Here the G5–C20 WC pair is retained whilst the G4–C20 WC is removed.
This issue, one base can form two WC pairs as derived from the PDB, has been noticed for a long while. Two examples from literature are shown below:
The crystal structure data files were downloaded from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (Berman et al. 2000). For each crystal structure, the set of canonical base pairs was extracted by selecting all Watson–Crick and standard G-U wobble pairs found by RNAview (Yang et al. 2003). Occasional conflicts in this list, where RNAview annotates two bases, x and y, as a standard base pair and also y and z as another conflicting base pair, were removed manually by visual inspection of the crystal structure in the program PyMOL (http://pymol.sourceforge.net/). The helix-extension data set was created by taking the canonical pairs and adding all additional base–base interactions identified by RNAview (excluding stacked bases and tertiary interactions) for which the direct neighbor was already in the collection. This means each base pair (i,j) was added if both i and j were still unpaired and if either (i + 1, j – 1) or (i –1, j + 1) were already in the set.
… From these complexes, we retrieved all RNA chains also marked as non-redundant by RNA3DHub. Each chain was annotated by FR3D. Because FR3D cannot analyze modified nucleotides or those with missing atoms, our present method does not include them either. If several models exist for a same chain, the first one only was considered. For the rest of this paper, the base pairs extracted from the FR3D annotations are those defined in the Leontis–Westhof geometric classification (24).
For each chain a secondary structure without pseudoknots was deduced from the annotated interactions, as follows. First all canonical Watson–Crick and wobble base pairs (i.e. A-U, G-C and G-U) were identified. Then, since many structures are naturally pseudoknotted, we used the K2N (25) implementation in the PyCogent (26) Python module to remove pseudoknots. Problems arise when a nucleotide is involved in several Watson–Crick base pairs (which is geometrically not feasible), probably due to an error of the automatic annotation. Those discrepancies were removed with a ad hoc algorithm such that if a nucleotide is involved in several Watson–Crick base pairs, we remove the base pair which belongs to the shortest helix.
By design, DSSR takes care of these ‘little details’, among other handy features (such as handling modified nucleotides and removing pseudoknots). By providing a robust infrastructure and comprehensive framework, DSSR allows users to focus on their research topics. If you have experience with other tools, such as RNAView and FR3D cited above, give DSSR a try: it may fit your needs better.
An article titled Simulations and electrostatic analysis suggest an active role for DNA conformational changes during genome packaging by bacteriophages has recently been published in bioRxiv. I was honored to have the opportunity collaborating with fellow researchers from University of Pennsylvania and Thomas Jefferson University in this significant piece of work.
Here is the abstract. Please download the PDF version to know more.
Motors that move DNA, or that move along DNA, play essential roles in DNA replication, transcription, recombination, and chromosome segregation. The mechanisms by which these DNA translocases operate remain largely unknown. Some double-stranded DNA (dsDNA) viruses use an ATP-dependent motor to drive DNA into preformed capsids. These include several human pathogens, as well as dsDNA bacteriophages (viruses that infect bacteria). We previously proposed that DNA is not a passive substrate of bacteriophage packaging motors but is, instead, an active component of the machinery. Computational studies on dsDNA in the channel of viral portal proteins reported here reveal DNA conformational changes consistent with that hypothesis. dsDNA becomes longer (“stretched”) in regions of high negative electrostatic potential, and shorter (“scrunched”) in regions of high positive potential. These results suggest a mechanism that couples the energy released by ATP hydrolysis to DNA translocation: The chemical cycle of ATP binding, hydrolysis and product release drives a cycle of protein conformational changes. This produces changes in the electrostatic potential in the channel through the portal, and these drive cyclic changes in the length of dsDNA. The DNA motions are captured by a coordinated protein-DNA grip-and-release cycle to produce DNA translocation. In short, the ATPase, portal and dsDNA work synergistically to promote genome packaging.
An abasic site is a location in DNA or RNA where a purine or pyrimidine base is missing. It is also termed an AP site (i.e., apurinic/apyrimidinic site) in biochemistry and molecular genetics. The abasic site can be formed either spontaneously (e.g., depurination) or due to DNA damage (occurring as intermediates in base excision repair). According to Wikipedia, “It has been estimated that under physiological conditions 10,000 apurinic sites and 500 apyrimidinic may be generated in a cell daily.”
In DSSR and 3DNA v2.x, nucleotides are recognized using standard atom names and base planarity. Thus, abasic sites are not taken as nucleotides (by default), simply because they do not have base atoms. DSSR introduced the --abasic
option to account for abasic sites, a feature useful for detecting loops with backbone connectivity.
For example, by default, DSSR identifies one internal loop (no. 1 in the list below) in PDB entry 1l2c. With the --abasic
option, two internal loops (including the one with the abasic site C.HPD18, no. 2) are detected.
List of 2 internal loops
1 symmetric internal loop: nts=6; [1,1]; linked by [#-1,#1]
summary: [2] 1 1 [B.1 C.24 B.3 C.22] 1 4
nts=6 GTATAC B.DG1,B.DT2,B.DA3,C.DT22,C.DA23,C.DC24
nts=1 T B.DT2
nts=1 A C.DA23
2 symmetric internal loop: nts=6; [1,1]; linked by [#1,#2]
summary: [2] 1 1 [B.6 C.19 B.8 C.17] 4 5
nts=6 CTTA?G B.DC6,B.DT7,B.DT8,C.DA17,C.HPD18,C.DG19
nts=1 T B.DT7
nts=1 ? C.HPD18
Note that C.HPD18 in 1l2c is a non-standard residue, as shown in the HETATM records below. Since the identity of C.HPD18 cannot be deduced from the atomic records, its one-letter code is designated as ?
.
HETATM 346 P HPD C 18 -14.637 52.299 29.949 1.00 49.12 P
HETATM 347 O5' HPD C 18 -14.658 52.173 28.359 1.00 48.28 O
HETATM 348 O1P HPD C 18 -15.167 51.040 30.537 1.00 49.35 O
HETATM 349 O2P HPD C 18 -13.303 52.798 30.369 1.00 46.43 O
HETATM 350 C5' HPD C 18 -15.703 51.469 27.687 1.00 45.70 C
HETATM 351 O4' HPD C 18 -16.364 50.501 25.561 1.00 44.15 O
HETATM 352 O3' HPD C 18 -13.990 51.738 24.335 1.00 45.75 O
HETATM 353 C1' HPD C 18 -16.105 54.187 25.684 1.00 52.47 C
HETATM 354 O1' HPD C 18 -17.309 54.085 26.496 1.00 56.16 O
HETATM 355 C3' HPD C 18 -14.756 52.250 25.426 1.00 46.23 C
HETATM 356 C4' HPD C 18 -15.263 51.093 26.291 1.00 45.72 C
HETATM 357 C2' HPD C 18 -16.030 52.889 24.898 1.00 49.05 C
In contrast, the R.U-8 in PDB entry 4ifd is a standard U, and is properly labeled by DSSR.
ATOM 26418 P U R -8 139.362 21.962 129.430 1.00208.29 P
ATOM 26419 OP1 U R -8 140.062 20.821 130.074 1.00207.30 O
ATOM 26420 OP2 U R -8 140.113 23.208 129.129 1.00208.44 O1+
ATOM 26421 O5' U R -8 138.712 21.439 128.071 1.00157.60 O
ATOM 26422 C5' U R -8 139.507 20.790 127.087 1.00155.47 C
ATOM 26423 C4' U R -8 138.843 20.804 125.731 1.00152.27 C
ATOM 26424 O4' U R -8 138.538 22.172 125.352 1.00149.29 O
ATOM 26425 C3' U R -8 139.677 20.275 124.572 1.00152.70 C
ATOM 26426 O3' U R -8 139.670 18.859 124.478 1.00155.04 O
ATOM 26427 C2' U R -8 139.053 20.969 123.369 1.00150.26 C
ATOM 26428 O2' U R -8 137.849 20.322 122.984 1.00146.83 O
ATOM 26429 C1' U R -8 138.700 22.334 123.958 1.00147.35 C
This is yet another little detail that DSSR takes care of. It is the close consideration to many such subtle points that makes DSSR different. Overall, DSSR represents my view of what a scientific software program could be (or should be).
DSSR deliberately makes a distinction between ‘stem’ and ‘helix’, as shown below:
a helix is defined by base-stacking interactions, regardless of bp type and backbone connectivity, and may contain more than one stem.
a stem is defined as a helix consisting of only canonical WC/wobble pairs, with a continuous backbone.
By definition, a helix or stem consists of at least two base-pairs with stacking interactions. Helix is more inclusive and may contain more than one stem. This differentiation between ‘helix’ and ‘stem’ naturally leads to the definition of coaxial stacking, another widely used yet vaguely specified concept.
Again, the abstract notion can be best illustrated with a concrete example. In the classic yeast phenylalanine tRNA (PDB id: 1ehz), DSSR identifies that two stems [the acceptor stem (right) and the T stem (left)] are coaxially stacked within one double helix. See the figure below.
In the above schematics cartoon-block representation, each Watson-Crick base pair is rendered as a single, long rectangular block. Base identities of the G–U wobble, and the two non-canonical pairs (left terminal) are illustrated separately, with a larger block size for purines (G and A), and a smaller size for pyrimidines (C, U, and T).
I picked up ‘stem’ as a more specialized duplex because it is widely used in the RNA stem-loop structure, and in describing the four ‘paired regions’ of the classic tRNA cloverleaf secondary structure. On the other hand, ‘helix’ is (to me at least) a more general term, and thus more inclusive. It is worth noting that other terms such as ‘arm’, ‘paired region’, or ‘helix’ etc. have also been used interchangeably in the literature to refer what DSSR designated as ‘stem’.
As a side note, the basic algorithm for identifying helixes/stems in DSSR is also applicable for detecting G-quadruplexes. The same idea of ‘helix’ or ‘stem’ also applies here (see figure below for PDB entry: 5dww). Indeed, as of v1.7.0-2017oct19, DSSR contains a new section for the identification and characterization of G-quadruplexes.
DSSR is “an integrated software tool for dissecting the spatial structure of RNA”. It excels in consolidating the diverse pieces together via a coherent framework, readily accessible in a solid software product. DSSR may well serve as a cornerstone in RNA structural bioinformatics and would facilitate communications in the broad areas related to nucleic acids structures.