
The DSSR-Jmol and DSSR-PyMOL integrations

As documented in the Overview PDF, DSSR can be easily incorporated into other structural bioinformatics pipelines. Working with Robert Hanson and Thomas Holder respectively, I initiated the integrations of DSSR into Jmol and PyMOL, two of the most popular molecular viewers. The DSSR-Jmol and DSSR-PyMOL integrations lead to unparalleled search capabilities and innovative visualization styles of 3D nucleic acid structures. They also exemplify the critical roles that a domain-specific analysis engine may play in general-purpose molecular visualization tools.

On January 27, 2016, I wrote the blogpost Integrating DSSR into Jmol and PyMOL. Four years later, these integrations have led to two peer-reviewed articles, both published in Nucleic Acids Research (NAR). This blogpost (dated 2020-09-15) highlights key features in each case and reflects on my experience in these two exciting collaborations.

The DSSR-Jmol integration

Hanson RM and Lu XJ (2017). DSSR-enhanced visualization of nucleic acid structures in Jmol. The DSSR-Jmol integration excels in its flexible, SQL-like searching of structural features, as demonstrated at the website. This work fills a gap in RNA structural bioinformatics by enabling deep analyses and SQL-like queries of RNA structural characteristics, interactively. Here are some simple examples:

SELECT WITHIN(dssr, "nts WHERE is_modified = true") # modified nucleotides
SELECT pairs # all pairs
SELECT WITHIN(dssr, "pairs WHERE name = 'Hoogsteen'") # Hoogsteen pairs
SELECT WITHIN(dssr, "pairs WHERE name != 'WC'") # non-Watson-Crick pairs
SELECT junctions # all junction loops
SELECT WITHIN(dssr, "junctions WHERE num_stems = 4") # four-way junction loops
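
The same kind of filtering can be mimicked outside Jmol. What follows is a minimal Python sketch, not the Jmol or DSSR implementation. It assumes the annotations are available as a dictionary of record lists keyed by nts, pairs, and junctions, with the field names is_modified, name, and num_stems taken from the queries above. The sample records, the nt_id/nt1/nt2 fields, and the select helper are made up for illustration.

```python
# Hypothetical, hand-made records for illustration only;
# they do not come from any real structure.
dssr = {
    "nts": [
        {"nt_id": "n1", "is_modified": False},
        {"nt_id": "n2", "is_modified": True},
    ],
    "pairs": [
        {"nt1": "n1", "nt2": "n9", "name": "WC"},
        {"nt1": "n3", "nt2": "n7", "name": "Hoogsteen"},
    ],
    "junctions": [
        {"num_stems": 3},
        {"num_stems": 4},
    ],
}


def select(data, table, predicate=lambda record: True):
    """Return the records of `table` for which `predicate` holds."""
    return [record for record in data.get(table, []) if predicate(record)]


# Python analogs of the "WHERE" clauses in the Jmol examples.
modified = select(dssr, "nts", lambda r: r["is_modified"])
all_pairs = select(dssr, "pairs")
non_wc = select(dssr, "pairs", lambda r: r["name"] != "WC")
four_way = select(dssr, "junctions", lambda r: r["num_stems"] == 4)

print([r["nt_id"] for r in modified])
print(len(all_pairs), len(non_wc), len(four_way))
```

The point of the sketch is only that each query is a table name plus a predicate on named fields, which is what makes the Jmol syntax feel SQL-like.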

In a recent email communication, Bob wrote:

How are you doing? I’m smiling, because I am remembering our incredible, animated discussions and how fun it was to work together with you on Jmol and DSSR.

The DSSR-PyMOL integration

Lu XJ (2020). DSSR-enabled innovative schematics of 3D nucleic acid structures with PyMOL. The DSSR-PyMOL integration brings unprecedented visual clarity to 3D nucleic acid structures, especially for G-quadruplexes. The four interfaces cover virtually all conceivable use cases. The easiest way to get started and quickly benefit from this work is via the web application at

I approached Thomas about writing the DSSR-PyMOL manuscript together, as we had done for the DSSR-Jmol paper. He wrote back, saying “I’m not interesting in being co-author of the paper”, adding:

But, if there is anything I can help you with, like revising the `` script, or proof-reading the PyMOL related parts of the manuscript, I’ll be happy to do so.

Indeed, Thomas helped in several aspects of the DSSR-PyMOL project, as acknowledged in the paper:

I appreciate Thomas Holder (PyMOL Principal Developer, Schrödinger, Inc.) for writing the DSSR plugin for PyMOL, and for providing insightful comments on the manuscript and the web application interface.

Enhanced vs Innovative

Some readers may have noticed the difference in the titles of the two NAR papers: “DSSR-enhanced visualization of nucleic acid structures in Jmol” vs. “DSSR-enabled innovative schematics of 3D nucleic acid structures with PyMOL”. As a matter of fact, the initial title of the DSSR-PyMOL paper was DSSR-enhanced visualization of nucleic acid structures in PyMOL, as shown in the December 02, 2019 announcement post on the 3DNA Forum.

In an era where reproducibility of “scientific” publications has become an issue and “break-throughs” are often broken or hardly hold up, I hesitate to use phrases such as “innovative”, “novel”, “paradigm shift”, etc. Instead, I often use modest words such as “refinement”, “enhance”, “improved”, “revised”, etc., and try to deliver more than claimed. However, reviewers may take solid work but modest writing as “incremental” or “unexciting”. Before submitting the DSSR-PyMOL paper, I changed the title to DSSR-enabled innovative schematics of 3D nucleic acid structures with PyMOL. Does it mean that the DSSR-PyMOL integration is more innovative than the DSSR-Jmol case? Not necessarily. I do have a paper with “innovative” in its title.




Recently, while reading the Miskiewicz et al. review article How bioinformatics resources work with G4 RNAs, I noticed the term DSSR-G4DB under the category Databases with G4-related data. It refers to the website that has been online since 2017 and is updated weekly with new G-quadruplexes from the PDB. The DSSR-G4 resource, DSSR-Enabled Automatic Identification and Annotation of G-quadruplexes in the PDB, has already been cited several times in the literature. However, I have not written up a paper on it yet, and thus have never thought carefully about a name for the resource. The term DSSR-G4DB sounds good to me, and I may well use it in the future.

Given below are the relevant quotations on DSSR and the DSSR-G4DB resource in the Miskiewicz et al. review article and my notes. The underlined headings (e.g., “Conclusion”) are those of the Miskiewicz et al. review article.

Methods: Databases with G4-related data

Currently, there exist 16 databases, which store information concerning quadruplexes. They fall into three categories: databases that collect primary or tertiary structures with experimentally verified G4s (DSSR-G4DB, G4IPDB, G4LDB, G4RNA, Lit392 and Lit638); databases storing data from high-throughput sequencing with mapped quadruplexes (GSE63874, GSE77282, GSE110582 and GSE129281); and databases of sequences with G4s identified in silico (Greglist, GRSDB2, G4-virus, Non-B DB v2.0, Plant-GQ and QuadBase2)

DSSR-G4DB [38] contains quadruplex nucleic acid structures found by DSSR in the Protein Data Bank [30], currently 354 entries. The data are annotated. Users can find information about G-tetrads, G4 helices and G4-stems and visualize the 3D models of G4 structures. Availability: webserver (http://g4.x3 Recent update: 5 June 2020.

Note: DSSR-G4DB is updated weekly. The latest update was on 2020-09-09, with 362 G-quadruplexes auto-curated with DSSR from the PDB.

Methods: Tools that analyze and visualize 2D and 3D structure

Currently, four tools can analyze and visualize G4 structures. DSSR [38] … ElTetrado [31] … RNApdbee [66, 69] … 3D-NuS [65]

DSSR [38] processes the 3D structure of the RNA molecule and annotates its secondary structure. It is a part of the 3DNA suite [67] designed to work with the structures of nucleic acids. DSSR identifies, classifies and describes base pairs, multiplets and characteristic motifs of the secondary structure; helices, stems, hairpin loops, bulges, internal loops, junctions and others. It can also detect modules and tertiary structure patterns, including pseudoknots and kink-turns. The recent extension, DSSR-PyMOL [68], allows drawing cartoon-block schemes of the 3D structure and responds to the need for simplified visualization of quadruplexes. Input data formats: PDB, mmCIF and PDB ID. Availability: standalone program, web application (http://dssr.x3,

Note: The other three tools all depend on or make use of DSSR and 3DNA:

  • ElTetrado “ElTetrado depends on DSSR (Lu, Bussemaker and Olson, 2015) in terms of detection of base pairing and stacking.”
  • RNApdbee uses 3DNA/DSSR as the default to identify base pairs.
  • 3D-NuS employs 3DNA for structural analysis and model building.
    “These filtrated structures (225 DNA and 166 RNA structures) have been used to derive the local base pair step and base pair parameters (Table S2 for DNA and Table S3 for RNA) using 3DNA software package [35] and are stored in the server for 3D-NuS modeling.”
    “Soon after the user submits input for sequence-specific modeling, the server fetches the appropriate base pair step and base pair parameters from the database and creates a 3DNA style input file. Subsequently, the template model is built using the rebuild module of 3DNA software package and subjected to energy optimization using X-plor [56] to remove steric hindrance, specifically in the mismatch-containing duplexes (Fig. 1).”

Results: Computational experiments with structure-based tools

DSSR and ElTetrado identified quadruplexes in the input PDB files. Both programs focused on structural aspects of the input molecule, explicitly informing about quadruplexes and tetrads within the structure. DSSR provided an extensive analysis of 3D structures and output the data about G-tetrads, G-helices and G4-stems. It computed planarity for each G-tetrad and gave the sections area, rise and twist parameters for G4-helix and G4-stems. The program automatically assigned loop topologies according to the predefined types (P—parallel, D—diagonal and L—lateral) and their orientation (+/−). DSSR-PyMOL generated block schemes of both quadruplexes (Figure 4A3 and B3). ElTetrado also calculated planarity, rise and twist parameters and identified strand directions for both quadruplexes. It classified the quadruplexes and their component tetrads to ONZ classes. Finally, it generated the arc diagram (Figure 4A1 and B1) and two-line dot-bracket encoding of every quadruplex.

Note: DSSR contains an undocumented option --G4. With the ONZ variant, i.e., --g4=onz (case does not matter), DSSR also outputs the ONZ classification of G-tetrads from the same chain.
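
As a minimal sketch of how the option could be scripted, the Python snippet below only assembles the command line and does not run anything. It assumes the DSSR executable is named x3dna-dssr and takes the input file as -i=FILE. The file name 2rqj.pdb and the helper dssr_g4_command are hypothetical, chosen for illustration.

```python
import shlex


def dssr_g4_command(structure_file, onz=True):
    """Build (but do not run) a DSSR command line for G4 analysis.

    Assumptions: the executable is called 'x3dna-dssr' and accepts
    '-i=FILE'; '--g4' and '--g4=onz' are the options named in the note.
    """
    command = ["x3dna-dssr", f"-i={structure_file}"]
    command.append("--g4=onz" if onz else "--g4")
    return command


# Hypothetical input file name, used only to show the resulting command.
command = dssr_g4_command("2rqj.pdb")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run on a machine where DSSR is installed.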


DSSR comprehensively examines the G4 structure, determines a variety of its parameters and provides the schematic 3D view.

It is worth noting that DSSR has been categorized under “Databases with G4-related data” and “Tools that analyze and visualize 2D and 3D structure” of the Methods section. It is not a tool that predicts G4 location in the sequence. There are 14 tools listed in “Table 2. Selected features of PQS prediction tools”, including G4Hunter and QGRS Mapper etc.



Two citations to the DSSR-PyMOL NAR paper

Via Google Scholar, I noticed the following two citations to the DSSR-enabled innovative schematics of 3D nucleic acid structures with PyMOL paper recently published in Nucleic Acids Research (NAR):

Here are the direct quotations on the DSSR-PyMOL paper from these two citations.

The Miskiewicz et al. paper, in Briefings in Bioinformatics (2020):

DSSR [38] processes the 3D structure of the RNA molecule and annotates its secondary structure. It is a part of the 3DNA suite [67] designed to work with the structures of nucleic acids. DSSR identifies, classifies and describes base pairs, multiplets and characteristic motifs of the secondary structure; helices, stems, hairpin loops, bulges, internal loops, junctions and others. It can also detect modules and tertiary structure patterns, including pseudoknots and kink-turns. The recent extension, DSSR-PyMOL [68], allows drawing cartoon-block schemes of the 3D structure and responds to the need for simplified visualization of quadruplexes.

DSSR-PyMOL generated block schemes of both quadruplexes (Figure 4A3 and B3).

Visualization of PDB entries 2RQJ and 6GE1

Figure 4: Visualization of (A) 2RQJ and (B) 6GE1 structures generated by (1) ElTetrado, (2) RNApdbee and (3) DSSR-PyMOL.

The Caruso et al. paper in bioRxiv (2020):

Next, the structural model of the N-NTD:dsTRS (5’–UCUAAAC–3’) complex was generated from the lowest-energy structure of the N-NTD:dsNS complex, derived from the cluster with the lowest HADDOCK score, by mutating the dsRNA sequence using w3DNA (29). Therefore, both complexes have identical geometries, varying only the dsRNA sequences. Structural conformation of the constructed model for N-NTD:dsTRS complex was displayed using the web application for easy creation of DSSR (Dissecting the Spatial Structure of RNA)-PyMOL schematics (32).

Structural model of the N-NTD:dsRNA complex and its validation from MD simulations

Figure 1: Structural model of the N-NTD:dsRNA complex and its validation from molecular dynamics simulations. (A) Structural model of the N-NTD:dsTRS complex determined by molecular docking calculations and mutation of dsNS nucleotide sequence. N-NTD is presented as purple cartoon and dsTRS is denoted as a ribbon model with base pairing as colored rectangles. The color of the rectangles corresponds to the nitrogenous base of the dsRNA sense strand, namely A: red, C: yellow, U: cyan, and G: green. The large protruding β2-β3 loop is referred to as the finger. (B) …

Analysis of the intramolecular (dsRNAs) and intermolecular (N-NTD:dsRNAs) H-bonds

Figure 3: Analysis of the intramolecular (dsRNAs) and intermolecular (N-NTD:dsRNAs) hydrogen bonds. (A) … (B) … (C) Structural model of the N-NTD:dsTRS complex representative of the MD simulation for run 5. The protein is shown in purple cartoon and dsTRS is denoted as a ribbon model with nitrogenous bases and base-pairing as colored squares and rectangles, respectively. The color of the squares corresponds to the type of nitrogenous base, namely A: red, C: yellow, U: cyan, and G: green, while the rectangles refer to the nitrogenous base color of the dsRNA sense strand.

It is really a pleasure to see the DSSR-PyMOL paper being cited so soon after its publication. I am always curious to see how DSSR is cited in the literature. Indeed, over the years, following citations to DSSR has become an effective way for me to stay informed of directly relevant references. Reading these citing articles motivates me to further improve DSSR.



DSSR-PyMOL schematics recommended in Faculty Opinions

Recently, while visiting the NAR website on DSSR-enabled innovative schematics of 3D nucleic acid structures with PyMOL, I noticed a big red circle near “View Metrics”. I was quite curious to see what it meant. After a few clicks, I was delighted to read the following recommendation in Faculty Opinions by Quentin Vicens:

I really enjoyed “playing” with the revised and expanded version of Dissecting the Spatial Structure of RNA (DSSR) described by Xiang-Jun Lu in this July issue of NAR. The software is known to generate ‘block view’ representations of nucleic acids that make many parameters more immediately visible, such as base composition, stacking, and groove depth. This new version includes Watson-Crick pairs shown as single rectangles, and G quadruplexes as large squares, making such regions more quickly distinguishable from other regions within an overall tertiary structure. I was amazed at how simple and effective the web interface was, and I liked the possibility to download a PyMOL session to look at molecules under different angles. If need be, blocks can be further edited in PyMOL using the provided plugin (see on page 35). I highly recommend it!

The DSSR-PyMOL schematics paper/website has been rated “Very Good”, and classified as “Good for Teaching”. See Vicens Q: Faculty Opinions Recommendation of [Lu XJ, Nucleic Acids Res 2020 48(13):e74]. In Faculty Opinions, 14 Aug 2020; 10.3410/f.738001682.793577327.



DSSR 2.0 is licensed by Columbia University

DSSR 2.0 is out. It integrates an unprecedented set of features into one computational tool, including analysis/annotation, schematic visualization, and model building of 3D nucleic acid structures. DSSR 2.0 supersedes 3DNA 2.4, which is still maintained, but only bug fixes (no additional features) are scheduled for it. See the DSSR 2.0 overview PDF.

DSSR delivers a great user experience by solving problems and saving time. Considering its usability, interoperability, features, and support, DSSR easily stands out among ‘competitors’. It exemplifies a ‘solid software product’. I strive to make DSSR a pragmatic tool that the structural bioinformatics community can count on.

DSSR 2.0 is licensed by Columbia University. The software remains free for academic users, with the basic user manual. The professional user manual (over 230 pages, including 7 appendices) is available for paid academic users or commercial users only. Licensing revenue helps ensure the long-term sustainability of the DSSR project.

Additionally, the paper “DSSR-enabled innovative schematics of 3D nucleic acid structures with PyMOL” has recently been published in Nucleic Acids Research, 48(13):e74. Check the web interface.

The DSSR-PyMOL paper/website has been rated “very good” and classified as “Good for Teaching”. See Vicens Q: Faculty Opinions Recommendation of [Lu XJ, Nucleic Acids Res 2020 48(13):e74]. In Faculty Opinions, 14 Aug 2020; 10.3410/f.738001682.793577327



LW base-pair classifications derived using DSSR

I recently came across the article FMN riboswitch aptamer symmetry facilitates conformational switching through mutually exclusive coaxial stacking configurations by Wilt et al. in the Journal of Structural Biology: X (JSBX). In the caption to Figure S1, “Secondary structure map of the FMN riboswitch”, the authors wrote:

Base-pairing is annotated using Leontis-Westhoff nomenclature (Leontis and Westhof, 2001), derived using 3DNA-DSSR (Lu and Olson, 2003), and the map was generated using VARNA (Darty et al., 2009).

It is a nice surprise to see that 3DNA-DSSR is cited this way. The LW scheme is based on the three edges of each base with potential for H-bonding interactions (Watson-Crick, Hoogsteen, and Sugar), and the two orientations (cis or trans) of the interacting bases with respect to the glycosidic bonds. The combinations of edges and orientations (3 × 2 × 2) “gives rise to 12 basic geometric types with at least two H bonds connecting the bases” (Leontis and Westhof, 2001). This geometry-based approach captures salient features of pairing interactions and strikes a balance between simplicity and expressiveness. The LW scheme is more widely applicable than the Saenger classification, and more intuitive to biologists. As a result, the LW classification has become a standard in RNA structural bioinformatics.
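
One way to see where the count of 12 comes from is to enumerate the six unordered pairings of the three edges, each in the two orientations. The short Python sketch below does just that. The three-letter labels (such as cWW) follow the common LW shorthand, and the variable names are mine, for illustration only.

```python
from itertools import combinations_with_replacement

EDGES = "WHS"        # Watson-Crick, Hoogsteen, Sugar
ORIENTATIONS = "ct"  # cis, trans

# Each unordered edge pairing (WW, WH, WS, HH, HS, SS) in both orientations.
lw_types = [
    orientation + edge1 + edge2
    for edge1, edge2 in combinations_with_replacement(EDGES, 2)
    for orientation in ORIENTATIONS
]

print(len(lw_types))
print(" ".join(lw_types))
```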

However, the RNA-centric LW classification has inherent limitations. For example, the Sugar edge explicitly includes the 2′-hydroxyl group, rendering it less applicable to DNA structures. Additionally, while the aromatic base can be taken as a rigid body with three fixed edges, the χ (chi) torsion angle characterizes the internal freedom between base and sugar (anti vs. syn). When χ is in the relatively rare (but not uncommon) syn conformation (especially abundant in G-quadruplexes), the Sugar edge, defined with reference to the common anti conformation, seems to no longer exist. The rich variety of RNA pairs extends beyond the 12 basic LW types. There are numerous pairs in RNA with only one H-bond or with bifurcated H-bonds, at boundary locations where the LW classification does not strictly apply. Lemieux and Major (2002) were the first to extend the LW classification. We noted the importance of the out-of-plane ‘backbone edge’ formed by an RNA-specific H-bond between O2′(G) and OP2 (Lu et al., 2010). Finally, the RNA 3D Hub website, hosted by the Leontis-Zirbel team, lists pairing interactions that do not fall into the 12 geometric types. For example, the page for 1msy contains pairing types ncSW, ntSH, and ntHH. Note that the terms nc (in ncSW) and nt (in ntSH/ntHH) do not have their normal meanings in the literature; they stand for near cis and near trans, respectively.

Summary of steps used to identify nucleic acid structural components using DSSR

As shown in the figure above, DSSR adopts a base-centric terminology for the three edges. In principle, M (Major groove) in the DSSR classification corresponds to the Hoogsteen/CH-edge (H) in the LW notation, and the DSSR m (minor groove) to the LW Sugar-edge (S) if χ is in the anti conformation. In practice, direct DSSR/LW correspondences M/H and m/S are assumed, regardless of anti/syn base conformation. Moreover, the cis/trans assignment is the same for both notations. Within the DSSR implementation, the LW and DSSR classifications are thus strictly parallel in terms of cis/trans orientation and interacting edges. The DSSR scheme has the extra ± for relative base orientations.
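
The correspondence described above amounts to a simple relabeling. The Python sketch below is an illustration under stated assumptions, not DSSR's own code. It assumes a DSSR pair name is a four-character string of the form orientation, edge, sign, edge (for example tW+M), and it applies the direct M/H and m/S mapping regardless of anti/syn conformation. The function name dssr_to_lw is hypothetical.

```python
# Edge relabeling stated in the text: W stays W, M maps to H, m maps to S.
EDGE_MAP = {"W": "W", "M": "H", "m": "S"}


def dssr_to_lw(dssr_name):
    """Convert an assumed DSSR name such as 'tW+M' to the LW label 'tWH'.

    The +/- sign for relative base orientation is specific to the DSSR
    scheme, so it is checked and then dropped.
    """
    if len(dssr_name) != 4:
        raise ValueError(f"expected 4 characters, got {dssr_name!r}")
    orientation, edge1, sign, edge2 = dssr_name
    if orientation not in ("c", "t") or sign not in ("+", "-"):
        raise ValueError(f"unrecognized name: {dssr_name!r}")
    if edge1 not in EDGE_MAP or edge2 not in EDGE_MAP:
        raise ValueError(f"unrecognized edge in {dssr_name!r}")
    return orientation + EDGE_MAP[edge1] + EDGE_MAP[edge2]


for name in ("cW-W", "tW+M", "cm-m"):
    print(name, "->", dssr_to_lw(name))
```

Names that do not fit the assumed pattern raise an error rather than being guessed at, which mirrors the caution about boundary cases.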

The LW classifications implemented in DSSR may differ from those listed in the RNA 3D Hub website or other resources. These discrepancies normally occur in boundary cases where the assignment of cis/trans and interaction edges can be ambiguous. For ‘authentic’ LW classification results, users should consult the original publication of Leontis and Westhof (2001) and use the RNAView (Yang et al., 2003) or FR3D (Sarver et al., 2008) tools instead of DSSR.



Citations to 3DNA publications in the Web of Science

Recently I performed a survey of citations to thirteen 3DNA-related publications using Web of Science from Clarivate Analytics. The time range is from 2015 to 2020 (June 30), a total of five and a half years. The 1,050 citations span 223 scientific journals, covering a broad range of research fields such as biology, medicine, chemistry, physics, materials, etc. Not surprisingly, the citing journals include Cell, Nature and sub-journals, Science, and PNAS.

Each of the following six papers has been cited over 50 times, as detailed below. Together, the six account for 962 citations, or 92% of the total 1,050.

  1. [402 times in 138 journals] Lu,X.-J. and Olson,W.K. (2003) 3DNA: A software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures. Nucleic Acids Res., 31, 5108–5121.
  2. [201 times in 81 journals] Lu,X.-J. and Olson,W.K. (2008) 3DNA: A versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nat. Protoc., 3, 1213–1227.
  3. [127 times in 71 journals] Zheng,G., Lu,X.J. and Olson,W.K. (2009) Web 3DNA––a web server for the analysis, reconstruction, and visualization of three-dimensional nucleic-acid structures. Nucleic Acids Res, 37, W240-6.
  4. [115 times in 57 journals] Olson,W.K., Bansal,M., Burley,S.K., Dickerson,R.E., Gerstein,M., Harvey,S.C., Heinemann,U., Lu,X.-J., Neidle,S., Shakked,Z., Sklenar,H., Suzuki,M., Tung,C.-S., Westhof,E., Wolberger,C. and Berman,H.M. (2001) A standard reference frame for the description of nucleic acid base-pair geometry. J. Mol. Biol., 313, 229–237.
  5. [66 times in 32 journals] Lu,X.-J., Bussemaker,H.J. and Olson,W.K. (2015) DSSR: An integrated software tool for dissecting the spatial structure of RNA. Nucleic Acids Res., 43, e142.
  6. [51 times in 41 journals] Lu,X.J., Shakked,Z. and Olson,W.K. (2000) A-form conformational motifs in ligand-bound DNA structures. J. Mol. Biol., 300, 819–40.
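
The totals quoted for these six papers can be checked with a few lines of Python. This is just the arithmetic from the list above, with variable names of my own choosing.

```python
# Citation counts of the six papers, in the order listed above.
counts = [402, 201, 127, 115, 66, 51]
total_citations = 1050  # all thirteen 3DNA-related publications

subtotal = sum(counts)
share_percent = round(100 * subtotal / total_citations)

print(subtotal, share_percent)
```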

The top 21 journals that cite 3DNA papers 10 times or more are listed below. Nucleic Acids Research stands out, with a total of 148 citations, accounting for 14% of the total 1,050 citations.

148	Nucleic Acids Research
84	Journal of Physical Chemistry B
40	Physical Chemistry Chemical Physics
34	Biophysical Journal
33	Journal of Chemical Theory and Computation
29	Biochemistry
29	RNA
24	PLoS One
24	Scientific Reports
22	Journal of Biomolecular Structure & Dynamics
20	Bioinformatics
19	Journal of Chemical Information and Modeling
16	Nature Communications
15	Biopolymers
15	Journal of the American Chemical Society
12	Acta Crystallographica Section D: Structural Biology
12	Journal of Molecular Modeling
11	Chemistry: a European Journal
11	Journal of Chemical Physics
10	Journal of Biological Chemistry
10	Structure



Over 5000 registrations on the 3DNA Forum

As I am writing this blogpost on June 26, 2020, the number of registrations on the 3DNA Forum has reached 5,054. The numbers were 3,000 on October 15, 2016, 2,000 on February 3, 2015, and 1,000 on February 27, 2013, respectively. For year 2020, the monthly registrations are 36 (January), 35 (February), 54 (March), 84 (April), 69 (May). As of June 26, the number for June is 56, which will more than likely pass 60 by the end of this month. The Covid-19 pandemic does not seem to be having a negative effect on the registrations.

The over 5,000 registrations are from users all over the world. The 3DNA Forum remains spam free, and all questions are promptly answered. It is functioning well; certainly better than I originally imagined.

Overall, the Forum serves as a virtual platform for me to interact effectively with the ever-increasing user community. I greatly enjoy answering questions, fixing bugs, and making 3DNA/DSSR/SNAP better tools for real-world applications.



Cover images of the RNA Journal in 2020

Following my previous post 3DNA/blocview-PyMOL images in covers of the RNA journal in 2019, here is an update for 2020. The cover images of the January to July issues have all been generated with the help of 3DNA and provided by the NDB:

RNA is displayed as a red ribbon; block bases use NDB colors: A—red, C—yellow, G—green, U—cyan. The image was generated using 3DNA/blocview and PyMol software. Cover image provided by the Nucleic Acid Database (

Here is the composite figure of the seven cover images, with the brand new DSSR-PyMOL schematics for comparison.

3DNA/blocview-PyMOL and DSSR-PyMOL cartoon-block schematics in the covers of the RNA journal in 2020

Details of the seven structures illustrated in the cover images are described below:

  1. January 2020 Pumilio homolog PUF domain in complex with RNA (PDB id: 5yki; Zhao YY, Mao MW, Zhang WJ, Wang J, Li HT, Yang Y, Wang Z, Wu JW. 2018. Expanding RNA binding specificity and affinity of engineered PUF domains. Nucleic Acids Res 46: 4771–4782). Engineered nine-repeat PUF domain binds to its RNA target specifically and with high binding affinity.
  2. February 2020 Aprataxin RNA–DNA deadenylase product complex (PDB id: 6cvo; Tumbale P, Schellenberg MJ, Mueller GA, Fairweather E, Watson M, Little JN, Krahn J, Waddell I, London RE, Williams RS. 2018. Mechanism of APTX nicked DNA sensing and pleiotropic inactivation in neurodegenerative disease. EMBO J 37: e98875). Human aprataxin RNA–DNA deadenylase protects genome integrity and corrects abortive DNA ligation arising during ribonucleotide excision repair and base excision DNA repair.
  3. March 2020 PreQ1 riboswitch (PDB id: 6e1w; Connelly CM, Numata T, Boer RE, Moon MH, Sinniah RS, Barchi JJ, Ferre-D’Amare AR, Schneekloth Jr JS. 2019. Synthetic ligands for PreQ1 riboswitches provide structural and mechanistic insights into targeting RNA tertiary structure. Nat Commun 10: 1501). Class I PreQ1 riboswitch regulates downstream gene expression in response to its cognate ligand PreQ1 (7-aminomethyl-7-deazaguanine).
  4. April 2020 Hatchet ribozyme (PDB id: 6jq6; Zheng L, Falschlunger C, Huang K, Mairhofer E, Yuan S, Wang J, Patel DJ, Micura R, Ren A. 2019. Hatchet ribozyme structure and implications for cleavage mechanism. Proc Natl Acad Sci 116: 10783–10791). This crystal structure of the hatchet ribozyme product features a compact symmetric dimer.
  5. May 2020 Adenovirus virus-associated RNA (PDB id: 6ol3; Hood IV, Gordon JM, Bou-Nader C, Henderson FE, Bahmanjah S, Zhang J. 2019. Crystal structure of an adenovirus virus-associated RNA. Nat Commun 10: 2871). Acutely bent viral RNA fragment is a protein kinase R inhibitor and features an unusually structured apical loop, a wobble-enriched, coaxially stacked apical and tetra-stems, and a central domain pseudoknot that resembles codon-anticodon interactions.
  6. June 2020 Archeoglobus fulgidus L7Ae bound to cognate K-turn (PDB id: 6hct; Huang L, Ashraf S, Lilley DMJ. 2019. The role of RNA structure in translational regulation by L7Ae protein in archaea. RNA 25: 60–69). 50S archaeal ribosome protein L7Ae binds to a K-turn structure in the 5′-leader of the mRNA of its structural gene to regulate translation.
  7. July 2020 Spinach RNA aptamer/Fab complex (PDB id: 6b14; Koirala D, Shelke SA, Dupont M, Ruiz S, DasGupta S, Bailey LJ, Benner SA, Piccirilli JA. 2018. Affinity maturation of a portable Fab-RNA module for chaperone-assisted RNA crystallography. Nucleic Acids Res 46: 2624–2635). Novel Fab-RNA module can serve as an affinity tag for RNA purification and imaging and as a chaperone for RNA crystallography.



Paper on DSSR-PyMOL schematics

The paper, titled DSSR-enabled innovative schematics of 3D nucleic acid structures with PyMOL, has just been published in Nucleic Acids Research (online on May 22, 2020). Here is the abstract:

Sophisticated analysis and simplified visualization are crucial for understanding complicated structures of biomacromolecules. DSSR (Dissecting the Spatial Structure of RNA) is an integrated computational tool that has streamlined the analysis and annotation of 3D nucleic acid structures. The program creates schematic block representations in diverse styles that can be seamlessly integrated into PyMOL and complement its other popular visualization options. In addition to portraying individual base blocks, DSSR can draw Watson-Crick pairs as long blocks and highlight the minor-groove edges. Notably, DSSR can dramatically simplify the depiction of G-quadruplexes by automatically detecting G-tetrads and treating them as large square blocks. The DSSR-enabled innovative schematics with PyMOL are aesthetically pleasing and highly informative: the base identity, pairing geometry, stacking interactions, double-helical stems, and G-quadruplexes are immediately obvious. These features can be accessed via four interfaces: the command-line interface, the DSSR plugin for PyMOL, the web application, and the web application programming interface. The supplemental PDF serves as a practical guide, with complete and reproducible examples. Thus, even beginners or occasional users can get started quickly, especially via the web application at

A brief history on DNA/RNA schematics as implemented in SCHNAaP/SCHNArP, 3DNA, and now in DSSR:

The idea of representing bases and WC-pairs as rectangular blocks came from the pioneering work of Calladine et al. (27,28) The block schematics were first implemented in the pair of SCHNAaP/SCHNArP programs (29,30) for rigorous analysis and reversible rebuilding of double-helical nucleic acid structures. The algorithms that underpinned SCHNAaP/SCHNArP laid the foundation of ‘analyze’ and ‘rebuild’, two core components of the 3DNA suite of programs (31–33). 3DNA also takes advantage of the standard base reference frame (34), and comprises quite a few other related programs. One of them is ‘blocview’, a script which calls several 3DNA utility programs to generate individual base blocks and set the view, MolScript (35) to produce backbone ribbons, and Raster3D (36) to render the composite image. The 3DNA ‘blocview’ schematics catch characteristic attributes of nucleic acid structures. They have gradually become popular and been adopted into the RCSB PDB (1) and the NDB (37), and then propagated into other bioinformatics resources (e.g., the ‘RNA Structure Atlas’ website hosted by the Leontis-Zirbel RNA group).

DSSR supersedes ‘blocview’ by eliminating all the internal and external dependencies of the 3DNA utility program. DSSR produces block representations, not only of individual bases but also WC-pairs and G-tetrads, that can be fed directly into PyMOL. The DSSR-PyMOL integration is easier to use, has more features, and produces better schematics than the original 3DNA-blocview approach.

DSSR-PyMOL schematic for PDB entry 6ol3: schematic image auto-generated via the DSSR-PyMOL integration.
3DNA-blocview-PyMOL schematic for PDB entry 6ol3: cover image of the May 2020 issue of the RNA Journal, “generated using 3DNA/blocview and PyMol software by the Nucleic Acid Database”.

Indeed, the base block schematics have continuously evolved for over two decades, as appreciated in the acknowledgements:

I would like to thank Christopher A. Hunter, Christopher R. Calladine, Helen M. Berman, Catherine L. Lawson, Zukang Feng, Wilma K. Olson and Harmen J. Bussemaker for their helpful input on the block schematic during its continuous evolution for over two decades. I appreciate Thomas Holder (PyMOL Principal Developer, Schrödinger, Inc.) for writing the DSSR plugin for PyMOL, and for providing insightful comments on the manuscript and the web application interface. I also thank Jessalyn Lu and Yin Yin Lu for proofreading the manuscript, and the user community for feedback.

Notably, the supplemental PDF has been diligently written to serve as a practical guide, with complete and reproducible examples. In fact, the paper concludes with the following two sentences:

Finally, all results reported here are completely reproduceable (see the supplemental PDF). Any questions related to this work are welcome and will be openly addressed on the 3DNA Forum (



Context-aware in silico base mutations enabled by DSSR 2.0

As of version 2.0 (to be released soon), DSSR has a new module for in silico base mutations that is context sensitive. Powered by the DSSR analysis engine, the module allows users to perform base mutations with unprecedented flexibility and convenience. Here are some examples:

  • Mutate all bases in hairpin loops to a specific base (e.g., G)
  • Mutate all non-stem bases to a specific base (e.g., U)
  • Mutate bases 2-12 to a specific base (e.g., A) regardless of context
  • Mutate bases 1-10 in a given structure to a new sequence (e.g., AUAUAUAUAU)
  • Mutate all bases of the same type to another (e.g., A to G)
  • Mutate all bases of the same type to another (e.g., C to U) except for some nucleotides
  • Mutate all G-C Watson-Crick (WC) pairs to C-G WC pairs, and A-U to U-A
  • Mutate all G-tetrads in G-quadruplexes to non-G-tetrads (e.g., U-tetrads)

By default, the mutation preserves both the geometry of the sugar-phosphate backbone and the base reference frame (position and orientation). As a result, re-analyzing the mutated model gives the same base-pair and step parameters as those of the original structure.
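To make the notion of "context-sensitive" selection concrete, here is a minimal sequence-level sketch. It is not DSSR's interface or implementation: DSSR works on 3D coordinates and keeps the backbone and base reference frames, whereas this toy example only rewrites letters in a sequence, using a dot-bracket string as a stand-in for the structural context. The function names and the sample sequence are made up for illustration.

```python
import re


def mutate_hairpin_loops(seq: str, dbn: str, new_base: str = "G") -> str:
    """Replace every base located in a hairpin loop with new_base.

    seq -- base sequence, e.g. "GGGAAAUCCCAA"
    dbn -- dot-bracket notation of the same length (no pseudoknots assumed)
    """
    if len(seq) != len(dbn):
        raise ValueError("sequence and dot-bracket string must have equal length")
    out = list(seq)
    # A hairpin loop is a run of unpaired positions closed directly by
    # one base pair, i.e. "(" + dots + ")" in dot-bracket notation.
    for match in re.finditer(r"\((\.+)\)", dbn):
        for i in range(match.start(1), match.end(1)):
            out[i] = new_base
    return "".join(out)


def mutate_non_stem(seq: str, dbn: str, new_base: str = "U") -> str:
    """Replace every unpaired (non-stem) base with new_base."""
    if len(seq) != len(dbn):
        raise ValueError("sequence and dot-bracket string must have equal length")
    return "".join(new_base if s == "." else b for b, s in zip(seq, dbn))


if __name__ == "__main__":
    seq = "GGGAAAUCCCAA"
    dbn = "(((....))).."
    print(mutate_hairpin_loops(seq, dbn, "G"))
    print(mutate_non_stem(seq, dbn, "U"))
```

The point of the sketch is the selection step: the same "mutate to G" operation touches different positions depending on whether the context is "hairpin loop" or "any non-stem base".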

Over the years, the 3DNA mutate_bases program has been cited in the literature and in patents, including the following:

The DSSR mutation module has completely obsoleted the mutate_bases program distributed in 3DNA v2.x. In addition to serving as a drop-in replacement for mutate_bases, the DSSR approach offers many more features and greater versatility: it is simply better.



May's article on "The Best Ways to Study DNA and Protein Interactions"

In late March, I was approached by Mike May. He was then writing an article for Biocompare about DNA-protein interactions and asked me to answer a few questions on “What features of 3DNA be used in studying DNA-protein interactions?” and “Please provide 1-2 examples.” Initially, I was a bit surprised by the contact. Thus, I visited his online profile and Amazon Author Page. I also read a couple of his previous publications. Impressed by his track record, I answered his requests, and our subsequent communications were as smooth and professional as I could have ever imagined.

The paper The Best Ways to Study DNA and Protein Interactions has now been published, and is freely accessible. It includes the following content:

3DNA creator and maintainer Xiang-Jun Lu mentioned a couple of ways that the software has been used. For example, he noted that “3DNA can analyze all DNA-protein complexes in the Protein Data Bank—PDB—in an automatic, consistent, and robust manner,” and other bioinformatic resources have adopted this feature of 3DNA. He added that scientists have used 3DNA to “understand the structural basis on how transcription factors recognize methylated DNA.” Moreover, 3DNA is continuously developed. A new feature of 3DNA is the automatic identification and comprehensive characterization of G-quadruplexes, a noncanonical DNA structure formed from guanine-rich base sequences.

The bioinformatics resource I used as an example is the paper DNAproDB: an expanded database and web-based tool for structural analysis of DNA–protein complexes by the Rohs lab. The phrase “to understand the structural basis on how transcription factors recognize methylated DNA” refers to the article Toward a mechanistic understanding of DNA methylation readout by transcription factors by the Bussemaker lab. Both works employed DSSR and SNAP, two sophisticated programs I created and maintained over the past ten years, and they have largely obsoleted the original 3DNA suite of programs.

The image I provided is a DSSR-PyMOL schematic based on PDB entry 6LDM. The 6LDM picture features a G-quadruplex, for which DSSR comes with an unmatched set of features (including automatic identification and comprehensive annotations). See the page for survey results, curated using DSSR, of all G-quadruplexes from the PDB.

This image of a protein-DNA complex (PDB entry 6LDM) shows the protein (purple), the DNA G-quadruplex (green) and thymine (blue). The image was created using the 3DNA-DSSR program and PyMOL. Image courtesy of Xiang-Jun Lu.

DSSR-PyMOL schematic for PDB entry 6ldm



DSSR is cited in a Nature paper on RIC-seq for profiling RNA–RNA interactions

I recently read the paper RIC-seq for global in situ profiling of RNA–RNA spatial interactions published in Nature by the Yuanchao Xue team from the Chinese Academy of Sciences. The abstract is as below:

Highly structured RNA molecules usually interact with each other, and associate with various RNA-binding proteins, to regulate critical biological processes. However, RNA structures and interactions in intact cells remain largely unknown. Here, by coupling proximity ligation mediated by RNA-binding proteins with deep sequencing, we report an RNA in situ conformation sequencing (RIC-seq) technology for the global profiling of intra- and intermolecular RNA–RNA interactions. This technique not only recapitulates known RNA secondary structures and tertiary interactions, but also facilitates the generation of three-dimensional (3D) interaction maps of RNA in human cells. Using these maps, we identify noncoding RNA targets globally, and discern RNA topological domains and trans-interacting hubs. We reveal that the functional connectivity of enhancers and promoters can be assigned using their pairwise-interacting RNAs. Furthermore, we show that CCAT1-5L—a super-enhancer hub RNA—interacts with the RNA-binding protein hnRNPK, as well as RNA derived from the MYC promoter and enhancer, to boost MYC transcription by modulating chromatin looping. Our study demonstrates the power and applicability of RIC-seq in discovering the 3D structures, interactions and regulatory roles of RNA.

The Methods part contains the following section, where DSSR is cited along with several other software tools:

Structural analysis of 28S rRNA. The RIC-seq reads aligned to 45S pre-rRNA (NR_046235.3) were collected and used to construct the interaction matrix shown in Fig. 1h. A Knight–Ruiz normalization algorithm, widely used in the normalization of Hi-C contact matrices51, was applied to eliminate sequencing bias along rRNA. For building the physical interaction map of 28S rRNA, the cryo-EM model of human 80S ribosome (RCSB Protein Data Bank (PDB) ID 4V6X) was downloaded, and the spatial distances between every 5-nt bin in 28S rRNA were calculated using the mean spatial coordinates of carbon atoms in each 5-nt bin. Watson–Crick and non-Watson–Crick base pairs were identified using the DSSR software52. The 3D structure of the ribosome was visualized by the PyMOL system (Educational version, https:// For the missing structures in 28S rRNA, we combined intramolecular RNA–RNA interactions detected by RIC-seq with the RNAstructure algorithm53 to deduce their 2D structures.
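The binned-distance step in the quoted Methods text can be sketched as follows. This is only my reading of that one sentence, not the authors' code: the input layout (a list of per-nucleotide carbon-atom coordinates), the function names, and the handling of a trailing partial bin are all assumptions.

```python
import math


def bin_centroids(nt_coords, bin_size=5):
    """Mean coordinates of all carbon atoms in each consecutive bin.

    nt_coords -- list with one entry per nucleotide; each entry is a list
                 of (x, y, z) tuples for that nucleotide's carbon atoms
    """
    centroids = []
    for start in range(0, len(nt_coords), bin_size):
        # Pool the carbon atoms of all nucleotides in this bin.
        atoms = [xyz for nt in nt_coords[start:start + bin_size] for xyz in nt]
        if not atoms:
            raise ValueError(f"no atoms in bin starting at nucleotide {start}")
        n = len(atoms)
        centroids.append(tuple(sum(a[k] for a in atoms) / n for k in range(3)))
    return centroids


def distance_matrix(centroids):
    """Pairwise Euclidean distances between bin centroids."""
    return [[math.dist(a, b) for b in centroids] for a in centroids]


if __name__ == "__main__":
    # Synthetic data: 10 nucleotides, one atom each, spaced 1 unit apart on x.
    coords = [[(float(i), 0.0, 0.0)] for i in range(10)]
    for row in distance_matrix(bin_centroids(coords)):
        print(row)
```

With real data, the per-nucleotide coordinates would come from the atomic coordinate file of the ribosome model; that parsing step is left out here.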

There are several other well-known programs for identifying and annotating RNA base pairs, including RNAView, FR3D, and MC-Annotate. One may wonder why DSSR is used here. In addition to asking the authors, interested viewers could simply test for themselves: try the different tools on PDB entry 4V6X and see what happens.

It is worth mentioning that a new DSSR-related paper “DSSR-enabled innovative schematics of 3D nucleic acid structures with PyMOL” has recently been accepted for publication in Nucleic Acids Research. I will shortly write another post on this topic when this paper is officially published online. To see DSSR-PyMOL schematics in action, please visit Here is the abstract of the new DSSR-PyMOL article:

Sophisticated analysis and simplified visualization are crucial for understanding complicated structures of biomacromolecules. DSSR (Dissecting the Spatial Structure of RNA) is an integrated computational tool that has streamlined the analysis and annotation of 3D nucleic acid structures. The program creates schematic block representations in diverse styles that can be seamlessly integrated into PyMOL and complement its other popular visualization options. In addition to portraying individual base blocks, DSSR can draw Watson-Crick pairs as long blocks and highlight the minor-groove edges. Notably, DSSR can dramatically simplify the depiction of G-quadruplexes by automatically detecting G-tetrads and treating them as large square blocks. The DSSR-enabled innovative schematics with PyMOL are aesthetically pleasing and highly informative: the base identity, pairing geometry, stacking interactions, double-helical stems, and G-quadruplexes are immediately obvious. These features can be accessed via four interfaces: the command-line interface, the DSSR plugin for PyMOL, the web application, and the web application programming interface. The supplemental PDF serves as a practical guide, with complete and reproducible examples. Thus, even beginners or occasional users can get started quickly, especially via the web application at



SARS-CoV-2, RNA G-Quadruplexes and 3DNA

I recently noticed a bioRxiv preprint, titled Role of RNA Guanine Quadruplexes in Favoring the Dimerization of SARS Unique Domain in Coronaviruses by a European team consisting of scientists from France, Italy, and Spain. The abstract is as follows. Figure 1 shows a schematic representation of the mRNA with a G-Quadruplex structure, functioning in a healthy cell and an infected cell by coronavirus.

Coronaviruses may produce severe acute respiratory syndrome (SARS). As a matter of fact, a new SARS-type virus, SARS-CoV-2, is responsible of a global pandemic in 2020 with unprecedented sanitary and economic consequences for most countries. In the present contribution we study, by all-atom equilibrium and enhanced sampling molecular dynamics simulations, the interaction between the SARS Unique Domain and RNA guanine quadruplexes, a process involved in eluding the defensive response of the host thus favoring viral infection of human cells. The results obtained evidence two stable binding modes with guanine quadruplexes, driven either by electrostatic (dimeric mode) or by dispersion (monomeric mode) interactions, are proposed being the dimeric mode the preferred one, according to the analysis of the corresponding free energy surfaces. The effect of these binding modes in stabilizing the protein dimer was also assessed, being related to its biological role in assisting SARS viruses to bypass the host protective response. This work also constitutes a first step of the possible rational design of efficient therapeutic agents aiming at perturbing the interaction between SARS Unique Domain and guanine quadruplexes, hence enhancing the host defenses against the virus.

Figure 1) Schematic representation of the mRNA function in a) a healthy cell and b) an infected cell by coronavirus. Panel b) showcases the influence of viral SUD binding to G4 sequences of mRNA that encodes crucial proteins for the apoptosis/cell survival regulation and other signaling paths.

In the manuscript, the software tools employed in this MD study are described as below:

… Both protein and RNA have been described with the amber force field including the bsc1 corrections, and the MD simulations have been performed in the constant pressure and temperature ensemble (NPT) at 300K and 1 atm. All MD simulations have been performed using the NAMD code and analyzed via VMD, the G4 structure has also been analyzed with the 3DNA suite.

I am glad that 3DNA has played a role in the analysis of G-quadruplexes in this timely contribution. In particular, I would like to draw the attention of the community to 3DNA-DSSR, which has a brand-new module dedicated to the automatic identification and comprehensive characterization of G-quadruplexes. The DSSR-annotated G-quadruplexes from the PDB should be of great interest to a wide audience, especially experimentalists. As a concrete example, the authors noted that “The crystal structure … of the oligonucleotide (pdb 1J8G) have been chosen coherently with the experimental work performed by Tan et al”. Follow the link to see results of DSSR-derived G-quadruplex features in PDB entry 1J8G and you are guaranteed to see features not available elsewhere.

Note added on July 9, 2020: This paper has been published in J. Phys. Chem. Lett. 2020, 11, 5661−5667.



SNAP for the analysis of TF-DNA complexes containing 5-methyl-cytosines

The Kribelbauer et al. article, Toward a mechanistic understanding of DNA methylation readout by transcription factors, has recently been published in the Journal of Molecular Biology (JMB). I am honored to be on the author list, and I learned a lot during the process. For the project, I added the --methyl-C (short-form: --5mc) option to SNAP (v1.0.6-2019sep30) for the automatic identification and annotation of DNA-transcription factor (TF) complexes containing 5-methyl-cytosine (5mC). The results are presented in a dynamic table, easily accessible at URL, and summarized in Fig. 1 “Structural basis of how TFs recognize methylated DNA” (see below) of the JMB paper.

Fig. 1. Structural basis of how TFs recognize methylated DNA

Details on the SNAP-enabled curation of TF-DNA complexes containing 5mC from atomic coordinates in the Protein Data Bank (PDB) are available in a tutorial page at In essence, the process can be easily understood via a concrete example with PDB id 4m9e, as shown below.

x3dna-snap --methyl-C --type=base -i=4m9e.pdb -o=4m9e-5mC.out

Here the --methyl-C option is specific for 5mC-DNA, and --type=base ensures that at least one DNA base atom is contacting protein amino acid(s). If these conditions are fulfilled, SNAP would produce two additional 5mC-related files, apart from the normal output file (i.e., 4m9e-5mC.out, as specified in the example):

  • 4m9e-5mC.txt — a simple text file with the following contents:
4m9e:B.5CM5: stacking-with-A.ARG443 is-WC-paired is-in-duplex [+]:GcG/cGC
4m9e:C.5CM5: other-contacts is-WC-paired is-in-duplex [-]:cGT/AcG
  • 4m9e-5mC.pdb — a corresponding PDB file, potentially multi-model, two as in this case. Moreover, the cluster of interacting residues (DNA nucleotides and protein amino acids) is oriented in the standard base reference frame of 5mC, allowing for easy comparison and direct overlap of multiple clusters.
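For downstream scripting, the two lines of 4m9e-5mC.txt shown above can be split into fields with a few lines of Python. The sketch below is based only on those two sample lines; the field names (pdb_id, nt, annotations, strand, context) are my own labels, not terms defined by SNAP, and other entries may contain variations this parser does not anticipate.

```python
def parse_5mc_line(line: str) -> dict:
    """Split one line of the SNAP 5mC text file into labeled fields."""
    tokens = line.split()
    # First token looks like "4m9e:B.5CM5:" (PDB id, then nucleotide id).
    pdb_id, nt_id = tokens[0].rstrip(":").split(":", 1)
    # Last token looks like "[+]:GcG/cGC" (bracketed sign, then base context).
    strand, context = tokens[-1].split(":", 1)
    return {
        "pdb_id": pdb_id,
        "nt": nt_id,
        "annotations": tokens[1:-1],  # e.g. contact type, pairing, duplex
        "strand": strand.strip("[]"),
        "context": context,
    }


SAMPLE = """\
4m9e:B.5CM5: stacking-with-A.ARG443 is-WC-paired is-in-duplex [+]:GcG/cGC
4m9e:C.5CM5: other-contacts is-WC-paired is-in-duplex [-]:cGT/AcG
"""

if __name__ == "__main__":
    for raw in SAMPLE.splitlines():
        print(parse_5mc_line(raw))
```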

In practice, SNAP needs to take care of many details for the automatic identification and annotation of 5mC-DNA-TF complexes directly from PDB entries. For example, 5mC in DNA is designated 5CM and the 5-methyl carbon atom is named C5A in the PDB (see the blogpost 5CM and 5MC, two forms of 5-methylcytosine in the PDB). Moreover, the --type=base option is employed to ensure that base atoms (regardless of sugar-phosphate atoms) of 5mC are directly involved in interactions with amino acids.
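The naming detail above (residue 5CM, methyl carbon C5A) is easy to check in a PDB-format file. The following sketch scans fixed-column ATOM/HETATM records for that residue/atom combination. It is a simplified illustration, not how SNAP does it; the helper make_atom_line only produces synthetic, truncated records for the demo, with no real coordinates.

```python
def find_5mc_residues(pdb_lines, resname="5CM", methyl_atom="C5A"):
    """Return sorted (chain, residue number) pairs that carry the methyl atom."""
    found = set()
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 26:
            continue
        atom = line[12:16].strip()    # atom name, PDB columns 13-16
        res = line[17:20].strip()     # residue name, columns 18-20
        chain = line[21]              # chain identifier, column 22
        resseq = line[22:26].strip()  # residue number, columns 23-26
        if res == resname and atom == methyl_atom:
            found.add((chain, int(resseq)))
    return sorted(found)


def make_atom_line(record, serial, atom, res, chain, resseq):
    """Build a synthetic, truncated PDB record (demo only, no coordinates)."""
    return f"{record:<6}{serial:>5} {atom:<4} {res:>3} {chain}{resseq:>4}"


if __name__ == "__main__":
    demo = [
        make_atom_line("HETATM", 1, "C5A", "5CM", "B", 5),
        make_atom_line("ATOM", 2, "C5", "DC", "B", 6),
        make_atom_line("HETATM", 3, "C5A", "5CM", "C", 5),
    ]
    print(find_5mc_residues(demo))
```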

It is also worth noting the combined use of DSSR for the generation of molecular images (rendered with PyMOL), as shown below. Here the DSSR options --block-file=fill-hbond (fill to fill base rings and hbond to draw hydrogen bonds) and --cartoon-block=sticks-label are used. The 3DNA DSSR/SNAP combo is a unique and powerful toolset for structural bioinformatics, as demonstrated in DNAproDB from the Rohs lab (see my blogpost SNAP and DSSR in DNAproDB). The JMB paper represents yet another example. I can only expect to see more combined DSSR/SNAP applications in the future.

DSSR-PyMOL image for PDB id: 4m9e



3DNA-DSSR is linked in the G4-society website

A couple of months ago, I came across the homepage of the newly-established G4 Society on G-quadruplexes (G4s). I checked the “Online tools” section and found a few links to G4 databases and sequence-based prediction programs (e.g., G4Hunter). No tools, however, were listed for G4 identification and characterization from 3D atomic coordinates such as those deposited in the Protein Data Bank (PDB). So I filled out the contact form and provided a brief description of 3DNA-DSSR, including a link to the website of G4s auto-curated with DSSR from the PDB.

I’ve recently visited the G4-society website again. I am pleased to see that 3DNA-DSSR is now listed under Online tools as a “program for detections/annotations of G4 from atomic coordinates in PDB or PDBx/mmCIF format”. The G4 module of 3DNA-DSSR has been created to streamline the identification and annotation of 3D structures of G4s. The collection of G4s in the PDB, available at, is updated weekly. It represents a unique resource for the G4 community. Hopefully, its value will be more widely appreciated thanks to the link from the G4-society website.

At the G4-society homepage, I noticed the following two items in the “News” section (on December 13, 2019):

The Quadruplex Meeting Report

Meeting report: Seventh International Meeting on Quadruplex Nucleic Acids (Changchun, P.R. China, September 6–9, 2019) written by Jean-Louis Mergny. Reading through the report, I noticed the following:

Jonathan B. Chaires (U. Louisville, KY, USA) provided an overview and historical perspective of the quadruplex field in his inaugural lecture. As of August 2019, the quadruplex field gathers 8467 articles and 253,174 citations in the Science Citation Index. Over 200 G4 structures are available in the PDB.

I do not know how the survey of G4s in the PDB was performed. Based on my data, the number of G4 structures in the PDB was already over 300 as of August 2019. As of December 11, 2019, the number is 329. Importantly, the PDB-G4 website compiled using 3DNA-DSSR contains not only citation information but also detailed annotations and schematic images not available elsewhere. Here are a few recent examples:

  • PDB id: 6ge1 — “Unraveling the structural basis for the exceptional stability of RNA G-quadruplexes capped by a uridine tetrad at the 3’ terminus.” by Andralojc et al. in RNA (2019).
  • PDB id: 6gh0 — “Two-quartet kit* G-quadruplex is formed via double-stranded pre-folded structure.” by Kotar et al. in Nucleic Acids Res. (2019).
  • PDB id: 6e8u — “Structure and functional reselection of the Mango-III fluorogenic RNA aptamer.” by Trachman et al. in Nat. Chem. Biol. (2019).
  • PDB id: 6ac7 —“Structure of a (3+1) hybrid G-quadruplex in the PARP1 promoter.” by Sengar et al. in Nucleic Acids Res. (2019).

The Important Paper

A guide to computational methods for G-quadruplex prediction by Emilia Puig Lombardi and Arturo Londoño-Vallejo in Nucleic Acids Res. (2019), which presents an updated overview of G4 prediction algorithms. I am impressed by the large number of sequence-based G4 prediction software tools, including the most recent G4-iM Grinder. Nevertheless, as noted by the authors in the concluding remarks, “All computational G-quadruplex prediction approaches have their drawbacks and limitations despite the recent advances in the field and the introduction of validation steps based on experimental data.”

The G4 module in 3DNA-DSSR belongs to a completely different category of software tool. It does not ‘predict’ G4 propensity/stability from a base sequence, but identifies and annotates G4s in a 3D atomic coordinate file. It complements sequence-based prediction tools by providing insights into 3D G4 structures that can help refine folding rules and improve the performance of those tools. To my knowledge, the 3D G4 structures contain features that are not captured by any of the sequence-based prediction tools.

While reading the review article, I found Fig. 1 informative (see below). The right side of Fig. 1A shows a “cartoon representation of the Oxytricha telomeric DNA G4 crystal structure (PDB accession 1JPQ (112))” using PyMOL. In comparison, the cartoon-block image auto-generated via 3DNA-DSSR and PyMOL for PDB id: 1jpq is shown at the bottom. The DSSR-PyMOL version is obviously different from that illustrated in Fig. 1A, and presumably simpler and more informative.

Figure 1. From guanines to G-quadruplexes

3DNA-DSSR cartoon-block schematic for PDB entry 1jpq, rendered with PyMOL



3DNA/blocview-PyMOL images in covers of the RNA journal

I recently performed a quick survey of the cover images of the RNA journal in 2019. I was pleased to find that 9 out of the 12 cover images were provided by the Nucleic Acid Database where 3DNA/blocview and PyMOL were employed, as noted below:

The RNA backbone is displayed as a red ribbon; bases are shown as blocks with NDB coloring: A—red, C—yellow, G—green, U—cyan; geneticin ligands are shown in spacefill with element colors: C—white, N—blue, O—red. The image was generated using 3DNA/blocview and PyMol software.

Details of the 9 cover images are listed below:

  1. January 2019 Rhodobacter sphaeroides Argonaute with guide RNA/target DNA duplex containing noncanonical A-G pair (PDB code: 6d9k)
  2. April 2019 Group I self-splicing intron P4-P6 domain mutant U131A (PDB code: 6d8l)
  3. May 2019 Crystal structure of T. thermophilus 50S ribosomal protein L1 in complex with helices H76, H77, and H78 of 23S RNA (PDB code: 5npm)
  4. June 2019 Crystal structure of ykoY-mntP riboswitch chimera bound to cadmium (PDB code: 6cc3)
  5. July 2019 G96A mutant of the PRPP riboswitch from T. mathranii bound to ppGpp (PDB code: 6ck4)
  6. August 2019 Crystal structure of the metY SAM V riboswitch (PDB code: 6fz0)
  7. October 2019 Crystal structure of protease factor Xa bound to RNA aptamer 11F7t and rivaroxaban (PDB code: 5vof)
  8. November 2019 Drosophila melanogaster nucleosome remodeling complex (PDB code: 6f4g)
  9. December 2019 Crystal structure of the Homo Sapiens cytoplasmic ribosomal decoding site in complex with Geneticin (PDB code: 5xz1)

Here is the composite figure of the 9 cover images.

3DNA/blocview-PyMOL cartoon-block schematics in the covers of the RNA journal in 2019




Web API to 3DNA

I’ve created a web API to DSSR and SNAP, and fiber models. The overall help message is available via Individually, each program is accessed as below.

Help message on x3dna-dssr (DSSR):

Usage with 'http' (HTTPie):
    http -f [options] url=|model@
    http   -- display this help message

    json=true-or-FALSE(default)    [e.g., json=true # JSON output]
    pair=true-or-FALSE(default)    [e.g., pair=1    # base-pair only]
    hbond=true-or-FALSE(default)   [e.g., hbond=t   # H-bonding info]
    more=true-or-FALSE(default)    [e.g., more=y    # further details]

Required parameter:
    url=URL-to-coordinate-file [e.g., url=]
    model@coordinate-file      [e.g., model@1ehz.cif]
    # Only one must be specified. 'url' precedes 'model' when both are specified.
    # The coordinate file must be in PDB or PDBx/mmCIF format, optionally gzipped.

    http -f url=
    http -f model@1ehz.cif pair=1
    # with 'curl'
    curl -F 'url='
    curl -F 'model=@1msy.pdb' -F 'pair=1'

    The web API has an upper limit on coordinate file size (gzipped): < 6 MB
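For scripted use, the rules in the help message above (one of url or model is required, url takes precedence, boolean options default to false) can be mirrored on the client side before a request is sent. The sketch below only assembles form fields and the matching curl -F arguments; it makes no network call, and the endpoint address is deliberately left to the caller. The function names are hypothetical.

```python
BOOLEAN_OPTIONS = ("json", "pair", "hbond", "more")  # as listed in the help message


def build_dssr_form(url=None, model=None, **options):
    """Assemble form fields for the DSSR web API from the documented parameters.

    url   -- URL to a coordinate file
    model -- path to a local coordinate file (sent as an upload)
    """
    unknown = set(options) - set(BOOLEAN_OPTIONS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    if not url and not model:
        raise ValueError("either 'url' or 'model' must be specified")
    fields = {}
    if url:
        fields["url"] = url  # 'url' precedes 'model' when both are given
    else:
        fields["model"] = "@" + model  # curl-style file upload
    for name in BOOLEAN_OPTIONS:
        if options.get(name):
            fields[name] = "true"  # options left out keep their FALSE default
    return fields


def as_curl_args(fields):
    """Turn form fields into a list of 'curl -F' arguments (endpoint not included)."""
    return [arg for key, value in fields.items() for arg in ("-F", f"{key}={value}")]


if __name__ == "__main__":
    print(as_curl_args(build_dssr_form(model="1msy.pdb", pair=True)))
```

The caller would append the actual endpoint to the argument list (or pass the fields to an HTTP library of choice), and should keep the 6 MB gzipped file-size limit in mind.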

Help message on x3dna-snap (SNAP):

Usage with 'http' (HTTPie):
    http -f [options] url=|model@
    http   -- display this help message

    json=true-or-FALSE(default)    [e.g., json=true # JSON output]
    hbond=true-or-FALSE(default)   [e.g., hbond=t   # H-bonding info]

Required parameter:
    url=URL-to-coordinate-file [e.g., url=]
    model@coordinate-file      [e.g., model@1oct.cif]
    # Only one must be specified. 'url' precedes 'model' when both are specified.
    # The coordinate file must be in PDB or PDBx/mmCIF format, optionally gzipped.

    http -f url=
    http -f model@1oct.cif json=1
    # with 'curl'
    curl -F 'url='
    curl -F 'model=@1oct.cif' -F 'json=1'

    The web API has an upper limit on coordinate file size (gzipped): < 6 MB

Help message on 56 fiber models:

Usage with 'http' (HTTPie):
    http    # display this help message
    http    # show a list of available fiber models (56 in total)
    http  # build model 'str_id' in the range of [1, 56]
    http    # generate a model with common names as shown below:
              A-DNA, B-dna, C_DNA, D-DNA, ZDNA, RNA, RNAduplex, PaulingTriplex, G4
              Case does not matter, and the separator can be '-' or '_' or omitted.
              So a-dna, A-dNA, a_DNA, or ADNA is valid for building an A-DNA model.

Options (via query strings, or form fields):
    seq=base-sequence # A, C, G, T, U for generic model
    repeat=number     # number of repeats of the sequence
    cif=1             # output file in mmCIF format

Examples with 'http' (HTTPie):
    http       # model no. 1 (i.e., calf thymus A-DNA model)
    http -f seq=A3TTT repeat=2  # specific sequence, repeated twice
    http     # single-stranded RNA model
    http  # double-stranded RNA model
    http # the triplex model of Pauling & Corey
    http      # G-quadruplex model
    # with 'curl'
    curl -d 'seq=A3TTT' -d 'repeat=2'

    The web API has two upper limits: repeats < 1,000, and nucleotides < 10,000.
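The name-matching rule stated in the fiber help message (case does not matter; the separator can be '-' or '_' or omitted) can be sketched as a small normalization function. This is an illustration of the rule as described, not the server's code; the canonical spellings in COMMON_NAMES are my own choice, and resolve_model is a hypothetical helper.

```python
# Common names from the help message, written here in one canonical spelling.
COMMON_NAMES = [
    "A-DNA", "B-DNA", "C-DNA", "D-DNA", "Z-DNA",
    "RNA", "RNAduplex", "PaulingTriplex", "G4",
]


def _key(name: str) -> str:
    """Drop '-' and '_' separators and ignore case."""
    return name.replace("-", "").replace("_", "").upper()


_LOOKUP = {_key(name): name for name in COMMON_NAMES}


def resolve_model(name: str):
    """Map user input to a model number (1..56) or a canonical common name."""
    key = _key(name.strip())
    if key.isdigit():
        number = int(key)
        if 1 <= number <= 56:
            return number
        raise ValueError("model number must be in the range [1, 56]")
    try:
        return _LOOKUP[key]
    except KeyError:
        raise ValueError(f"unknown fiber model name: {name!r}") from None


if __name__ == "__main__":
    for text in ("a-dna", "A-dNA", "a_DNA", "ADNA", "rna_duplex", "17"):
        print(text, "->", resolve_model(text))
```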



DSSR-enhanced visualization of nucleic acid structures in PyMOL

The website (see screenshot below) aims to showcase DSSR-enabled cartoon-block schematics of nucleic acid structures using PyMOL. It presents a simple interface to browse pre-calculated PDB entries with a set of default settings: long rectangular blocks for Watson-Crick base-pairs, square blocks for G-tetrads in G-quadruplexes, with minor-groove edges in black. Users can also specify a URL to a PDB- or mmCIF-formatted file or upload such an atomic coordinate file directly, and set several common options to customize the rendered image.

Moreover, a web API to DSSR-PyMOL schematics is available to allow for its easy integration into third-party tools.

Screenshot of the homepage of DSSR/PyMOL schematics

Input a PDB id

Pre-calculated cartoon-block images together with summary information are available for PDB entries of nucleic-acid-containing structures. Note that gigantic structures like ribosomes that are only represented in mmCIF format are excluded from the resource. The base block images are most effective for small to medium-sized structures.

Here are a few examples:

  • 1ehz, the crystal structure of yeast phenylalanine tRNA at 1.93-Å resolution
  • 2lx1, the major conformation of the internal loop 5’GAGU/3’UGAG
  • 2grb, the crystal structure of an RNA quadruplex containing inosine-tetrad
  • 4da3, the crystal structure of an intramolecular human telomeric DNA G-quadruplex 21-mer bound by the naphthalene diimide compound MM41
  • 1oct, crystal structure of the Oct-1 POU domain bound to an octamer site
  • 2hoj, the crystal structure of an E. coli thi-box riboswitch bound to thiamine pyrophosphate, manganese ions

Each entry is shown with images in six orthogonal perspectives: front, back, right, left, top, bottom. The ‘front’ image (upper-left in the panel) is oriented into the most-extended view with the DSSR --blocview option. The corresponding PyMOL session file and PDB coordinate file are available for download. One can also visualize the structure interactively via 3Dmol.js.

Sample PDB entries

Users can browse random samples of pre-calculated PDB entries. The number should be between 3 and 99, with a default of 12 entries (see below for an example). Simply click the ‘Submit’ button or the “Random samples (3 to 99)” link to see a new set of randomly picked PDB entries (12 by default) each time.

Specify a coordinate file

The atomic coordinate file must be in PDB or mmCIF format, and can be optionally gzipped (.gz). One can either specify a URL to a coordinate file or select one for upload. Several common options are available to allow for user customizations.

Web API help message

Usage with 'http' (HTTPie):
    http -f [options] url=|model@
    http  -- for a pre-calculated PDB entry
    http        -- display this help message
    block_file=styles-in-free-text-format [e.g., block_file=wc-minor]
    block_color=nt-selection-and-color    [e.g., block_color='A:pink']
    block_depth=thickness-of-base-block   [e.g., block_depth=1.2]
    r3d_file=true-or-FALSE(default)       [e.g., r3d_file=true]
    raw_xyz=true-or-FALSE(default)        [e.g., raw_xyz=true]
Required parameter
    url=URL-to-coordinate-file [e.g., url=]
    model@coordinate-file      [e.g., model@1ehz.cif]
    # Only one must be specified. 'url' precedes 'model' when both are specified.
    # The coordinate file must be in PDB or PDBx/mmCIF format, optionally gzipped.
    http -f block_file='wc-minor' model@1ehz.cif r3d_file=t
    http -f url= -d -o 1ehz.png
    http -d -o 1ehz.png
    # with 'curl'
    curl -F 'model=@1msy.pdb' -F 'block_file=wc-minor' -F 'r3d_file=1'
    curl -F 'url=' -o 1ehz.png
    curl -o 1ehz.png

Sample images




While reading DNAproDB: an expanded database and web-based tool for structural analysis of DNA–protein complexes, I noticed SNAP and DSSR being mentioned. The detailed citations are as below:

Information about individual nucleotide–residue interactions is also provided, such as hydrogen bonding, interaction geometry (based on SNAP (10)), buried solvent accessible surface areas and identification of the interacting residue and nucleotide moieties …

DNAproDB assigns a geometry for every nucleotide–residue interaction identified using SNAP, a component of the 3DNA program suite (10). The residues for which probabilities are shown are those with planar side chains so that a stacking conformation can be defined.

Base pairing and base stacking between nucleotides are identified using the program DSSR (20).

SNAP and DSSR are two (relatively) new programs in the 3DNA software suite. As the author, I am always glad to see them being cited explicitly in literature. The fact that SNAP and DSSR are cited together by DNAproDB, however, is especially significant. I am aware of the initial version of DNAproDB, but I definitely like the updated one better. This is what I recently wrote in response to a question on the 3DNA Forum:

Regarding DNA-protein interactions in general, you may want to have a look of DNAproDB from the Remo Rohs laboratory. A new paper has just been published in NAR, ‘DNAproDB: an expanded database and web-based tool for structural analysis of DNA–protein complexes’.

I’ve no doubt that SNAP and DSSR would be widely used in applications related to DNA/RNA structural bioinformatics. DSSR (to a lesser extent, SNAP) represents my view of what a scientific software tool should be.



ONZ classification of G-tetrads

Recently I read the article Topology-based classification of tetrads and quadruplex structures in Bioinformatics by Popenda et al. In this work, the authors proposed an ONZ classification scheme of G-tetrads in intramolecular G-quadruplexes (G4) as shown below (Fig. 2 in the publication):

ONZ classification of G-tetrads in intramolecular G-quadruplexes

I am glad to find that DSSR has been used as a component in their computational tool ElTetrado to automatically identify and classify tetrads and quadruplexes.

Structures from both sets were analysed using self-implemented programs along with DSSR software from the 3DNA suite (Lu et al. (2015)). From DSSR, we acquired the information about base pairs and stacking.

I like the ONZ classification scheme: it is simple in concept yet provides a new perspective for the topologies of G-tetrads in intramolecular G4 structures. So I implemented the idea in DSSR v1.9.8-2019oct16, with this feature available via the --g4-onz option. Note that ElTetrado, according to the authors, is applicable to ONZ classifications of general types of tetrads and quadruplexes. The DSSR implementation of ONZ classifications, on the other hand, is strictly limited to G-tetrads in intramolecular G4 structures.

The DSSR ONZ classification results match the ones reported in Figs. 1, 5, and 6 of the Popenda et al. paper. For example, for PDB entry 6H1K (Fig. 6), the relevant results with the --g4-onz option and without it are listed below:

# x3dna-dssr -i=6h1k.pdb --g4-onz
List of 3 G-tetrads
   1 glyco-bond=s--- groove=w--n planarity=0.149 type=planar Z- nts=4 GGGG A.DG1,A.DG20,A.DG16,A.DG27
   2 glyco-bond=-sss groove=w--n planarity=0.136 type=planar Z+ nts=4 GGGG A.DG2,A.DG19,A.DG15,A.DG26
   3 glyco-bond=--s- groove=-wn- planarity=0.307 type=other  O+ nts=4 GGGG A.DG17,A.DG21,A.DG25,A.DG28
# ---------------------------------------
# x3dna-dssr -i=6h1k.pdb 
#   without option --g4-onz
List of 3 G-tetrads
   1 glyco-bond=s--- groove=w--n planarity=0.149 type=planar nts=4 GGGG A.DG1,A.DG20,A.DG16,A.DG27
   2 glyco-bond=-sss groove=w--n planarity=0.136 type=planar nts=4 GGGG A.DG2,A.DG19,A.DG15,A.DG26
   3 glyco-bond=--s- groove=-wn- planarity=0.307 type=other  nts=4 GGGG A.DG17,A.DG21,A.DG25,A.DG28
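
For downstream scripting, the ONZ label can also be pulled out of the text listing above. The following Python sketch is only an illustration: the function name parse_onz and the regular expression are assumptions derived from the three lines shown here, not a documented DSSR text format, and the JSON output remains the more robust route.

```python
import re

# Lines copied from the --g4-onz listing above (PDB entry 6H1K).
DSSR_OUTPUT = """\
   1 glyco-bond=s--- groove=w--n planarity=0.149 type=planar Z- nts=4 GGGG A.DG1,A.DG20,A.DG16,A.DG27
   2 glyco-bond=-sss groove=w--n planarity=0.136 type=planar Z+ nts=4 GGGG A.DG2,A.DG19,A.DG15,A.DG26
   3 glyco-bond=--s- groove=-wn- planarity=0.307 type=other  O+ nts=4 GGGG A.DG17,A.DG21,A.DG25,A.DG28
"""

# Assumed layout: tetrad index first, then the ONZ label (O, N or Z plus a
# sign) immediately before "nts=", then the base sequence and nucleotide ids.
TETRAD_RE = re.compile(
    r"^\s*(\d+)\s.*?\s([ONZ][+-])\s+nts=\d+\s+\S+\s+(\S+)\s*$"
)


def parse_onz(text):
    """Return a list of (index, onz_label, nucleotide_ids) tuples."""
    results = []
    for line in text.splitlines():
        match = TETRAD_RE.match(line)
        if match:
            index, label, nts = match.groups()
            results.append((int(index), label, nts.split(",")))
    return results


for index, label, nts in parse_onz(DSSR_OUTPUT):
    print(index, label, ",".join(nts))
```

Lines produced without the --g4-onz option carry no ONZ label, so this pattern simply skips them.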

With the --json option, the ONZ classification results are always available. An example is shown below for PDB entry 6H1K (Fig. 6):

# x3dna-dssr -i=6h1k.pdb --json | jq -c '.G4tetrads[] | [.nts_long, .topo_class]'



H-bonds reported by DSSR and SNAP

I recently read a short communication by Pavel Afonine, titled phenix.hbond: a new tool for annotation hydrogen bonds in the July 2019 issue of the Computational Crystallography Newsletter (CCN). It appears that every bioinformatics tool (e.g., PyMOL or Jmol) has its own implementation of an algorithm for calculating H-bonds, one of the fundamental stabilizing forces of proteins and DNA/RNA structures. So does 3DNA/DSSR, as noted in my 2014-04-11 blogpost Get hydrogen bonds with DSSR.

Both DSSR and SNAP have the --get-hbond option, and they use the same underlying algorithm. However, the default output from the two programs differs: DSSR reports the H-bonds within nucleic acids, and SNAP covers only those at the DNA/RNA-protein interface. Using the PDB entry 1oct as an example, running DSSR on it with the --get-hbond option gives 33 H-bonds in the DNA duplex, while SNAP reports 38 H-bonds at the DNA-protein interface. By design, the default output caters for the most common use case of each program.

Behind the scenes, however, there exist variations in the seemingly simple --get-hbond option. One can attach the text ‘nucleic’ (or ‘nuc’, ‘nt’), as in --get-hbond-nucleic, to output H-bonds within nucleic acids. Similarly, --get-hbond-protein (or ‘amino’, ‘aa’) would output H-bonds within proteins. Not surprisingly, the --get-hbond-nt-aa option would list H-bonds in nucleic acids and proteins, including those at their interface. These variations apply to both DSSR and SNAP, even though some are redundant with the default.

Notably, in combination with --json, the --get-hbond option by default would output all H-bonds, as if --get-hbond-nt-aa has been set. For PDB entry 1oct, DSSR or SNAP would report 208 H-bonds. Moreover, the JSON output has a residue_pair field for each identified H-bond, with values like "nt:nt", "nt:aa", or "aa:aa". Using 1oct as an example,

# x3dna-dssr -i=1oct.pdb --get-hbond --json | jq '.hbonds[0]'
  "index": 1,
  "atom1_serNum": 34,
  "atom2_serNum": 608,
  "donAcc_type": "standard",
  "distance": 3.304,
  "atom1_id": "O6@A.DG202",
  "atom2_id": "N4@B.DC230",
  "atom_pair": "O:N",
  "residue_pair": "nt:nt"
# x3dna-dssr -i=1oct.pdb --get-hbond --json | jq '.hbonds[60]'
  "index": 61,
  "atom1_serNum": 462,
  "atom2_serNum": 1187,
  "donAcc_type": "standard",
  "distance": 3.692,
  "atom1_id": "O2@B.DT223",
  "atom2_id": "NH2@C.ARG102",
  "atom_pair": "O:N",
  "residue_pair": "nt:aa"
# x3dna-dssr -i=1oct.pdb --get-hbond --json | jq '.hbonds[100]'
  "index": 101,
  "atom1_serNum": 791,
  "atom2_serNum": 818,
  "donAcc_type": "standard",
  "distance": 2.871,
  "atom1_id": "N@C.THR26",
  "atom2_id": "OD2@C.ASP29",
  "atom_pair": "N:O",
  "residue_pair": "aa:aa"

In the above three cases, using SNAP instead of DSSR would give the same results.

Also, one can take advantage of the residue_pair value to filter H-bonds by type. For example, the following command would extract only H-bonds at the DNA-protein interface (38 occurrences, same as the number noted above):

x3dna-snap -i=1oct.pdb --get-hbond --json | jq '.hbonds[] | select(.residue_pair=="nt:aa")'
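
The same filtering can be done in any language with a JSON parser. Below is a minimal Python sketch that mirrors the jq filter above. It assumes only what the jq commands already show (a top-level hbonds array whose items carry a residue_pair field); the helper name select_hbonds is made up for illustration, and the sample data are just the three records listed above rather than the full 208-entry output.

```python
import json
from collections import Counter

# Three records copied from the jq output above (PDB entry 1oct). A real run
# would instead load the complete JSON written by x3dna-dssr or x3dna-snap.
SAMPLE_JSON = """
{"hbonds": [
  {"index": 1, "atom1_serNum": 34, "atom2_serNum": 608,
   "donAcc_type": "standard", "distance": 3.304,
   "atom1_id": "O6@A.DG202", "atom2_id": "N4@B.DC230",
   "atom_pair": "O:N", "residue_pair": "nt:nt"},
  {"index": 61, "atom1_serNum": 462, "atom2_serNum": 1187,
   "donAcc_type": "standard", "distance": 3.692,
   "atom1_id": "O2@B.DT223", "atom2_id": "NH2@C.ARG102",
   "atom_pair": "O:N", "residue_pair": "nt:aa"},
  {"index": 101, "atom1_serNum": 791, "atom2_serNum": 818,
   "donAcc_type": "standard", "distance": 2.871,
   "atom1_id": "N@C.THR26", "atom2_id": "OD2@C.ASP29",
   "atom_pair": "N:O", "residue_pair": "aa:aa"}
]}
"""


def select_hbonds(data, residue_pair):
    """Keep only the H-bonds whose residue_pair equals the given value."""
    return [hb for hb in data.get("hbonds", [])
            if hb.get("residue_pair") == residue_pair]


data = json.loads(SAMPLE_JSON)

# Equivalent of: jq '.hbonds[] | select(.residue_pair=="nt:aa")'
interface = select_hbonds(data, "nt:aa")
for hb in interface:
    print(hb["index"], hb["atom1_id"], hb["atom2_id"], hb["distance"])

# Tally of H-bonds per residue_pair category.
counts = Counter(hb["residue_pair"] for hb in data["hbonds"])
print(dict(counts))
```

With the full output loaded in place of SAMPLE_JSON, the length of the "nt:aa" list should correspond to the interface count quoted above.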

Back to the phenix.hbond tool, the author noted that:

Running phenix.hbond requires atomic model in PDB or mmCIF format with all hydrogen atoms added, as well as ligand restraint files if the model contains unknown to the library items.

While there is no particular reason why this should not work for all bio-macromolecules, currently phenix.hbond is only optimized and tested to work with proteins, which is the limitation that will be removed in future.

In contrast, the H-bond identification algorithm in DSSR/SNAP does not require hydrogen atoms. In fact, hydrogen atoms are simply ignored if they exist. As shown above, the H-bond method as implemented in DSSR/SNAP works for DNA, RNA, protein, or their complexes. This does not necessarily mean that the 3DNA way is superior to other similar tools. It just works well in my hands, and it may serve as a pragmatic choice for other users.



DSSR is used in RNAMake and 3dRNA 2.0

Recently I noticed two new citations to DSSR, an integrated software tool for dissecting the spatial structure of RNA. One is from the Yesselman et al. article Computational design of three-dimensional RNA structure and function in Nature Nanotechnology, and the other is from the Wang et al. article 3dRNA v2.0: An Updated Web Server for RNA 3D Structure Prediction in International Journal of Molecular Sciences.

Yesselman et al. have used DSSR in RNAMake for building the motif library. The relevant section is as follows:

We processed each RNA structure to extract every motif with Dissecting the Spatial Structure of RNA (DSSR)54 with the following command:

x3dna-dssr -i file.pdb -o file_dssr.out

We manually checked each extracted motif to confirm that it was the correct type, as DSSR sometimes classifies tertiary contacts as higher-order junctions and vice versa. For each motif collected from DSSR, we ran the X3DNA find_pair and analyze programs to determine the reference frame for the first and last base pair of each motif to allow for the alignment between motifs:

find_pair file.pdb 2> /dev/null stdout | analyze stdin >& /dev/null

It is worth noting the sentence that “DSSR sometimes classifies tertiary contacts as higher-order junctions and vice versa.” Presumably, the authors are referring to the inclusion of ‘isolated canonical pairs’ in junctions by default in DSSR. Overall, the default DSSR settings follow the most common practice in the RNA literature. At the same time, I am aware that the community may not agree on every detail. Thus DSSR provides many options (documented or otherwise) to cater for other potential use cases. See the Stems of junction structure have only one base pair and Junction definition threads on the 3DNA Forum for two examples. In the long run, DSSR is likely to help consolidate RNA nomenclature that can be applied in a pragmatic, consistent manner.

Note also that DSSR provides the reference frame of each identified base pair via the JSON option. Using 1ehz as an example, the following command provides detailed information about base pairs:

x3dna-dssr -i=1ehz.pdb --json --more | jq .pairs

In the 3dRNA 2.0 paper, DSSR is cited as follows. This is the first time DSSR has been integrated into the 3dRNA pipeline.

The predicted structures are built from the sequence and secondary structure, while the former is obtained from their native structures fetched from PDB (, and the latter is calculated from DSSR (Dissecting the Spatial Structure of RNA) [39].



5CM and 5MC, two forms of 5-methylcytosine in the PDB

In the PDB, the ligand identifiers 5CM and 5MC both refer to 5-methylcytosine, but differ in the sugar moiety the base is attached to. Chemically, 5CM is 5-methyl-2’-deoxycytidine-5’-monophosphate as in DNA, and 5MC is 5-methylcytidine-5’-monophosphate as in RNA. See the molecular images shown below.

Two forms of 5-methylcytosine in PDB: 5CM and 5MC

The 5-methyl group is named C5A in 5CM and CM5 in 5MC, respectively, for non-obvious reasons other than conventions. For comparison, the methyl-group in thymine of DNA is named C7, as for example in PDB id 355d. It is worth noting that DSSR is able to handle all such variations in atom or residue names.
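
For scripts that need to locate the methyl carbon in such residues, the naming differences can be captured in a small lookup table. The Python sketch below is purely illustrative: the table METHYL_ATOM_NAME and the helper methyl_atom are hypothetical names that simply encode the atom names quoted above, and they say nothing about how DSSR handles these variations internally.

```python
# Methyl-carbon atom names quoted in the text above. This lookup table is a
# hypothetical helper for user scripts, not a description of DSSR internals.
METHYL_ATOM_NAME = {
    "5CM": "C5A",  # 5-methyl-2'-deoxycytidine-5'-monophosphate (DNA)
    "5MC": "CM5",  # 5-methylcytidine-5'-monophosphate (RNA)
    "DT": "C7",    # thymine in DNA, e.g. PDB id 355d
}


def methyl_atom(residue_name):
    """Return the methyl-carbon atom name for a residue, or None if unknown."""
    return METHYL_ATOM_NAME.get(residue_name.strip().upper())


for name in ("5CM", "5MC", "DT", "DC"):
    print(name, methyl_atom(name))
```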



R wrapper to DSSR in VeriNA3d

I recently came across a Bioinformatics article VeriNA3d: an R package for nucleic acids data mining by Gallego et al. from IRB Barcelona. VeriNA3d can perform dataset analysis, single-structure analysis, and exploratory data analyses, with an emphasis on complex RNA structures. I am glad to see that DSSR is one of the third-party utilities that have been integrated into VeriNA3d, as shown below:

VeriNA3d offers integration with third-party utilities such as the non-redundant lists of RNA structures (Leontis and Zirbel, 2012), the eRMSD suggested to compare RNA structures (Bottaro et al., 2014), a wrapper to the DSSR (Dissecting the Spatial Structure of RNA) software (Lu et al., 2015) and query functions to access the PDBe REST API (Velankar et al., 2016).

I browsed the GitLab repository and read through the supplemental documents. Clearly, VeriNA3d is a handy tool for the R community to perform RNA 3D structural analyses.

To DSSR users, Section “9 The dssr wrapper: getting the base pairs” of the supplemental PDF “VeriNA3d: introduction and use cases” is particularly relevant. The three paragraphs (with minor edits) are excerpted below:

The DSSR software (Dissecting the Spatial Structure of RNA) (Lu, Bussemaker, and Olson 2015) represents an invaluable resource to handle RNA structures. Some of the functions of veriNA3d overlap with the functionalities of DSSR, and both applications provide unique different features. We implement a wrapper to execute DSSR directly from R and get the best of both worlds in one place.

Note that installing veriNA3d does not automatically install DSSR, since we don’t redistribute third-party software. Before any user can use our wrapper, the dssr function, DSSR should be installed separately. To address this installation we redirect you to the DSSR manual, where anyone can find the specific instructions for their system. Once DSSR is installed and working in your computer, you will also be able to use it with our wrapper. If the DSSR executable (named x3dna-dssr) is in your path, dssr will find it automatically. If the wrapper does not find it, you can still use it specifying the absolute path to the executable with the argument exefile. Find more information running ?dssr.

One of the DSSR capabilities that users might be interested in is the detection and classification of base pairs. The following code shows a simple example. The output of the dssr wrapper is an object got from the json DSSR output. From R, json objects are parsed in the form of a tree of lists, with different types of information. Most of the interesting data is under the list models, sublist parameters, as shown herein.

I echo the authors’ policy of not redistributing third-party software with VeriNA3d. DSSR is under active development. Users should always visit the 3DNA Forum for downloading the latest version of DSSR, reporting bugs, and asking questions.

The R interface to DSSR (via JSON output) in VeriNA3d represents one of the intended use cases among DSSR’s many possible applications. No doubt DSSR is being increasingly integrated into other resources of RNA structural bioinformatics. Hopefully, more advanced DSSR features (beyond the detection and classification of base pairs) will also be widely appreciated in the future. Users are likely to appreciate DSSR more as they gain experience in structural bioinformatics.



Web 3DNA 2.0 is highlighted in the NAR'19 web server cover

It is a great pleasure to see that our article Web 3DNA 2.0 for the analysis, visualization, and modeling of 3D nucleic acid structures has been highlighted on the cover page of the web server issue of NAR’19. According to the editor, 331 proposals were submitted this year and 122, or 37%, were approved for manuscript submission. Of those approved, 94, or 77%, were ultimately accepted for publication. Overall, that corresponds to a ~28% acceptance rate.

The cover image and its caption are shown below. Moreover, details on how the cover image was created are available on the 3DNA Forum.

Web 3DNA 2.0 highlighted in the cover of the NAR'19 webserver issue

Caption: Examples of customized molecular models that can be generated with 3DNA: (top) a chromatin-like, nucleosome-decorated DNA with the structures of known histone-DNA assemblies placed at user-defined binding sites; (lower left) molecular schematic of a DNA trinucleotide diphosphate illustrating the base planes and reference frames used to construct and analyze 3D nucleic acid-containing structures; (lower right) customized single-stranded tRNA model built from a user-defined base sequence and a set of rigid-body parameters describing the desired placement of successive bases. Color code of base blocks: A, red; C, yellow; G, green; T, blue; U, cyan.



DNA conformational changes play a force-generating role during bacteriophage genome packaging

A paper titled DNA Conformational Changes Play a Force-Generating Role during Bacteriophage Genome Packaging has just been officially published in the Biophysical Journal (Volume 116, Issue 11, P2172-2180, June 04, 2019). I am glad to have the opportunity to collaborate with Kim Sharp, Gino Cingolani and Stephen Harvey on this interesting project that has big implications in understanding the mechanism of bacteriophage genome packaging. The abstract of the paper is shown below:

Motors that move DNA, or that move along DNA, play essential roles in DNA replication, transcription, recombination, and chromosome segregation. The mechanisms by which these DNA translocases operate remain largely unknown. Some double-stranded DNA (dsDNA) viruses use an ATP-dependent motor to drive DNA into preformed capsids. These include several human pathogens as well as dsDNA bacteriophages—viruses that infect bacteria. We previously proposed that DNA is not a passive substrate of bacteriophage packaging motors but is instead an active component of the machinery. We carried out computational studies on dsDNA in the channels of viral portal proteins, and they reveal DNA conformational changes consistent with that hypothesis. dsDNA becomes longer (“stretched”) in regions of high negative electrostatic potential and shorter (“scrunched”) in regions of high positive potential. These results suggest a mechanism that electrostatically couples the energy released by ATP hydrolysis to DNA translocation: The chemical cycle of ATP binding, hydrolysis, and product release drives a cycle of protein conformational changes. This produces changes in the electrostatic potential in the channel through the portal, and these drive cyclic changes in the length of dsDNA as the phosphate groups respond to the protein’s electrostatic potential. The DNA motions are captured by a coordinated protein-DNA grip-and-release cycle to produce DNA translocation. In short, the ATPase, portal, and dsDNA work synergistically to promote genome packaging.

Significantly, our work is highlighted in a “New and Notable” article, May the Road Rise to Meet You: DNA Deformation May Drive DNA Translocation by Paul Jardine (Volume 116, Issue 11, Pages 2060-2061, 4 June 2019):

Regardless of what drives conformational change in the portal, the idea that the linear DNA substrate is deformed in a way that makes it an energetic participant in its own movement opens new possibilities for how motors work. Large paddling or rotational motions by motor components may not be required if linear motion can be achieved by stretching or compressing the linear substrate, with rectified, cyclic conformational changes in the DNA rather than lever motions doing the work. If borne out by experiments, further simulation, and more structural information, this proposed mechanism may require a reappraisal of how we think about translocating motors.

For this project, I developed the x3dna-search program to survey similar fragments of single-stranded or double helical structures in the PDB.



The article on G.A pairs in ACS Biochemistry

After many years of effort, it is a great pleasure to see our paper Effects of Noncanonical Base Pairing on RNA Folding: Structural Context and Spatial Arrangements of G·A Pairs published in ACS Biochemistry. The abstract is shown below:

Noncanonical base pairs play important roles in assembling the three-dimensional structures critical to the diverse functions of RNA. These associations contribute to the looped segments that intersperse the canonical double-helical elements within folded, globular RNA molecules. They stitch together various structural elements, serve as recognition elements for other molecules, and act as sites of intrinsic stiffness or deformability. This work takes advantage of new software (DSSR) designed to streamline the analysis and annotation of RNA three-dimensional structures. The multiscale structural information gathered for individual molecules, combined with the growing number of unique, well-resolved RNA structures, makes it possible to examine the collective features deeply and to uncover previously unrecognized patterns of chain organization. Here we focus on a subset of noncanonical base pairs involving guanine and adenine and the links between their modes of association, secondary structural context, and contributions to tertiary folding. The rigorous descriptions of base-pair geometry that we employ facilitate characterization of recurrent geometric motifs and the structural settings in which these arrangements occur. Moreover, the numerical parameters hint at the natural motions of the interacting bases and the pathways likely to connect different spatial forms. We draw attention to higher-order multiplexes involving two or more G·A pairs and the roles these associations appear to play in bridging different secondary structural units. The collective data reveal pairing propensities in base organization, secondary structural context, and deformability and serve as a starting point for further multiscale investigations and/or simulations of RNA folding.

Sample G.A pair characterized by DSSR

This work represents a multifaceted, fundamental application enabled by DSSR. Even at the base-pair (bp) level, DSSR provides unique features that complement the Leontis-Westhof (LW) notation of 12 geometric types.

At the review stage, we were asked by a referee to comment on the differences between DSSR and LW on bp classifications. The following paragraph in the “DISCUSSION” section of the paper is our response, expanded on the original writing that focused on DSSR’s capabilities:

Qualitative descriptions of noncanonical RNA base pairing, pioneered by Leontis and Westhof9,41 and linked in this work to the rigid-body parameters of interacting bases, have proven valuable in deciphering the connections between RNA primary, secondary, and tertiary structures. The present categorization is based on the positions of the hydrogen-bonded atoms with respect to a standard, embedded base reference frame30 defined in terms of an idealized Watson−Crick base pair. The major- and minor-groove base edges used here correspond in most cases to what are termed the Hoogsteen and sugar edges in the Leontis−Westhof scheme (one can compare the two classification schemes in Table S2). The + and − symbols introduced in 3DNA24 and DSSR27 unambiguously distinguish the relative orientations of the two bases. The trans and cis designations used in the earlier literature, however, are qualitative in nature and often uncertain. There are many “nc” (near cis, as in ncWW) and “nt” (near trans, as in ntSH) annotations listed in the RNA Structure Atlas; see, for example, the base-pair interactions in the sarcin−ricin domain of E. coli 23S rRNA found by entering PDB entry 1msy at The assignment of qualitative descriptors of RNA associations on the basis of atomic identity alone is generally not clear-cut. Numerical differences in the rigid-body parameters are critical to differentiating pairing schemes that share a common hydrogen bond, e.g., the G(N3)···A(N6) interaction found in m−WII and m−MI arrangements of G and A (Table 1 and Figures 4 and S3). The numerical data also provide a basis for following conformational transitions and may potentially be of value in making functional and other meaningful distinctions among RNA base pairs.

See also a recent thread Noncanonical base pair standards on the 3DNA Forum and the section titled “3.2.2 Base pairs” in the DSSR User Manual.



Web 3DNA 2.0 paper published in NAR

It is a great pleasure to announce the publication of Web 3DNA 2.0 for the analysis, visualization, and modeling of 3D nucleic acid structures in Nucleic Acids Research (NAR). The paper will appear in the web server issue of NAR in July 2019. At nine pages in length and with several new structural parameters, this w3DNA 2.0 paper is certainly not a typical NAR web-server publication. It represents a significant contribution to the field of 3D nucleic acid structural bioinformatics, and will undoubtedly push the popularity of 3DNA to a new level.

The abstract is shown below:

Web 3DNA (w3DNA) 2.0 is a significantly enhanced version of the widely used w3DNA server for the analysis, visualization, and modeling of 3D nucleic-acid-containing structures. Since its initial release in 2009, the w3DNA server has continuously served the community by making commonly-used features of the 3DNA suite of command-line programs readily accessible. However, due to the lack of updates, w3DNA has clearly shown its age in terms of modern web technologies and it has long lagged behind further developments of 3DNA per se. The w3DNA 2.0 server presented here overcomes all known shortcomings of w3DNA while maintaining its battle-tested characteristics. Technically, w3DNA 2.0 implements a simple and intuitive interface (with sensible defaults) for increased usability, and it complies with HTML5 web standards for broad accessibility. Featurewise, w3DNA 2.0 employs the most recent version of 3DNA, enhanced with many new functionalities, including: the automatic handling of modified nucleotides; a set of ‘simple’ base-pair and step parameters for qualitative characterization of non-Watson–Crick double-helical structures; new structural parameters that integrate the rigid base plane and the backbone phosphate group, the two nucleic acid components most reliably determined with X-ray crystallography; in silico base mutations that preserve the backbone geometry; and a notably improved module for building models of single-stranded RNA, double-helical DNA, Pauling triplex, G-quadruplex, or DNA structures ‘decorated’ with proteins. The w3DNA 2.0 server is freely available, without registration, at

Moreover, details on reproducing our reported results are available in a dedicated section ‘web 3DNA 2.0 (’ on the 3DNA Forum.

Graphical abstract of web 3DNA 2.0



3DNA blocview image in the cover of the RNA journal

While browsing the June 2019 issue of the RNA journal, I was surprised to see a cover image with familiar schematic representations:

Crystal structure of ykoY-mntP riboswitch chimera bound to cadmium

The caption is as below:

Crystal structure of ykoY-mntP riboswitch chimera bound to cadmium (Protein Data Bank code: 6cc3; Bachas ST, Ferré-D’Amaré AR. 2018. Convergent use of heptacoordination for cation selectivity by RNA and protein metalloregulators. Cell Chem Biol 25: 962–973.e5). The RNA backbone is displayed as a red ribbon; bases are shown as blocks with NDB coloring: A—red, C—yellow, G—green, U—cyan; cadmium ions are shown as red spheres. The image was generated using 3DNA/blocview and PyMol software. Cover image provided by the Nucleic Acid Database (

In addition to the blocview script distributed with 3DNA v2.x, the block-view has been integrated into DSSR via the --blocview option. Notably, the DSSR-plugin introduces the dssr_block command to PyMOL for interactive visualization of nucleic acid structures. See the DSSR User Manual for more information.



DSSR is used for the analysis of CRISPR PbuCas13b-crRNA

Recently I read with great interest the article High-Resolution Structure of Cas13b and Biochemical Characterization of RNA Targeting and Cleavage by Slaymaker et al., published in Cell Reports (2019, 26, 3741–3751). This 1.65-Å structure (PDB id: 6dtd) “provides a mechanistic model for Cas13b target RNA recognition and identifies features responsible for target and cleavage specificity.”

I am pleased to see that DSSR is listed in the “KEY RESOURCES TABLE” under the category “Software and Algorithms”, and mentioned in the “Structure Analysis” section:

RNA structure was analyzed using DSSR (Lu et al., 2015). Protein conservation mapping to the structure was done using the Consurf server (Ashkenazy et al., 2016). Protein secondary structure was analyzed using the PDBSUM webserver (de Beer et al., 2014) (Figure S1E). APBS as part of the PyMOL visualization program was used to calculate electrostatics (Jurrus et al., 2018). Structure validation statistics were generated with MolProbity (Chen et al., 2010)

In the main text, the authors cited DSSR for the detection of a base multiplet. Running DSSR on PDB entry 6dtd, I found two base triplets, as shown below:

In the figure above, each of the two adenines is interacting with a G–C pair in the minor-groove edge (m) of the pair: A30 (left) is using its Watson-Crick edge (W), whilst A23 (right) is employing its major-groove edge (M). Thus they do not belong to the canonical A-minor motifs (types I or II) where the minor-groove edge of A interacts with the minor-groove edge of a WC pair. In DSSR, they are classified as type=X, a general category of noncanonical A-minor motifs.



DSSR on PDB entry 6neb

Via the PDB weekly update, I recently came across PDB entry 6neb, which was solved by NMR and is described as an “MYC promoter G-quadruplex with 1:6:1 loop length”. I downloaded the atomic coordinates of the entry and ran DSSR on it. Indeed, DSSR readily identifies a three-layered parallel G-quadruplex (G4) with three propeller-type loops of 1, 6 and 1 nucleotides (i.e., 1:6:1), as shown below.

List of 1 G4-stem
  Note: a G4-stem is defined as a G4-helix with backbone connectivity.
        Bulges are also allowed along each of the four strands.
  stem#1[#1] layers=3 INTRA-molecular loops=3 descriptor=3(-P-P-P) note=parallel(4+0) UUUU parallel
   1  glyco-bond=---- groove=---- WC-->Major nts=4 GGGG A.DG3,A.DG7,A.DG16,A.DG20
      pm(>>,forward)  area=14.54 rise=3.36 twist=24.7
   2  glyco-bond=---- groove=---- WC-->Major nts=4 GGGG A.DG4,A.DG8,A.DG17,A.DG21
      pm(>>,forward)  area=9.67  rise=3.48 twist=30.3
   3  glyco-bond=---- groove=---- WC-->Major nts=4 GGGG A.DG5,A.DG9,A.DG18,A.DG22
    strand#1  U DNA glyco-bond=--- nts=3 GGG A.DG3,A.DG4,A.DG5
    strand#2  U DNA glyco-bond=--- nts=3 GGG A.DG7,A.DG8,A.DG9
    strand#3  U DNA glyco-bond=--- nts=3 GGG A.DG16,A.DG17,A.DG18
    strand#4  U DNA glyco-bond=--- nts=3 GGG A.DG20,A.DG21,A.DG22
    loop#1 type=propeller strands=[#1,#2] nts=1 A A.DA6
    loop#2 type=propeller strands=[#2,#3] nts=6 TTTTAA A.DT10,A.DT11,A.DT12,A.DT13,A.DA14,A.DA15
    loop#3 type=propeller strands=[#3,#4] nts=1 T A.DT19
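
The 1:6:1 loop-length descriptor can be derived directly from the three loop lines in the listing above. The Python sketch below is a minimal illustration under stated assumptions: the regular expression and the function name loop_summary are guesses fitted to these three lines, not a documented DSSR text format.

```python
import re

# Loop lines copied from the G4-stem listing above (PDB entry 6neb).
G4_STEM_LOOPS = """\
    loop#1 type=propeller strands=[#1,#2] nts=1 A A.DA6
    loop#2 type=propeller strands=[#2,#3] nts=6 TTTTAA A.DT10,A.DT11,A.DT12,A.DT13,A.DA14,A.DA15
    loop#3 type=propeller strands=[#3,#4] nts=1 T A.DT19
"""

# Assumed layout: loop number, loop type, connected strands, nucleotide
# count, then the loop sequence.
LOOP_RE = re.compile(
    r"^\s*loop#(\d+)\s+type=(\S+)\s+strands=\S+\s+nts=(\d+)\s+(\S+)"
)


def loop_summary(text):
    """Return (loop_type, length, sequence) for each loop line found."""
    loops = []
    for line in text.splitlines():
        match = LOOP_RE.match(line)
        if match:
            loops.append((match.group(2), int(match.group(3)), match.group(4)))
    return loops


loops = loop_summary(G4_STEM_LOOPS)
descriptor = ":".join(str(length) for _, length, _ in loops)
print(descriptor)                                   # 1:6:1
print(all(kind == "propeller" for kind, _, _ in loops))  # True
```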

I then read the associated paper titled Solution Structure of a MYC Promoter G-Quadruplex with 1:6:1 Loop Length, recently published in the new, open-access ACS Omega journal. The reported structure 6neb has a 27-nt sequence (termed Myc1245) of bases 5’-TTGGGGAGGGTTTTAAGGGTGGGGAAT-3’. Myc1245 is based on the 27-nt long, purine-rich MycPu27, which has five G-tracts of the G4-forming motif within the MYC promoter. In Myc1245, the third G-tract of MycPu27 has been replaced by TTTA, so only G-tracts 1, 2, 4, and 5 are used for G4 formation. Previously, it was shown that Myc2345 (using G-tracts 2-5 of MycPu27) adopts a parallel G4 structure with three propeller loops of 1:2:1 nt length.

The MycPu27 sequence is representative of the G4-forming nuclease hypersensitive element (NHE III1) within the promoter region of the MYC oncogene. Formation of G4 structures suppresses MYC transcription, thus ligand-induced G4 stabilization at the DNA level is a promising strategy for cancer therapy. The NHE III1 motif can fold into multiple G4 structures depending on factors such as protein binding. The paper on 6neb illustrates that nucleolin, a protein shown to bind MYC G4 and repress MYC transcription, preferentially binds the 1:6:1 loop-length conformer over the 1:2:1 conformer (the major form under physiological conditions).

The DSSR analysis of 6neb shows that the two G-tetrad steps have different overlapping areas and twist angles. The top step comprising G3 and G4 (Fig. 1A) has better stacking interactions (14.5 Å²) and smaller twist (25°) than the bottom step containing G4 and G5 (9.7 Å² and 30°, respectively).

area=14.54 rise=3.36 twist=24.7
area=9.67  rise=3.48 twist=30.3
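
The comparison between the two steps can be made explicit with a few lines of Python. This is a minimal sketch that parses the two key=value lines quoted above; the helper name parse_step is an assumption introduced for illustration.

```python
import re

# The two step-parameter lines quoted above (top step first, then bottom).
STEP_LINES = """\
area=14.54 rise=3.36 twist=24.7
area=9.67  rise=3.48 twist=30.3
"""


def parse_step(line):
    """Turn 'key=value' tokens on one line into a dict of floats."""
    pairs = re.findall(r"(\w+)=([-+]?\d+(?:\.\d+)?)", line)
    return {key: float(value) for key, value in pairs}


steps = [parse_step(line) for line in STEP_LINES.splitlines() if line.strip()]
top, bottom = steps

# Differences in overlap area (square angstroms) and twist (degrees).
area_diff = round(top["area"] - bottom["area"], 2)
twist_diff = round(bottom["twist"] - top["twist"], 1)
print(area_diff)   # 4.87
print(twist_diff)  # 5.6
```

The top step thus has roughly 4.9 Å² more overlap and about 5.6° less twist than the bottom step.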

The analysis characterizes the T1–A15 pair as a reverse Hoogsteen pair (rHoogsteen), which is distinct from the T+A Hoogsteen pair. In DSSR, the rHoogsteen pair is of type M–N (anti-parallel), whilst the Hoogsteen pair is of type M+N (parallel). With the local base-reference frames attached (Fig. 1B), it is easy to visualize that the z-axis of T1 is pointing out of the base-pair plane, and the z-axis of A15 is pointing inwards. See also my blog post Hoogsteen and reverse Hoogsteen base pairs.

Fig. 1C shows the ATG-triad automatically identified by DSSR. As is clear in Fig. 1A, the ATG-triad stacks on the 3-layered G4 structure on the 5’ side. Moreover, with color-coded base blocks (G in green, T in blue, and A in red), the two stacks (T10–T11, and T12–T13–A14) in the 6-nt central propeller loop are immediately obvious.

Figure 1. DSSR-derived structural features in PDB entry 6neb. The images were created using DSSR and PyMOL.

In the 6neb paper, the authors stated that “The central loop of 6 nt connects the outer tetrads by spanning the G-core. A single nucleotide is the minimal length of this structural motif, so the five additional residues can significantly increase the loop’s conformational flexibility.” (p.2536) It is worth noting that in PDB entry 2m53, described in G-rich VEGF aptamer with locked and unlocked nucleic acid modifications exhibits a unique G-quadruplex fold, it was observed that:

An unprecedented all parallel-stranded monomeric G-quadruplex with three G-quartet planes exhibits several unique structural features. Five consecutive guanine residues are all involved in G-quartet formation and occupy positions in adjacent DNA strands, which are bridged with a no-residue propeller-type loop.

The G4 structure is polymorphic. It seems every imaginable or even unexpected form is possible, depending on the context.



Non-G base tetrads

In addition to the well-known G-tetrad serving as the building block of G-quadruplexes (G4), other types of homogeneous or heterogeneous base-tetrads are also possible. In DSSR, all these base tetrads are generally termed multiplets where three or more bases associate in a co-planar fashion via H-bonding interactions.

In the context of G4 structures, U-tetrads are the most common. Fig. 1A shows an example of U-tetrads in PDB entry 4rne reported in the paper titled Structural Variations and Solvent Structure of r(UGGGGU) Quadruplexes Stabilized by Sr2+ Ions. In the structure (Fig. 1B), two terminal U-tetrads cap the six-layered G4 structure in the middle. The four U’s in the U-tetrad are paired in parallel orientation (i.e., U+U), just as the G+G pairs in the G-tetrad of G4 structures (Fig. 1C). On the other hand, there is only one H-bond (O4…N3) in the U+U pair of the U-tetrad, in contrast to the two H-bonds in the G+G pair of the G-tetrad (Fig. 1C). In PDB entry 4rne, DSSR also detects two octads where the middle G-tetrad is surrounded by four U’s in anti-parallel orientation (G’s filled in green vs. U’s empty, see Fig. 1C for an example).

Similarly oriented C-tetrads (C+C pair, Fig. 1D) or A-tetrads (A+A pair, Fig. 1E) are also possible. PDB entry “6a85”, associated with the paper High-resolution DNA quadruplex structure containing all the A-, G-, C-, T-tetrads, reported a high-resolution crystal structure of sequence 5'-AGAGAGATGGGTGCGTT-3' which contains all the homogeneous A-, G-, C-, T-tetrads, and the heterogeneous A:T:A:T tetrads. As of this writing (Feb. 19, 2019), the status for the PDB entry “6a85” is still “HPUB” (‘processing complete, entry on hold until publication’) even though the paper was published several months ago. Using mutate_bases in 3DNA, I generated a C-tetrad and an A-tetrad as shown in Fig. 1D and 1E. Like the U-tetrad (Fig. 1A), the C- and A-tetrads also have only one H-bond in their M+N type pairs. The G-tetrad, with two H-bonds in its connecting pairs, is more stable than the other homogeneous base tetrads, leading to widespread G4 structures.

In the homogeneous base tetrads shown in Fig. 1A-E, pairs are of the parallel M+N type and the bases are associated via their Watson-Crick and major-groove (Hoogsteen) edges. Two canonical (WC or G—U wobble) pairs can also associate via their minor-groove edges, as seen in PDB entries 2hk4 and 2lsx. Fig. 1F gives an example with two G—U wobble pairs (of anti-parallel M—N type, filled U in blue vs. empty G) in PDB entry 2lsx reported in the paper titled A minimal i-motif stabilized by minor groove G:T:G:T tetrads.

DSSR-derived non-G base tetrads
Figure 1. Non-G base tetrads automatically identified or modeled by 3DNA-DSSR. The images were created using DSSR and PyMOL.



G-tetrad and pseudo G-tetrads

A G-quadruplex (G4) is composed of stacked G-tetrads, in each of which four guanines form four G•G pairs in a circular, planar fashion. Specifically, the G•G pairs of the G-tetrad (see Fig. 1A below) in G4 are of type M+N according to 3DNA/DSSR: i.e., G+G with the local z-axes of pairing guanines in parallel. Moreover, the G+G pair is uniquely quantified by three base-pair parameters: Shear, Stretch, and Opening with mean values [+1.6 Å, +3.5 Å, –90°] or [–1.6 Å, –3.5 Å, +90°], corresponding to the cWH (cW+M) or cHW (cM+W) types of LW (DSSR) classifications, respectively. This pair is numbered VI in the list of 28 base pairs with two or more H-bonds between base atoms, compiled by Saenger.
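The two sets of mean values quoted above suggest a simple way to tell the cWH and cHW forms apart. The following is only a sketch under that assumption: the function name and the scaled-distance score are made up for illustration, not how DSSR assigns the types, and the result is meaningful only for a pair already known to be a G+G pair of a G-tetrad.

```python
# Mean (Shear, Stretch, Opening) values quoted in the text, in angstroms,
# angstroms and degrees.
REFERENCE = {
    "cWH": (1.6, 3.5, -90.0),
    "cHW": (-1.6, -3.5, 90.0),
}


def classify_gg(shear, stretch, opening):
    """Return the label whose mean values are closest to the given pair.

    Each parameter is scaled by the magnitude of its reference value so
    that angstroms and degrees contribute comparably (an illustrative
    choice, not a DSSR rule).
    """
    values = (shear, stretch, opening)

    def score(ref):
        return sum(((v - r) / abs(r)) ** 2 for v, r in zip(values, ref))

    return min(REFERENCE, key=lambda label: score(REFERENCE[label]))


print(classify_gg(1.5, 3.4, -88.0))
print(classify_gg(-1.7, -3.6, 92.0))
```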

In addition to the standard G-tetrad configuration as normally seen in G4 structures, a so-called pseudo-G-tetrad form (see Fig. 1B below) is reported in a 2013 paper titled Duplex-quadruplex motifs in a peculiar structural organization cooperatively contribute to thrombin binding of a DNA aptamer (PDB entry 4i7y). In a 2017 publication from the same group, Through-bond effects in the ternary complexes of thrombin sandwiched by two DNA aptamers, another form of pseudo-G-tetrad (Fig. 1C) is found in PDB entries 5ew1 and 5ew2.

Clearly, pseudo-G-tetrads are very different from the normal G-tetrad, in terms of base pairing patterns. The G-tetrad is highly regular with the same type of G+G pairs, with the O6 atoms pointing to the middle of the circle. The two pseudo-G-tetrads are less regular, and they differ from each other as well, by flipping G12 from syn (Fig. 1B) to trans (Fig. 1C).

These distinctions stand out even more by filling the up-face (+z-axis outwards) of a guanine base in green while leaving the down-face (+z-axis inwards) empty (G5 in Fig. 1B, G5 and G12 in Fig. 1C). So in G-tetrad (Fig. 1A), all four guanines have their positive z-axis point towards the viewer, corresponding to all four G+G pairs. In one pseudo-G-tetrad (Fig. 1B), G5 has its positive z-axis pointing away from the viewer. So G5–G7 and G5–G16 pairs are of the M–N type. The other type of pseudo-G-tetrad (Fig. 1C) has the opposite orientation for G12. Finally, Fig. 1D shows schematically PDB entry 4i7y where the G-tetrad and a pseudo-G-tetrad are directly stacked, creating a two-layered pseudo-G-quadruplex.

DSSR-derived G-tetrads
Figure 1. (A) G-tetrad, (B-C) two types of pseudo-G-tetrads, and (D) the complex of a DNA aptamer with thrombin. G-tetrads were automatically identified by 3DNA-DSSR. The images were created using DSSR and PyMOL.



Starting DNA or RNA structures

A starting structure of suitable sequence is a prerequisite for many applications, including downstream use in X-ray crystallography, NMR, and molecular dynamics (MD) simulations. Browsing through the literature, I’ve noticed the following tools for such a purpose.

  • The make-na server, a web-based automated tool for making nucleic acid helices powered by NAB. It supports abasic sites via the underscore character (_). According to the help page, “The structure file represents the abasic as the 3-letter code ‘3DR’ in DNA strands and ‘ N’ in RNA strands. These are Protein Data Bank conventions.” An example input is shown below:
  • The NAB itself, and the standalone fd_helix.c program which supports 6 fiber-based models of DNA or RNA.
  • The NUCGEN program from the Bansal group. “The NUCGEN software generates double helical models with the backbone fixed in B-form DNA, but with appropriate modifications in the input data, it can also generate A-form DNA and RNA duplex structures.”
  • 3DNA and its web interface. The ‘rebuild’ program can be used for constructing customized, single or duplex DNA/RNA structures based on a set of base-pair and step (helical) parameters. Moreover, the sugar-phosphate backbone in A-, B- or RNA conformation is allowed. The ‘fiber’ program incorporates a comprehensive list of 56 regular models, based mostly on fiber diffraction data. The list includes single, duplex, triplex, quadruplex, DNA, RNA structures or their hybrids. Notably, the classic Pauling’s triplex model is also available. The 3DNA web 2.0 makes these model-building features readily accessible to a large user base.

Overall, each of the tools listed above has its unique features and may fit better for different applications. It is to the benefit of the user community to have a choice.



Misc tips and tricks

Nowadays, I’ve become used to Google searches as a quick way to solve problems. Once in a while, I come across a tip or trick that fixes an issue at hand and then move on. However, I may later on run into a similar problem, but only vaguely remember how I solved it previously. So I’d need to start googling around again. This list is a remedy for such situations, and it will be continuously updated. While the list is created for my own reference, it may also be useful to other viewers of the post, presumably reaching here via Google.

  • icdiff — show diff with color
  • scc — strip C comments
  • Taskwarrior (task) — manage TODO list from terminal
  • httpie as a replacement of curl and wget
  • byebug and pry for debugging Ruby
  • ag to search for PATTERN in source files, replacing grep
  • fd to find files and directories
  • bat to view files with syntax highlighting (in place of cat)
  • exa as an alternative to ls
  • bench to benchmark code
  • asciinema and svg-term to record terminal activity as an SVG animation. Another option is termtosvg. Moreover, the trio ttyrec, ttyplay, ttygif can record and play terminal screen recordings, and convert them into smooth GIFs
  • wrk to benchmark HTTP APIs
  • hub — git wrapper for GitHub
  • tail -n +2 to skip the first line (starting from the second line)
  • sudo -i -u user_id, the -i or --login option invokes login shell
  • Understanding Shell Script’s idiom: 2>&1 — redirect ‘stderr’ to ‘stdout’ via ’2>&1’ in bash shell.
  • Ruby one-liners
    • ruby -pi.bak -e "gsub(/SOME_PATTERN/, 'other_text')" files for global replacement of SOME_PATTERN by other_text in files
    • ruby -pe 'gsub(/_/, ".")' globally replace ‘_’ with ‘.’



Identification of nucleotides

Nucleic acid structural bioinformatics starts with the identification of nucleotides (nts) from atomic coordinates. As biopolymers, RNA and DNA have standard IUPAC names of atoms for the five bases (see the Figure below), sugars (ending with prime, e.g., C1’, O2’), and the phosphate (P, OP1, and OP2). The atomic coordinates (in PDB or mmCIF format) from the Protein Data Bank (PDB) follow the convention.

Standard bases, with names

Trained as a chemist, I am aware that the bases are aromatic, heterocyclic compounds (purines and pyrimidines). Moreover, the five standard bases (A, C, G, T, and U) also share a six-membered ring, with atoms named consecutively (N1, C2, N3, C4, C5, C6). This special feature can be employed to identify nts automatically, from PDB atomic coordinates. The ring skeleton is not influenced by protonation states, tautomeric forms, or modifications in base, sugar or phosphate. Early versions of 3DNA (up to v2.0) used only N1, C2, and C6 atoms to identify an nt: with an additional N9 atom the nt is taken as a purine, otherwise as a pyrimidine. In 3DNA v2.3 and DSSR, the procedure has been refined to take advantage of all available ring atoms. It is thus more robust against distortions and still works even when any of the N1, C2, C6, or N9 atoms are mutated or missing. This blog post provides further technical details on how the method works.
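The early, name-based scheme is simple enough to write down in a few lines. This is a sketch of the rule as worded in this post, not code taken from 3DNA; the function name is made up for illustration.

```python
def classify_by_names(atom_names):
    """Name-based check as described for early 3DNA versions.

    N1, C2 and C6 must all be present for a residue to count as a
    nucleotide; an additional N9 makes it a purine, otherwise it is
    taken as a pyrimidine. Returns None when the residue is not
    recognized.
    """
    names = {name.strip() for name in atom_names}
    if not {"N1", "C2", "C6"} <= names:
        return None
    return "purine" if "N9" in names else "pyrimidine"


guanine = ["N9", "C8", "N7", "C5", "C6", "O6", "N1", "C2", "N2", "N3", "C4"]
uracil = ["N1", "C2", "O2", "N3", "C4", "O4", "C5", "C6"]
print(classify_by_names(guanine))
print(classify_by_names(uracil))
```

A residue with only N7, C8, and N9 is not recognized by this rule, which is the kind of case the refined, fit-based procedure is meant to handle.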

The template used to identify nts is a purine, with nine base ring atoms. Purine is chosen since it contains atoms of the six-membered ring and N7, C8, and N9. Its atomic coordinates in PDB format are shown below. The coordinates are taken from ‘G’ in the standard reference frame ($X3DNA/config/Atomic_G.pdb). Using ‘A’ as reference won’t make any difference since the RMSD between them is only 0.038 Å.

ATOM      1  N9    G A   1      -1.289   4.551   0.000  1.00  0.00           N
ATOM      2  C8    G A   1       0.023   4.962   0.000  1.00  0.00           C
ATOM      3  N7    G A   1       0.870   3.969   0.000  1.00  0.00           N
ATOM      4  C5    G A   1       0.071   2.833   0.000  1.00  0.00           C
ATOM      5  C6    G A   1       0.424   1.460   0.000  1.00  0.00           C
ATOM      6  N1    G A   1      -0.700   0.641   0.000  1.00  0.00           N
ATOM      7  C2    G A   1      -1.999   1.087   0.000  1.00  0.00           C
ATOM      8  N3    G A   1      -2.342   2.364   0.001  1.00  0.00           N
ATOM      9  C4    G A   1      -1.265   3.177   0.000  1.00  0.00           C

The nt-identification process begins with a mapping of at least three atoms in the purine, followed by a least-squares fit between corresponding atoms. For the five standard bases and most modified ones, the RMSD is normally less than 0.12 Å, as seen in the Figure below. Even the non-planar dihydrouridine in tRNA, whose C5 and C6 atoms are saturated, has an RMSD of less than 0.25 Å: for the yeast phenylalanine tRNA (PDB id: 1ehz), for example, it is 0.205 Å for H2U-16, and 0.226 Å for H2U-17. DSSR uses a cutoff of 0.28 Å, covering essentially all nucleotides in the PDB. As an extreme case, the DA1 residue on chain T of PDB id 4ki4 has only three base atoms: N7, C8, and N9 (i.e., no atoms from the six-membered ring). With an RMSD of only 0.005 Å, DSSR still takes it as an nt, properly assigned as ‘A’.

Molecular dynamics (MD) simulations sometimes produce heavily distorted bases, whose RMSD exceeds the default cutoff. Users may change the cutoff to a larger value to accommodate such unusual cases.
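To make the fit-and-cutoff idea concrete, here is a minimal pure-Python sketch. It is not DSSR's implementation. The template coordinates and the 0.28 Å cutoff come from the text above. The function names, the quaternion-based superposition (Horn's method, with the largest eigenvalue found by power iteration), and the dict-based input are assumptions made for illustration. Extra filters, such as the bonding checks used to reject ligands, are left out.

```python
import math

CUTOFF = 0.28  # RMSD cutoff in angstroms, as quoted in the text

# Nine purine ring atoms of the 'G' template listed above.
TEMPLATE = {
    "N9": (-1.289, 4.551, 0.000), "C8": (0.023, 4.962, 0.000),
    "N7": (0.870, 3.969, 0.000), "C5": (0.071, 2.833, 0.000),
    "C6": (0.424, 1.460, 0.000), "N1": (-0.700, 0.641, 0.000),
    "C2": (-1.999, 1.087, 0.000), "N3": (-2.342, 2.364, 0.001),
    "C4": (-1.265, 3.177, 0.000),
}


def fit_rmsd(a, b):
    """RMSD between paired points after optimal rigid-body superposition."""
    n = len(a)
    ca = [sum(p[i] for p in a) / n for i in range(3)]
    cb = [sum(p[i] for p in b) / n for i in range(3)]
    a = [[p[i] - ca[i] for i in range(3)] for p in a]
    b = [[p[i] - cb[i] for i in range(3)] for p in b]
    g = sum(x * x for p in a + b for x in p)  # sum of squared norms
    if g == 0.0:
        return 0.0
    # Correlation matrix S[i][j] = sum over atoms of a[i] * b[j].
    s = [[sum(p[i] * q[j] for p, q in zip(a, b)) for j in range(3)]
         for i in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    # Horn's symmetric 4x4 matrix; its largest eigenvalue gives the best fit.
    k = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    shift = g / 2.0  # makes the shifted matrix positive semi-definite
    v = [0.8, 0.4, 0.4, 0.2]  # arbitrary unit start vector
    for _ in range(300):  # power iteration on (k + shift * I)
        w = [sum(k[i][j] * v[j] for j in range(4)) + shift * v[i]
             for i in range(4)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    lam = sum(v[i] * k[i][j] * v[j] for i in range(4) for j in range(4))
    return math.sqrt(max(0.0, (g - 2.0 * lam) / n))


def looks_like_nt(atoms, cutoff=CUTOFF):
    """atoms: dict mapping atom name -> (x, y, z) for one residue."""
    common = [name for name in TEMPLATE if name in atoms]
    if len(common) < 3:  # at least three ring atoms must be mapped
        return False
    rmsd = fit_rmsd([TEMPLATE[m] for m in common], [atoms[m] for m in common])
    return rmsd <= cutoff
```

A caller would pass the named atoms of a single residue, for example `looks_like_nt({"N7": (...), "C8": (...), "N9": (...)})`. Raising `cutoff` mimics the option of loosening the threshold for distorted MD snapshots.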

Nucleotide identification in 3DNA-DSSR

In addition to dihydrouridine, the above Figure also shows pseudouridine (PSU), 1-methyladenosine (1MA), 4-thiouridine (4SU), and the heavily modified YYG in tRNA. They are all easily identified using the same scheme. Since the nt-identification method concentrates on base rings, modifications in sugar or the phosphate group do not pose any problem. For example, in tRNA 1ehz, DSSR also identifies O2’-methylguanosine (OMG) and O2’-methylcytidine (OMC) as modified nts.

Two special cases are worth mentioning. The ligand IMD in PDB id 1r8e has a five-membered ring. Its atoms are named similarly to those of an nt, and the fitted RMSD is only 0.29 Å. IMD can be filtered out because it lacks the C6 atom and has an N1—C5 covalent bond. The ligand SPM in PDB id 355d is a linear molecule, and its RMSD (1.86 Å) is clearly far too large for it to be taken as an nt.

Another particular case (of a different kind) is the abasic sites, especially in X-ray crystal structures in the PDB. By definition, abasic sites do not have base atoms available. Thus the described method is not applicable to their characterization as nts. As of v1.7.3-2017dec26, however, DSSR has also incorporated abasic sites into the analysis pipeline, by default. The program checks backbone linkage and residue name for appropriate nt assignment. The abasic sites could constitute part of (internal) loops which would otherwise be broken into segments by DSSR.

Overall, I feel confident to say that 3DNA-DSSR has practically solved the problem of identifying nts from atomic coordinates. The method detailed herein (and outlined in the DSSR paper) is simple and easy to understand/implement. Moreover, it has been extensively tested in real-world applications for well over a decade. I’ve yet to find a single case where it does not work as expected.



Mutations to 3-methyladenine

Recently, a 3DNA user asked on the Forum about how to perform mutations to 3-methyladenine. The user reported that the procedure described in the FAQ entry How can I mutate cytosine to 5-methylcytosine did not work for the case of 3-methyladenine. This ‘limitation’ is easily understandable: the 3DNA mutate_bases program must have knowledge of the target base, 3-methyladenine, to perform the mutation properly. The program works for the most common 5-methylcytosine mutations since the corresponding 5MC file (Atomic_5MC.pdb, in the standard base-reference frame) is already included within the 3DNA distribution. By supplying a similar file for the target base, mutate_bases runs the same for mutations to 3-methyladenine (or other bases). This blog post outlines the procedure, using 3-methyladenine as an example.

A ligand name search for 3-methyladenine on the RCSB PDB led to only two matched entries: 2X6F and 3MAG. The ligand id is 3MA. Since 3MAG has a better resolution (1.8 Å) than 2X6F (3.3 Å), its 3MA ligand was extracted from the corresponding PDB file (3MAG.pdb). The atomic coordinates, excluding those for the two hydrogens, are as below. Note that the 3-methyl carbon atom is named CN3.

HETATM 2960  N9  3MA A 600      16.587  14.258  22.170  1.00 49.87           N
HETATM 2961  C4  3MA A 600      17.123  13.100  21.622  1.00 50.46           C
HETATM 2962  N3  3MA A 600      16.877  11.811  22.009  1.00 50.37           N
HETATM 2963  CN3 3MA A 600      15.983  11.363  23.063  1.00 50.41           C
HETATM 2964  C2  3MA A 600      17.590  10.968  21.241  1.00 50.11           C
HETATM 2965  N1  3MA A 600      18.422  11.217  20.224  1.00 49.27           N
HETATM 2966  C6  3MA A 600      18.627  12.484  19.858  1.00 48.99           C
HETATM 2967  N6  3MA A 600      19.426  12.709  18.829  1.00 46.12           N
HETATM 2968  C5  3MA A 600      17.949  13.503  20.593  1.00 49.89           C
HETATM 2969  N7  3MA A 600      17.929  14.900  20.488  1.00 49.84           N
HETATM 2970  C8  3MA A 600      17.113  15.286  21.434  1.00 49.58           C

After running the 3DNA utility program std_base with options -fit -A, the corresponding atomic coordinates of 3MA are transformed to the standard base reference frame of adenine. The file must be named Atomic_3MA.pdb, and it has the following contents:

HETATM    1  N9  3MA A   1      -1.287   4.521   0.006  1.00 49.87           N
HETATM    2  C4  3MA A   1      -1.262   3.133   0.004  1.00 50.46           C
HETATM    3  N3  3MA A   1      -2.337   2.286  -0.009  1.00 50.37           N
HETATM    4  CN3 3MA A   1      -3.743   2.648  -0.047  1.00 50.41           C
HETATM    5  C2  3MA A   1      -1.905   1.013   0.001  1.00 50.11           C
HETATM    6  N1  3MA A   1      -0.662   0.520   0.004  1.00 49.27           N
HETATM    7  C6  3MA A   1       0.366   1.372  -0.003  1.00 48.99           C
HETATM    8  N6  3MA A   1       1.588   0.867  -0.034  1.00 46.12           N
HETATM    9  C5  3MA A   1       0.068   2.768   0.003  1.00 49.89           C
HETATM   10  N7  3MA A   1       0.875   3.914  -0.003  1.00 49.84           N
HETATM   11  C8  3MA A   1       0.026   4.909  -0.003  1.00 49.58           C

Note that in file Atomic_3MA.pdb, (1) the z-coordinates of the base atoms are close to zero, and (2) the ordering of atoms is the same as in the original 3MA ligand shown above.
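Point (1) is easy to verify with a few lines of Python. The sketch below reads the z-coordinate from the fixed columns of each record in the Atomic_3MA.pdb contents listed above and reports the atom that lies farthest from the base plane. The function name is made up for illustration.

```python
ATOMIC_3MA = """\
HETATM    1  N9  3MA A   1      -1.287   4.521   0.006  1.00 49.87           N
HETATM    2  C4  3MA A   1      -1.262   3.133   0.004  1.00 50.46           C
HETATM    3  N3  3MA A   1      -2.337   2.286  -0.009  1.00 50.37           N
HETATM    4  CN3 3MA A   1      -3.743   2.648  -0.047  1.00 50.41           C
HETATM    5  C2  3MA A   1      -1.905   1.013   0.001  1.00 50.11           C
HETATM    6  N1  3MA A   1      -0.662   0.520   0.004  1.00 49.27           N
HETATM    7  C6  3MA A   1       0.366   1.372  -0.003  1.00 48.99           C
HETATM    8  N6  3MA A   1       1.588   0.867  -0.034  1.00 46.12           N
HETATM    9  C5  3MA A   1       0.068   2.768   0.003  1.00 49.89           C
HETATM   10  N7  3MA A   1       0.875   3.914  -0.003  1.00 49.84           N
HETATM   11  C8  3MA A   1       0.026   4.909  -0.003  1.00 49.58           C
"""


def z_by_atom(pdb_text):
    """Map atom name -> z-coordinate for ATOM/HETATM records.

    PDB is a fixed-column format: the atom name sits in columns 13-16
    and z in columns 47-54 (1-based), hence the slices below.
    """
    z_values = {}
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            z_values[line[12:16].strip()] = float(line[46:54])
    return z_values


z_values = z_by_atom(ATOMIC_3MA)
farthest = max(z_values, key=lambda name: abs(z_values[name]))
print(farthest, z_values[farthest])
```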

With Atomic_3MA.pdb in place (in the current working directory, or the $X3DNA/config folder), one can perform 3-methyladenine mutations using mutate_bases. For illustration purposes, let’s generate a B-form DNA with base sequence GACATGATTGCC using the 3DNA fiber program:

fiber -seq=GACATGATTGCC fiber-BDNA.pdb

To mutate A7 to 3MA, one needs to run mutate_bases as follows:

mutate_bases "chain=A s=7 m=3MA" fiber-BDNA.pdb fiber-BDNA-A7to3MA.pdb

The result of the mutation is shown in the figure below. Note that the backbone has the same geometry as before the mutation, and the mutated 3MA-T pair has exactly the same parameters (propeller, buckle, etc.) as the original A-T pair. These are the two defining features of the 3DNA mutate_bases program.

3DNA 3-methyladenine mutation

Please see the thread mutations to 3-methyladenine on the 3DNA Forum to download files fiber-BDNA.pdb and fiber-BDNA-A7to3MA.pdb.



DSSR in the visualization of DNA/RNA structures

By following DSSR citations, I recently came across the article Interactive Visualization of RNA and DNA Structures by Lindow et al. The paper introduced a DNA/RNA visualization tool that integrates 1D sequence, 2D secondary structure in linear and graph representations, and 3D backbone ribbons and base ladders, all in one package. Notably, the 3D visualization was tailored for DNA/RNA structures and achieved quite impressive results. A nice feature of the 2D graph representation is the handling of multiple chains.

Reading through the main text and the supplementary material, I was surprised to see how many places DSSR was mentioned, especially the following:

Our approach detects all standard and many modified nucleotides as well as the most common base pairs. Further special cases could be easily added. Yet, the system we developed should not be seen as a replacement for well established tools like DSSR. Rather, it shows what can be achieved with modern techniques in terms of both computation and rendering.

Overall, DSSR is an analysis/annotation tool that is supposedly agnostic to visualization programs. It derives a huge number of structural features that are unlikely to be matched elsewhere. I collaborated with Bob Hanson so that Jmol can directly take advantage of what DSSR has to offer, not just for the visualization of (modified) nucleotides and some common base pairs, but also the interactive selection of loops, pseudoknots, coaxial stacks, and various motifs. In particular, the SQL-like selection syntax Bob developed is really flexible and extremely powerful. I collaborated with Thomas Holder so that PyMOL can gain DNA/RNA domain knowledge. The resultant dssr_block PyMOL plugin is quite useful for creating base/base-pair block images with many revealing features, especially for small to medium-sized DNA/RNA structures. It is obvious to me that PyMOL (or any other molecular visualization tool) would benefit greatly from SQL-like selections of DSSR-derived features of nucleic acid structures, just as Jmol does.

In the Lindow et al. paper, some of the references to DSSR are technical in nature. Here, I’d like to respond and clarify each of them. Since DSSR is being actively developed and supported, I always welcome any feedback on the 3DNA Forum. Following and responding to literature represents another way that I strive to make DSSR a better tool to serve the community.

Built on their experience from 3DNA, Lu et al. developed DSSR [27], a very powerful tool to analyze RNA structures that uses Jmol for the 3D visualization. Recently, Hanson and Lu described this integration [10], which is based on a JSON-interface that directly couples DSSR and the 3D visualization of Jmol. This is a great improvement, but still missing is the integration of 2D secondary structure visualizations and brushing & linking techniques to enable simple selection with and exploration of the 3D molecular structure. One contribution of this paper is to show how a full linking between 3D and 2D visualizations can be done and what benefits arise from such a tight coupling (see Sects. 8 and 9).

This is a valid point, and the authors did a good job. Actually, one of the reviewers of our DSSR-Jmol paper brought up this point, and we acknowledged the limitation. While passing DSSR-derived secondary structural features (in DBN or .ct format) to a 2D visualization tool is straightforward, the connection would not be as smooth as we’d like it to be.

For this purpose, other approaches rely on the unique naming and ordering of the atoms [27], for example, N1, C2, N3, C4, C5, C6 etc. We found that this information is not always reliable.

The naming of the purines and pyrimidines follows the IUPAC standard and is a prerequisite of DNA/RNA structures in the PDB. In my experience, I have never found a single case where such information is not reliable. See below for abasic sites in PDB id 3BWP, and 4SU (4-thiouridine) in PDB id 5AFI.

We compared these results with the latest version of DSSR [27]. Our approach is able to correctly detect all regular nucleotides and most of the modified and undefined nucleotides. In the following, we describe the minor differences.

It is not clear which “latest” version of DSSR was actually used in the paper. Note that DSSR has version info as in v1.8.3-2018oct29. I deliberately put the release date along with the version number.

For dataset 4RGE, we detected 3 modified uracil nucleotides that were not labeled as modified by DSSR. These nucleotides have a DNA backbone instead of an RNA one.

DSSR takes A, C, G, T, U as standard nucleotides, even if T is in RNA or U is in DNA. So this result is expected.

Dataset 3BWP contains 7 nucleotides that only consist of the backbone part without bases. While our approach marks these as undefined, in DSSR they are not detected at all.

The 7 nucleotides on 3BWP are abasic sites, i.e., without base atoms (N1, C2, N3, C4, C5, C6 etc), so they do not possess base reference frames. From early on, DSSR had the --abasic option for such cases. As of v1.7.3-2017dec26, DSSR directly incorporated abasic sites into the analysis. So thereafter they are detected by DSSR, by default.

Furthermore, in 5AFI we mark 3 nucleotides as undefined, while these are detected as a modified uracil by DSSR. This is due to the base containing sulfur instead of oxygen, so they possibly are sulfur analogs of uracil.

Presumably, the authors are referring to 4SU, 4-thiouridine, clearly a modified nucleotide occurring in 137 PDB entries (as of 2018oct28). DSSR detects three cases in 5AFI, as shown here: 4SU-u 3 v.4SU8,w.4SU8,y.4SU8

We also compared the results of our base pair detection (Suppl. Tab. 1). We determined all Watson-Crick, Hoogsteen, and Wobble pairs, and the reverse versions of the first two. For most of the datasets, our method returned the same results as DSSR. In particular, both approaches never created contradicting results, which means all common base pairs had identical pair type. In general, our geometrical approach generates slightly more base pairs compared to DSSR. However, when investigating both, the base pairs determined by DSSR but not by our approach and vice versa, we found that most of these pairs are border-line cases, where the decision was made depending on the threshold of the geometrical heuristic. Only in a few cases, the differences were not clear for both approaches, see Suppl. Fig. 3.

In Suppl. Fig. 3,

… However, the hydrogen bonds for classical G-U Wobble pairs seem to be quite unrealistic for the bottom left pair. Either this is a limitation of DSSR or it is some kind of specific Wobble pair with other hydrogen bonds than the depicted ones that our approach does not detect.

I echo the point that border-line cases could cause discrepancies between different methods. However, things can easily be clarified with concrete examples. Unfortunately, the authors did not specify the cases used in their Suppl. Fig. 3. I finally figured out the DSSR-assigned G-U Wobble pair in PDB id 1S72, U2586—G2592. As shown in the figure below, DSSR detects two H-bonds (dashed pink lines), "N3(imino)*N2(amino)[3.05],O4(carbonyl)-N1(imino)[2.77]". Note that one of the H-bonds is between two donors, N3(U) and N2(G), thus the * symbol. The H-bonds are by no means like those in “classical G-U Wobble pairs”. Yet, the pair is clearly Wobble-like, and that’s why it was assigned “Wobble”. To avoid such confusion, I’ve revised DSSR to tighten the criteria of G-U Wobble pair. As of v1.8.3-2018oct29, this pair is called "~Wobble".
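The H-bond string quoted above is compact but regular, so it can be taken apart with a regular expression. The sketch below handles only the pattern visible in that one string: atom name, atom type in parentheses, a `-` or `*` separator, and the distance in square brackets. The function and key names are made up for illustration, and the `*` is read as in the text, i.e., as the donor-donor contact.

```python
import re

# atom(type) SEP atom(type)[distance], with SEP being '-' or '*'
HBOND = re.compile(
    r"([^()\[\],]+)\(([^()]*)\)([-*])([^()\[\],]+)\(([^()]*)\)\[([\d.]+)\]"
)


def parse_hbonds(description):
    """Split a DSSR-style H-bond description into one dict per H-bond."""
    bonds = []
    for match in HBOND.finditer(description):
        atom1, type1, sep, atom2, type2, dist = match.groups()
        bonds.append({
            "atom1": atom1, "type1": type1,
            "atom2": atom2, "type2": type2,
            "distance": float(dist),
            "starred": sep == "*",  # '*' marks the donor-donor contact
        })
    return bonds


desc = "N3(imino)*N2(amino)[3.05],O4(carbonyl)-N1(imino)[2.77]"
for bond in parse_hbonds(desc):
    print(bond)
```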

DSSR-assigned ~Wobble pair

Nevertheless, our evaluation (Sect. 8.1) shows that with the proposed approaches in terms of quality we get very similar results to the ones obtained by tools like DSSR. In terms of speed, DSSR needs much longer run times. For example, for 4U4O, DSSR needed ~15 min for the secondary and tertiary structure analysis [27], while our algorithm only needs ~0.2 s (see Tab. 1).

As noted above, DSSR provides far more structural features than just the identifications of nucleotides and several common base pairs. Even for the identified base pairs, DSSR provides many more annotations and structural parameters than just the named pairs picked by the authors. Not surprisingly, DSSR is slower than the dedicated method for a specific purpose.

As of DSSR v1.8.3-2018oct29, I’ve added the --pair-only option that just outputs a complete listing of base-pairing information and then stops. Some sample runs are as below:

x3dna-dssr -i=1ehz.pdb --pair-only
x3dna-dssr -i=1ehz.pdb --pair-only --more
x3dna-dssr -i=1ehz.pdb --pair-only --json
x3dna-dssr -i=1ehz.pdb --pair-only --json | jq '.pairs[] | select(.name=="WC")'
x3dna-dssr -i=1ehz.pdb --pair-only --more --json | jq .
x3dna-dssr -i=4u4o.cif --pair-only -o=4u4o-pairs.txt

Compared to the default settings, DSSR runs ~10 times faster when the --pair-only option is set: 36s vs 5m48s for 4U4O on my MacBook Pro 2017 (2.9 GHz Intel Core i7). Note the timing here is a complete run of the DSSR program (as shown above), from reading the mmCIF file to writing out all the derived features. In my hands, simply reading and parsing the 85MB 4U4O.cif would take ~5s. As a reference, just loading 4U4O.cif into PyMOL takes >6s. I’m thus more than surprised (and remain to be convinced) by the claim that their new algorithm “only needs” ~0.2s “for the secondary and tertiary structure analysis” of 4U4O.
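The jq filter in the sample runs above, `.pairs[] | select(.name=="WC")`, has a direct Python counterpart for users who prefer to post-process the JSON in a script. The sketch below assumes only what that filter implies: a top-level `pairs` array whose entries carry a `name` field. The sample JSON is a hand-written stand-in, not actual DSSR output, and the other keys in it are illustrative.

```python
import json


def select_pairs(dssr_json_text, name):
    """Return the entries of the top-level "pairs" array with a given name."""
    doc = json.loads(dssr_json_text)
    return [pair for pair in doc.get("pairs", []) if pair.get("name") == name]


# Hand-written stand-in for the output of:
#   x3dna-dssr -i=1ehz.pdb --pair-only --json
sample = """
{"pairs": [
  {"nt1": "A.G1", "nt2": "A.C72", "name": "WC"},
  {"nt1": "A.G4", "nt2": "A.U69", "name": "Wobble"},
  {"nt1": "A.C2", "nt2": "A.G71", "name": "WC"}
]}
"""

for pair in select_pairs(sample, "WC"):
    print(pair["nt1"], pair["nt2"])
```

In practice, the JSON text would come from a file written by DSSR or from the program's standard output.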



Enhanced features in DSSR for G-quadruplexes

Over the past couple of months, I’ve further enhanced the DSSR-derived structural features for G-quadruplexes (G4). One was the implementation of the single descriptor of intramolecular canonical G4 structures with three connecting loops recently proposed by Dvorkin et al. The descriptor contains the number of guanines in the G4 stem, the type and relative direction of loops linking G-tracts of the stem, and the groove-widths associated with lateral loops. For example, PDB entry 2GKU (see the DSSR-enabled PyMOL schematic image below, Fig. 1A) has the following DSSR output.

List of 1 G4-stem
  Note: a G4-stem is defined as a G4-helix with backbone connectivity.
        Bulges are also allowed along each of the four strands.
  stem#1[#1] layers=3 INTRA-molecular loops=3 descriptor=3(-P-Lw-Ln) note=hybrid-1(3+1) UUDU anti-parallel
   1  glyco-bond=ss-s groove=-wn- mm(<>,outward)  area=14.24 rise=3.58 twist=16.8  nts=4 GGGG A.DG3,A.DG9,A.DG17,A.DG21
   2  glyco-bond=--s- groove=-wn- pm(>>,forward)  area=13.12 rise=3.71 twist=25.9  nts=4 GGGG A.DG4,A.DG10,A.DG16,A.DG22
   3  glyco-bond=--s- groove=-wn-                                                  nts=4 GGGG A.DG5,A.DG11,A.DG15,A.DG23
    strand#1  U DNA glyco-bond=s-- nts=3 GGG A.DG3,A.DG4,A.DG5
    strand#2  U DNA glyco-bond=s-- nts=3 GGG A.DG9,A.DG10,A.DG11
    strand#3  D DNA glyco-bond=-ss nts=3 GGG A.DG17,A.DG16,A.DG15
    strand#4  U DNA glyco-bond=s-- nts=3 GGG A.DG21,A.DG22,A.DG23
    loop#1 type=propeller strands=[#1,#2] nts=3 TTA A.DT6,A.DT7,A.DA8
    loop#2 type=lateral   strands=[#2,#3] nts=3 TTA A.DT12,A.DT13,A.DA14
    loop#3 type=lateral   strands=[#3,#4] nts=3 TTA A.DT18,A.DT19,A.DA20

The descriptor=3(-P-Lw-Ln) means that the G4 structure has three layers of G-tetrads, connected via three loops: the first is the Propeller loop in anti-clockwise (negative) direction, then the Lateral loop passing a wide groove anti-clockwise, and finally another Lateral loop passing a narrow groove, also anti-clockwise. The DSSR symbols follow those of Dvorkin et al. but with capital letters L, P, and D for lateral, propeller, and diagonal loops instead of lower case letters (l, p, d) to avoid using subscript for groove-width info. So the 2GKU descriptor 3(-P-Lw-Ln) from DSSR corresponds to 3(-p-lw-ln) of Dvorkin et al.
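For scripting purposes, the descriptor string can be decoded mechanically. The following sketch is based only on the two descriptors that appear in this post, 3(-P-Lw-Ln) and 2(D+PX); the function name and the returned dict layout are made up for illustration. Reading `+` as clockwise is an assumption, being simply the opposite of the anti-clockwise `-` explained above. Letters or suffixes not explained in the text, such as X, are passed through unchanged.

```python
import re

LOOP_TYPES = {"P": "propeller", "L": "lateral", "D": "diagonal"}
DIRECTIONS = {"-": "anti-clockwise", "+": "clockwise", "": None}
GROOVES = {"w": "wide", "n": "narrow", "": None}


def parse_descriptor(text):
    """Decode a descriptor such as '3(-P-Lw-Ln)' into layers and loops."""
    match = re.fullmatch(r"(\d+)\((.*)\)", text.strip())
    if match is None:
        raise ValueError(f"not a G4 descriptor: {text!r}")
    layers, body = int(match.group(1)), match.group(2)
    # Each token: optional sign, one capital letter, optional groove letter.
    tokens = re.findall(r"([+-]?)([A-Z])([a-z]?)", body)
    if "".join("".join(token) for token in tokens) != body:
        raise ValueError(f"unexpected characters in {text!r}")
    loops = [
        {
            "type": LOOP_TYPES.get(letter, letter),
            "direction": DIRECTIONS[sign],
            "groove": GROOVES.get(groove, groove),
        }
        for sign, letter, groove in tokens
    ]
    return {"layers": layers, "loops": loops}


print(parse_descriptor("3(-P-Lw-Ln)"))
print(parse_descriptor("2(D+PX)"))
```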

The DSSR-enabled, PyMOL-rendered, block image in Fig. 1A makes the three G-tetrad layers (squared green blocks) immediately obvious. Other base identities and stacking interactions also become clear — for example, the A24 (in red) stacks on the top G-tetrad, and T1-A20 pair stacks with the bottom G-tetrad.

Two other PDB entries (2LOD and 2KOW) are illustrated in Fig. 1B and Fig. 1C. They have different topologies than 2GKU (Fig. 1A). DSSR is able to characterize all of them consistently.

DSSR-enabled G4 analysis and representation
Figure 1. DSSR-enabled, PyMOL-rendered, block images of five G-quadruplexes. A in red, C in yellow, G (and G-tetrad) in green, and T in blue.

Another G4-related new feature in DSSR is the detection of V-shaped loops in noncanonical G4 structures where one of the four G-G columns (strands) that link adjacent G-tetrads is broken. Two recent PDB examples with V-loops are shown in Fig. 1D (5ZEV) and Fig. 1E (6H1K). An excerpt of DSSR output for the PDB entry 6H1K is shown below.

List of 1 G4-helix
  Note: a G4-helix is defined by stacking interactions of G4-tetrads, regardless
        of backbone connectivity, and may contain more than one G4-stem.
  helix#1[1] stems=[#1] layers=3 INTRA-molecular
   1  glyco-bond=-sss groove=w--n mm(<>,outward)  area=12.76 rise=3.47 twist=18.2  nts=4 GGGG A.DG2,A.DG19,A.DG15,A.DG26
   2  glyco-bond=s--- groove=w--n pm(>>,forward)  area=12.84 rise=3.07 twist=33.4  nts=4 GGGG A.DG1,A.DG20,A.DG16,A.DG27
   3  glyco-bond=s--- groove=w--n                                                  nts=4 GGGG A.DG25,A.DG21,A.DG17,A.DG28
    strand#1 DNA glyco-bond=-ss nts=3 GGG A.DG2,A.DG1,A.DG25
    strand#2 DNA glyco-bond=s-- nts=3 GGG A.DG19,A.DG20,A.DG21
    strand#3 DNA glyco-bond=s-- nts=3 GGG A.DG15,A.DG16,A.DG17
    strand#4 DNA glyco-bond=s-- nts=3 GGG A.DG26,A.DG27,A.DG28

List of 1 G4-stem
  Note: a G4-stem is defined as a G4-helix with backbone connectivity.
        Bulges are also allowed along each of the four strands.
  stem#1[#1] layers=2 INTRA-molecular loops=3 descriptor=2(D+PX) note=UD3(1+3) UDDD anti-parallel
   1  glyco-bond=s--- groove=w--n mm(<>,outward)  area=12.76 rise=3.47 twist=18.2  nts=4 GGGG A.DG1,A.DG20,A.DG16,A.DG27
   2  glyco-bond=-sss groove=w--n                                                  nts=4 GGGG A.DG2,A.DG19,A.DG15,A.DG26
    strand#1  U DNA glyco-bond=s- nts=2 GG A.DG1,A.DG2
    strand#2  D DNA glyco-bond=-s nts=2 GG A.DG20,A.DG19
    strand#3  D DNA glyco-bond=-s nts=2 GG A.DG16,A.DG15
    strand#4  D DNA glyco-bond=-s nts=2 GG A.DG27,A.DG26
    loop#1 type=diagonal  strands=[#1,#3] nts=12 GAGGCGTGGCCT A.DG3,A.DA4,A.DG5,A.DG6,A.DC7,A.DG8,A.DT9,A.DG10,A.DG11,A.DC12,A.DC13,A.DT14
    loop#2 type=propeller strands=[#3,#2] nts=2 GC A.DG17,A.DC18
    loop#3 type=diag-prop strands=[#2,#4] nts=5 GACTG A.DG21,A.DA22,A.DC23,A.DT24,A.DG25

List of 2 non-stem G4 loops (INCLUDING the two terminal nts)
   1 type=lateral   helix=#1 nts=5 GACTG A.DG21,A.DA22,A.DC23,A.DT24,A.DG25
   2 type=V-shaped  helix=#1 nts=4 GGGG A.DG25,A.DG26,A.DG27,A.DG28

Note that a new loop type (diag-prop) and a new topology description symbol (X) are introduced here. In developing DSSR in general, and its G4-related features in particular, I’ve always tried to follow conventions widely used by the community. Where inconsistencies exist, I pick the ones that are in line with other parts of DSSR. For unique DSSR features lacking outside references, I came up with my own nomenclature. As DSSR becomes more widely used, it may serve to standardize G4 nomenclature.
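
For readers who post-process the text output, the loop lines in the 6H1K excerpt can be picked apart with a regular expression. The following is only a minimal sketch: the pattern is inferred from the lines quoted above (it is not an official DSSR grammar), and the function name `parse_g4_loops` is hypothetical.

```python
import re

# One 'loop#N' line, in the layout seen in the 6H1K excerpt (inferred, not official).
LOOP_RE = re.compile(
    r"loop#(?P<index>\d+)\s+type=(?P<kind>\S+)\s+"
    r"strands=\[#(?P<src>\d+),#(?P<dst>\d+)\]\s+"
    r"nts=(?P<count>\d+)\s+(?P<seq>\S+)\s+(?P<ids>\S+)"
)


def parse_g4_loops(text):
    """Return one dict per 'loop#N' line found in a DSSR text excerpt."""
    loops = []
    for m in LOOP_RE.finditer(text):
        loops.append({
            "index": int(m["index"]),
            "type": m["kind"],
            "strands": (int(m["src"]), int(m["dst"])),
            "count": int(m["count"]),
            "seq": m["seq"],
            "nts": m["ids"].split(","),
        })
    return loops


# Loop lines copied from the stem section of the excerpt.
EXCERPT = """\
    loop#1 type=diagonal  strands=[#1,#3] nts=12 GAGGCGTGGCCT A.DG3,A.DA4,A.DG5,A.DG6,A.DC7,A.DG8,A.DT9,A.DG10,A.DG11,A.DC12,A.DC13,A.DT14
    loop#2 type=propeller strands=[#3,#2] nts=2 GC A.DG17,A.DC18
    loop#3 type=diag-prop strands=[#2,#4] nts=5 GACTG A.DG21,A.DA22,A.DC23,A.DT24,A.DG25
"""

for loop in parse_g4_loops(EXCERPT):
    print(loop["index"], loop["type"], loop["strands"], loop["seq"])
```

For anything beyond a quick look, the JSON output is likely a more robust target than scraping the text report.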



DSSR is fast for MD analysis

From early on, the --json and --nmr options in DSSR have provided a convenient means to analyze an ensemble of solution NMR structures in the standard PDB/mmCIF format, such as those available from the Protein Data Bank (PDB). The usage is very simple, as shown below for the PDB entry 2lod. The parameters for each model can be easily parsed from the output JSON stream.

x3dna-dssr -i=2lod.pdb --nmr --json

A practical example of the DSSR JSON/NMR usage for the analysis of RNA backbone torsion angles can be found on the 3DNA Forum.
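
To give a flavor of what such parsing can look like, here is a minimal Python sketch. It assumes the JSON stream has a top-level `models` list whose entries carry a `model` number and a `parameters` object with an `nts` list (each nucleotide having an `nt_id` and torsion fields such as `alpha`). These key names are assumptions to be checked against real output, the function name is hypothetical, and the sample data is made up purely so the example runs on its own.

```python
import json


def torsion_by_model(dssr_json, angle="alpha"):
    """Map model number -> {nt_id: angle} from a DSSR-like JSON string.

    The key names used here are assumptions about the JSON layout.
    """
    data = json.loads(dssr_json)
    table = {}
    for entry in data.get("models", []):
        nts = entry.get("parameters", {}).get("nts", [])
        table[entry["model"]] = {nt["nt_id"]: nt.get(angle) for nt in nts}
    return table


# Made-up placeholder standing in for real `--nmr --json` output.
SAMPLE = json.dumps({
    "models": [
        {"model": 1, "parameters": {"nts": [
            {"nt_id": "A.G1", "alpha": None},
            {"nt_id": "A.C2", "alpha": -60.0},
        ]}},
        {"model": 2, "parameters": {"nts": [
            {"nt_id": "A.G1", "alpha": None},
            {"nt_id": "A.C2", "alpha": -65.0},
        ]}},
    ]
})

for model, angles in torsion_by_model(SAMPLE).items():
    print(model, angles)
```

In practice, the string would come from the standard output of the command shown above rather than from an embedded sample.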

While not a practitioner of molecular dynamics (MD) simulations, I’ve regularly followed the relevant literature. I know of popular tools such as MDAnalysis, MDTraj, and CPPTRAJ that are dedicated to analyzing trajectories of MD simulations. I understand the subtleties MD may have, and I’m also sure of the unique features DSSR has to offer. By design, I made the DSSR interface to MD straightforward, by simply following commonly used standard data formats: the MODEL/ENDMDL delineated PDB (or the PDBx/mmCIF) format for input, and JSON for output. Overall, I had expected that DSSR would complement the dedicated tools (e.g., MDAnalysis, MDTraj, and CPPTRAJ) for MD analysis.
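
The MODEL/ENDMDL convention is simple enough that an ensemble can be walked through one frame at a time. The sketch below illustrates the input format only; it is not how DSSR itself is implemented, the function name `iter_models` is hypothetical, and the toy coordinates are placeholders.

```python
import io


def iter_models(handle):
    """Yield (model_number, lines) one frame at a time from a MODEL/ENDMDL stream."""
    number, lines = None, []
    for line in handle:
        record = line[:6].strip()
        if record == "MODEL":
            # Start a new frame; the serial number follows the record name.
            number, lines = int(line[6:].split()[0]), []
        elif record == "ENDMDL":
            if number is not None:
                yield number, lines
            number, lines = None, []
        elif number is not None:
            lines.append(line.rstrip("\n"))


# Toy two-frame ensemble with placeholder coordinates.
TOY = """\
MODEL        1
ATOM      1  P     G A   1       0.000   0.000   0.000
ENDMDL
MODEL        2
ATOM      1  P     G A   1       1.000   0.000   0.000
ENDMDL
"""

for number, lines in iter_models(io.StringIO(TOY)):
    print(number, len(lines))
```

Because the generator holds only the current frame, memory use stays flat regardless of how many frames the file contains.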

Over the years, DSSR has gradually gained recognition in the MD field. At a meeting, I once heard a user complain that DSSR was too slow for the analysis of millions of frames from MD simulations. Yet, without access to a large MD dataset and direct collaboration with a user, the speed issue could not be pursued further. In my experience, DSSR was fast enough for the analysis of NMR ensembles from the PDB. The situation changed completely after a user reported on the 3DNA Forum that DSSR was slow for MD analysis:

Do you have an idea why the backbone parameter for a nucleic acids are so much faster calculated with do_x3dna than with DSSR? Analyzing a trajectory with 100k frames take for a native structure approx. 2 hours with do_x3dna. A native RNA structure with DSSR will take approx. 10 days (10k frames/day). I need to run DSSR, because my system contains an abasic site.

With the above and follow-up information provided, I was able to fix the DSSR algorithm for parsing MD trajectories, among other things. Now DSSR reads a trajectory sequentially, frame by frame, at constant speed. The same 100K frames take 36 minutes to finish instead of 10 days, which is a speedup of 10*24*60/36 = 400 times. This roughly 400-fold speedup was later verified when I tested DSSR on the 1000-structure trajectory the user supplied.
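
The arithmetic behind the speedup can be spelled out in a few lines of Python. This sketch uses only the figures quoted in this post (10 days before, 36 minutes after, 100K frames); the helper name is hypothetical.

```python
def speedup(before_minutes, after_minutes):
    """Ratio of the old running time to the new one."""
    return before_minutes / after_minutes


before = 10 * 24 * 60   # 10 days, expressed in minutes
after = 36              # minutes after the fix
frames = 100_000

print(speedup(before, after))      # 400.0
print(frames / (after * 60))       # frames processed per second after the fix
```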

So as of v1.7.8-2018sep01, DSSR is quick enough for real-world applications on MD analysis. In the releases of DSSR afterwards, I’ve further polished the code and added some new features. All things considered, DSSR is bound to become more relevant in the active MD field in the years to come.

By the way, for those who do not like the --nmr option, --md or --ensemble also works. These three alternatives are equivalent to DSSR internally.



Integrations of DSSR to other bioinformatics resources

As mentioned in the blog post Integrating DSSR into Jmol and PyMOL, “The small size, zero configuration, extensive features, and robust performance make DSSR ideal to be integrated into other bioinformatics tools.” In addition to the DSSR-Jmol and DSSR-PyMOL integrations, which I initiated and was personally involved in, other bioinformatics resources are increasingly taking advantage of what DSSR has to offer. Here are a few examples:

Before aligning structures, STAR3D preprocesses PDB files with base-pairing annotation using either MC-Annotate (Gendron et al., 2001; Lemieux and Major, 2002) (for PDB inputs) or DSSR (Lu et al., 2015) (for PDB and mmCIF inputs) and pseudo-knot removal using RemovePseudoknots (Smit et al., 2008).

2014, RNApdbee: In order to facilitate a more comprehensive study, the webserver integrates the functionality of RNAView, MC-Annotate and 3DNA/DSSR, being the most common tools used for automated identification and classification of RNA base pairs.

2018, RNApdbee 2.0: Base pairs can be identified by 3DNA/DSSR (default) (4), RNAView (5), MC-Annotate (3) or newly added FR3D (15).

  • The Universe of RNA Structures (URS) web-interface to the URS database (URSDB) makes extensive use of DSSR. For each analyzed structure (including PDB entries), the DSSR text output file (termed “DSSR-file”) is also available. Impressively, the maintainers of URS are quick with DSSR updates. The current version used by the URS website is DSSR v1.7.4-2018jan30.

Forty years after the yeast phenylalanine tRNA structure was solved, modified nucleotides should no longer be an issue for RNA structural analysis, especially for this classic molecule. Automatic processing of modified nucleotides is just one aspect of DSSR’s substantial set of features. Based on my understanding of the field, more structural bioinformatics resources/tools could benefit from DSSR. Simply put, if one’s project is related to 3D DNA or RNA structures, DSSR may well be of help. It is just a matter of time before DSSR benefits a (much) larger community.



DSSR-enabled cartoon block images for G-quadruplex

Via Google Scholar, I noticed the recent publication in Nucleic Acids Research by Meier et al. titled Structure and hydrodynamics of a DNA G-quadruplex with a cytosine bulge. Reading through the article, I am pleased to see the section “Nucleic acid geometry and visualization” under MATERIALS AND METHODS:

We used the program DSSR (53) of the 3DNA suite (54) to analyse the nucleic acid backbone and the base pair geometry from the 3D structures. We reported the ‘simple’ base-pair parameters for buckle, propeller twist and stagger which are more intuitive for non-canonical base-pairs than the classic base-pair parameters as explained in the program manual and the 3DNA website (, We wrote an R (55) script that automatically creates a backbone angle plot from the output of the DSSR program. The script can be downloaded from the 3DNA forum at The nucleic acid was visualized in PyMOL and the dssr block plugin (The PyMOL Molecular Graphics System, Version 2.0, Schrödinger, LLC, …

This is the first time (that I’m aware of) that the ‘simple’ base-pair parameters introduced in 3DNA v2.3 are cited in a peer-reviewed journal article. I’m also glad to know that the blog posts on the 3DNA homepage are read, and even referenced in a publication, which will surely prompt me to write more. This is also the first time that the dssr_block PyMOL plugin is cited. I must say that Figures 1, 5, and 6 from the paper look gorgeous. Among other things, the G-tetrads and the surrounding base identities are immediately obvious using the simple color code: A, red; T, blue; C, yellow; and G, green. See Fig. 1 below.

DSSR-enabled block images [Fig. 1 of Meier18-nar-gky307]

In the section on “DATA AVAILABILITY”, the authors further noted:

Our R (55) script that automatically creates backbone angle plots from the output of the DSSR program can be downloaded from the 3DNA forum at

I communicated with Markus Meier (the lead author) on the 3DNA Forum, on the thread DSSR: Analyzing NMR structures – overwritten output files. Checking the thread right now, I found that the R script (backbone_torsions_plot-1.0.tar.bz2) has been downloaded 263 times. I appreciate Markus’s effort in contributing the R script with a working example to the DSSR user community. It has always been my hope that more DSSR users would share their scripts and examples via the 3DNA Forum.

As a side note, I met Markus in Los Angeles at the 60th Annual Meeting. It was a nice experience chatting with him, and we had lunch together. We’ve kept in touch since the meeting.



Over 10K nucleic-acid-containing structures in the PDB

When visiting the RCSB PDB website today, I am pleased to notice that the PDB now contains “10015 Nucleic Acid Containing Structures”. Based on “Macromolecule Type” in “Advanced Search” of the RCSB PDB website, I observed the following information:

  • The number of DNA-containing structures is 6,384 (reported in 2,997 papers), and the corresponding number for RNA-containing structures is 3,861 (associated with 2,012 publications).
  • There are 4,570 structures containing both DNA and protein (potentially forming DNA-protein complexes), and 2,478 RNA-protein complexes.
  • The smallest nucleic-acid-containing structures have only two nucleotides (e.g., 3rec), and the largest ones are the ribosomes (and virus particles).
  • The earliest released DNA structure from the PDB is 1zna (on March 18, 1981), a Z-DNA tetramer. The earliest RNA structure released is 4tna (on April 12, 1978), a refined structure of the yeast phenylalanine transfer RNA.

This landmark achievement is made possible by the world-wide scientific community through decades of efforts solving DNA/RNA 3D structures via experimental approaches (mainly solution NMR, X-ray crystallography, and cryo-EM). These over 10K nucleic acid structures present both challenges and opportunities for the field of structural bioinformatics, especially for intricate RNA molecules. DSSR is an integrated software tool for dissecting the spatial structure of RNA. It is my effort to address the challenging issues in the analysis/annotation and visualization of RNA structures.



One base forming two Watson-Crick pairs?

It is textbook knowledge that Watson-Crick (WC) pairs are specific, forming only between A and T/U (A–T/U or T/U–A) or between G and C (G–C or C–G). Furthermore, an A forms only one WC pair with a T, and likewise a G with a C. The widely used dot-bracket notation (DBN) of DNA/RNA secondary structure depends crucially on this specificity and uniqueness: it uses matched parentheses to represent WC pairs, such as ((....)) for a GCGA (GNRA-type) tetraloop of sequence GCGCGAGC.
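
The one-to-one assumption built into DBN is easy to see in code. The following minimal sketch converts a pseudoknot-free DBN string into a list of 1-based paired positions by matching parentheses with a stack; the function name `dbn_to_pairs` is hypothetical and the snippet is not taken from DSSR.

```python
def dbn_to_pairs(dbn):
    """Return sorted (i, j) pairs (1-based) encoded by matched parentheses."""
    stack, pairs = [], []
    for pos, symbol in enumerate(dbn, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        # any other symbol (e.g. '.') denotes an unpaired position
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


sequence = "GCGCGAGC"
for i, j in dbn_to_pairs("((....))"):
    print(f"{sequence[i - 1]}{i}-{sequence[j - 1]}{j}")
```

Each position holds a single symbol, so a base can be recorded as the partner of at most one other base. That is exactly why a base with two apparent WC partners cannot be expressed in this notation.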

The reality is more complicated, even for the presumably ‘simple’ question of deriving RNA secondary structure from 3D coordinates in the PDB. One subtlety is related to the ambiguity of atomic coordinates that renders one base apparently forming two WC pairs with two other complementary bases. As always, the case is best illustrated with a concrete example. The image shown below is taken from PDB entry 1qp5, where C20 (on chain B) appears to form two WC pairs, one with G4 and one with G5 (both on chain A).

C forming two WC G-C pairs in PDB entry 1qp5

Clearly, taking both as valid WC G–C pairs would make the resultant DBN illegitimate. DSSR resolves such discrepancies by taking structural context into consideration, ensuring that a base forms a WC pair with at most one other base. Here the G5–C20 WC pair is retained, whilst the G4–C20 WC pair is removed.
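
Spotting such conflicts in a list of annotated pairs is straightforward, as the sketch below shows. It only detects bases with more than one WC partner; it does not attempt to reproduce how DSSR decides which pair to keep, since that depends on structural context not modeled here. The function name and the nucleotide labels (chain.base+number) are illustrative.

```python
from collections import defaultdict


def find_conflicts(pairs):
    """Return {nucleotide: sorted partners} for bases listed in more than one pair."""
    partners = defaultdict(set)
    for a, b in pairs:
        partners[a].add(b)
        partners[b].add(a)
    return {nt: sorted(p) for nt, p in partners.items() if len(p) > 1}


# The 1qp5 case described above, written with illustrative labels.
wc_pairs = [("A.G4", "B.C20"), ("A.G5", "B.C20")]
print(find_conflicts(wc_pairs))
```

A downstream script could use the returned dictionary to flag entries for inspection before writing out a DBN string.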

This issue, where one base appears to form two WC pairs as derived from PDB coordinates, has long been noticed. Two examples from the literature are shown below:

The crystal structure data files were downloaded from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (Berman et al. 2000). For each crystal structure, the set of canonical base pairs was extracted by selecting all Watson–Crick and standard G-U wobble pairs found by RNAview (Yang et al. 2003). Occasional conflicts in this list, where RNAview annotates two bases, x and y, as a standard base pair and also y and z as another conflicting base pair, were removed manually by visual inspection of the crystal structure in the program PyMOL ( The helix-extension data set was created by taking the canonical pairs and adding all additional base–base interactions identified by RNAview (excluding stacked bases and tertiary interactions) for which the direct neighbor was already in the collection. This means each base pair (i,j) was added if both i and j were still unpaired and if either (i + 1, j – 1) or (i –1, j + 1) were already in the set.

… From these complexes, we retrieved all RNA chains also marked as non-redundant by RNA3DHub. Each chain was annotated by FR3D. Because FR3D cannot analyze modified nucleotides or those with missing atoms, our present method does not include them either. If several models exist for a same chain, the first one only was considered. For the rest of this paper, the base pairs extracted from the FR3D annotations are those defined in the Leontis–Westhof geometric classification (24).

For each chain a secondary structure without pseudoknots was deduced from the annotated interactions, as follows. First all canonical Watson–Crick and wobble base pairs (i.e. A-U, G-C and G-U) were identified. Then, since many structures are naturally pseudoknotted, we used the K2N (25) implementation in the PyCogent (26) Python module to remove pseudoknots. Problems arise when a nucleotide is involved in several Watson–Crick base pairs (which is geometrically not feasible), probably due to an error of the automatic annotation. Those discrepancies were removed with a ad hoc algorithm such that if a nucleotide is involved in several Watson–Crick base pairs, we remove the base pair which belongs to the shortest helix.

By design, DSSR takes care of these ‘little details’, among other handy features (such as handling modified nucleotides and removing pseudoknots). By providing a robust infrastructure and comprehensive framework, DSSR allows users to focus on their research topics. If you have experience with other tools, such as RNAView and FR3D cited above, give DSSR a try: it may fit your needs better.



DNA conformational changes may play an active role in viral genome packaging

An article titled Simulations and electrostatic analysis suggest an active role for DNA conformational changes during genome packaging by bacteriophages has recently been posted on bioRxiv. I was honored to have the opportunity to collaborate with fellow researchers from the University of Pennsylvania and Thomas Jefferson University on this significant piece of work.

Here is the abstract. Please download the PDF version to learn more.

Motors that move DNA, or that move along DNA, play essential roles in DNA replication, transcription, recombination, and chromosome segregation. The mechanisms by which these DNA translocases operate remain largely unknown. Some double-stranded DNA (dsDNA) viruses use an ATP-dependent motor to drive DNA into preformed capsids. These include several human pathogens, as well as dsDNA bacteriophages (viruses that infect bacteria). We previously proposed that DNA is not a passive substrate of bacteriophage packaging motors but is, instead, an active component of the machinery. Computational studies on dsDNA in the channel of viral portal proteins reported here reveal DNA conformational changes consistent with that hypothesis. dsDNA becomes longer (“stretched”) in regions of high negative electrostatic potential, and shorter (“scrunched”) in regions of high positive potential. These results suggest a mechanism that couples the energy released by ATP hydrolysis to DNA translocation: The chemical cycle of ATP binding, hydrolysis and product release drives a cycle of protein conformational changes. This produces changes in the electrostatic potential in the channel through the portal, and these drive cyclic changes in the length of dsDNA. The DNA motions are captured by a coordinated protein-DNA grip-and-release cycle to produce DNA translocation. In short, the ATPase, portal and dsDNA work synergistically to promote genome packaging.



Handling of abasic sites in DSSR

An abasic site is a location in DNA or RNA where a purine or pyrimidine base is missing. It is also termed an AP site (i.e., apurinic/apyrimidinic site) in biochemistry and molecular genetics. The abasic site can be formed either spontaneously (e.g., depurination) or due to DNA damage (occurring as intermediates in base excision repair). According to Wikipedia, “It has been estimated that under physiological conditions 10,000 apurinic sites and 500 apyrimidinic may be generated in a cell daily.”

In DSSR and 3DNA v2.x, nucleotides are recognized using standard atom names and base planarity. Thus, abasic sites are not taken as nucleotides (by default), simply because they do not have base atoms. DSSR introduced the --abasic option to account for abasic sites, a feature useful for detecting loops with backbone connectivity.

For example, by default, DSSR identifies one internal loop (no. 1 in the list below) in PDB entry 1l2c. With the --abasic option, two internal loops (including the one with the abasic site C.HPD18, no. 2) are detected.

List of 2 internal loops
   1 symmetric internal loop: nts=6; [1,1]; linked by [#-1,#1]
     summary: [2] 1 1 [B.1 C.24 B.3 C.22] 1 4
     nts=6 GTATAC B.DG1,B.DT2,B.DA3,C.DT22,C.DA23,C.DC24
       nts=1 T B.DT2
       nts=1 A C.DA23
   2 symmetric internal loop: nts=6; [1,1]; linked by [#1,#2]
     summary: [2] 1 1 [B.6 C.19 B.8 C.17] 4 5
     nts=6 CTTA?G B.DC6,B.DT7,B.DT8,C.DA17,C.HPD18,C.DG19
       nts=1 T B.DT7
       nts=1 ? C.HPD18

Note that C.HPD18 in 1l2c is a non-standard residue, as shown in the HETATM records below. Since the identity of C.HPD18 cannot be deduced from the atomic records, its one-letter code is designated as ?.

HETATM  346  P   HPD C  18     -14.637  52.299  29.949  1.00 49.12           P
HETATM  347  O5' HPD C  18     -14.658  52.173  28.359  1.00 48.28           O
HETATM  348  O1P HPD C  18     -15.167  51.040  30.537  1.00 49.35           O
HETATM  349  O2P HPD C  18     -13.303  52.798  30.369  1.00 46.43           O
HETATM  350  C5' HPD C  18     -15.703  51.469  27.687  1.00 45.70           C
HETATM  351  O4' HPD C  18     -16.364  50.501  25.561  1.00 44.15           O
HETATM  352  O3' HPD C  18     -13.990  51.738  24.335  1.00 45.75           O
HETATM  353  C1' HPD C  18     -16.105  54.187  25.684  1.00 52.47           C
HETATM  354  O1' HPD C  18     -17.309  54.085  26.496  1.00 56.16           O
HETATM  355  C3' HPD C  18     -14.756  52.250  25.426  1.00 46.23           C
HETATM  356  C4' HPD C  18     -15.263  51.093  26.291  1.00 45.72           C
HETATM  357  C2' HPD C  18     -16.030  52.889  24.898  1.00 49.05           C

In contrast, R.U-8 in PDB entry 4ifd also lacks base atoms, but it carries the standard residue name U and is thus properly labeled by DSSR.

ATOM  26418  P     U R  -8     139.362  21.962 129.430  1.00208.29           P
ATOM  26419  OP1   U R  -8     140.062  20.821 130.074  1.00207.30           O
ATOM  26420  OP2   U R  -8     140.113  23.208 129.129  1.00208.44           O1+
ATOM  26421  O5'   U R  -8     138.712  21.439 128.071  1.00157.60           O
ATOM  26422  C5'   U R  -8     139.507  20.790 127.087  1.00155.47           C
ATOM  26423  C4'   U R  -8     138.843  20.804 125.731  1.00152.27           C
ATOM  26424  O4'   U R  -8     138.538  22.172 125.352  1.00149.29           O
ATOM  26425  C3'   U R  -8     139.677  20.275 124.572  1.00152.70           C
ATOM  26426  O3'   U R  -8     139.670  18.859 124.478  1.00155.04           O
ATOM  26427  C2'   U R  -8     139.053  20.969 123.369  1.00150.26           C
ATOM  26428  O2'   U R  -8     137.849  20.322 122.984  1.00146.83           O
ATOM  26429  C1'   U R  -8     138.700  22.334 123.958  1.00147.35           C
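
Both residues listed above have sugar-phosphate atoms but no base atoms. A rough way to flag such candidates from PDB records is sketched below. It relies on atom names only and ignores the base-planarity check mentioned earlier, so it is a simplification and not DSSR's actual procedure; the atom-name sets and function names are assumptions made for illustration.

```python
# Atom names used as a crude test (assumed sets, for illustration only).
SUGAR_ATOMS = {"C1'", "C2'", "C3'", "C4'", "O4'"}
BASE_RING_ATOMS = {"N1", "C2", "N3", "C4", "C5", "C6"}


def residues(pdb_lines):
    """Group atom names by (chain, residue number, residue name)."""
    groups = {}
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            # Fixed columns of the PDB format: name 13-16, resName 18-20,
            # chain 22, resSeq 23-26 (1-based).
            key = (line[21], int(line[22:26]), line[17:20].strip())
            groups.setdefault(key, set()).add(line[12:16].strip())
    return groups


def abasic_candidates(pdb_lines):
    """Residues with a complete sugar ring but no base ring atoms."""
    return [key for key, atoms in residues(pdb_lines).items()
            if SUGAR_ATOMS <= atoms and not (BASE_RING_ATOMS & atoms)]


# Sugar atoms of C.HPD18, copied from the 1l2c records shown earlier.
DEMO_LINES = [
    "HETATM  351  O4' HPD C  18     -16.364  50.501  25.561  1.00 44.15           O",
    "HETATM  353  C1' HPD C  18     -16.105  54.187  25.684  1.00 52.47           C",
    "HETATM  355  C3' HPD C  18     -14.756  52.250  25.426  1.00 46.23           C",
    "HETATM  356  C4' HPD C  18     -15.263  51.093  26.291  1.00 45.72           C",
    "HETATM  357  C2' HPD C  18     -16.030  52.889  24.898  1.00 49.05           C",
]

print(abasic_candidates(DEMO_LINES))
```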

This is yet another little detail that DSSR takes care of. It is the close attention to many such subtle points that makes DSSR different. Overall, DSSR represents my view of what a scientific software program could be (or should be).



Stem, helix, and coaxial stacking in DSSR

DSSR deliberately makes a distinction between ‘stem’ and ‘helix’, as shown below:

a helix is defined by base-stacking interactions, regardless of bp type and backbone connectivity, and may contain more than one stem.

a stem is defined as a helix consisting of only canonical WC/wobble pairs, with a continuous backbone.

By definition, a helix or stem consists of at least two base-pairs with stacking interactions. Helix is more inclusive and may contain more than one stem. This differentiation between ‘helix’ and ‘stem’ naturally leads to the definition of coaxial stacking, another widely used yet vaguely specified concept.
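
To make the relationship concrete, here is a toy Python sketch that splits one helix into its stems. It assumes a hypothetical data model in which a helix is an ordered list of stacked base pairs, each flagged as canonical or not, and as backbone-linked to the next pair or not. This is a simplified reading of the definitions above, not DSSR's algorithm, and all names are invented for illustration.

```python
def split_into_stems(helix):
    """Return runs of >= 2 canonical pairs joined by a continuous backbone."""
    stems, run = [], []
    for i, bp in enumerate(helix):
        if not bp["canonical"]:
            continue
        run.append(bp["name"])
        nxt = helix[i + 1] if i + 1 < len(helix) else None
        # Close the current run at the helix end, at a non-canonical pair,
        # or where the backbone is interrupted.
        if nxt is None or not nxt["canonical"] or not bp["linked_to_next"]:
            if len(run) >= 2:
                stems.append(run)
            run = []
    return stems


def bp(name, canonical=True, linked_to_next=True):
    return {"name": name, "canonical": canonical, "linked_to_next": linked_to_next}


# Hypothetical helix: two stems separated by a backbone break, followed by
# a non-canonical pair and an isolated canonical pair.
toy_helix = [
    bp("p1"), bp("p2"), bp("p3", linked_to_next=False),
    bp("p4"), bp("p5"),
    bp("p6", canonical=False),
    bp("p7"),
]

stems = split_into_stems(toy_helix)
print(stems)
print("coaxial stacking:", len(stems) >= 2)
```

In this picture, coaxial stacking simply means that one helix contains two or more stems.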

Again, the abstract notion can be best illustrated with a concrete example. In the classic yeast phenylalanine tRNA (PDB id: 1ehz), DSSR identifies that two stems [the acceptor stem (right) and the T stem (left)] are coaxially stacked within one double helix. See the figure below.

tRNA acceptor and T stems in one helix (1ehz)

In the cartoon-block schematic above, each Watson-Crick base pair is rendered as a single, long rectangular block. The base identities of the G–U wobble pair and the two non-canonical pairs (left terminal) are illustrated separately, with a larger block for purines (G and A) and a smaller one for pyrimidines (C, U, and T).

I picked ‘stem’ as the more specialized duplex term because it is widely used in the RNA stem-loop structure, and in describing the four ‘paired regions’ of the classic tRNA cloverleaf secondary structure. On the other hand, ‘helix’ is (to me at least) a more general term, and thus more inclusive. It is worth noting that other terms such as ‘arm’, ‘paired region’, or ‘helix’ have also been used interchangeably in the literature to refer to what DSSR designates as ‘stem’.

As a side note, the basic algorithm for identifying helices/stems in DSSR is also applicable for detecting G-quadruplexes. The same idea of ‘helix’ or ‘stem’ also applies here (see the figure below for PDB entry 5dww). Indeed, as of v1.7.0-2017oct19, DSSR contains a new section for the identification and characterization of G-quadruplexes.

G-quadruplex (PDB entry: 5dww)

DSSR is “an integrated software tool for dissecting the spatial structure of RNA”. It excels in bringing diverse pieces together within a coherent framework, readily accessible in a solid software product. DSSR may well serve as a cornerstone in RNA structural bioinformatics and would facilitate communications in the broad areas related to nucleic acid structures.



Base stacks in non-stem regions

Among the rich set of RNA structural features derived by DSSR, the section of “List of stacks” apparently has not drawn much attention from the user community. As noted in the DSSR output,

a stack is an ordered list of nucleotides assembled together via base-stacking interactions, regardless of backbone connectivity. Stacking interactions within a stem are not included.

As always, the concept is best illustrated via concrete examples. Shown below are two such base stacks automatically identified by DSSR in the PDB entry 4p5j, the crystal structure of the tRNA-mimic from Turnip Yellow Mosaic Virus (TYMV), which was analyzed in detail in the 2015 DSSR NAR paper.

tRNA mimic linchpin stabilized by base-stacking
This critical linchpin in the tRNA mimic is stabilized by extensive base-stacking interactions.

The D- and T-loops stabilized by base-stacking
The intricate interactions between the D- and T-loops in the tRNA mimic include a five-base stack.

The DSSR-introduced schematic block representation makes the base-stacking interactions immediately obvious. One can even easily discern the identity of bases, given the color-coding convention: A-red; C-yellow; G-green; T-blue; U-cyan. For example, the five stacked bases involved in the interaction of the D- and T-loops are: CAAAC

Moreover, longer and more complicated base stacks can also be auto-detected by DSSR, as shown below for the asymmetric unit of PDB entry 1j8g, the crystal structure of an RNA quadruplex r(UGGGGU)4 at 0.61 Å resolution. Here DSSR identifies two 10-base stacks, each of UGGGGGGGGU (UG8U).

Two 10-base stacks in 1j8g

The corresponding DSSR output is as below:

List of 2 stacks
  Note: a stack is an ordered list of nucleotides assembled together via
        base-stacking interactions, regardless of backbone connectivity.
        Stacking interactions within a stem are *not* included.
   1 nts=10 UGGGGGGGGU A.U6,A.G5,A.G4,A.G3,A.G2,C.G22,C.G23,C.G24,C.G25,C.U26
   2 nts=10 UGGGGGGGGU B.U16,B.G15,B.G14,B.G13,B.G12,D.G32,D.G33,D.G34,D.G35,D.U36



Identification and characterization of G-quadruplexes

G-quadruplexes (hereafter referred to as G4) are a common type of higher-order DNA and RNA structures formed from G-rich sequences. The building block of G4 is a tetrad of guanines in a cyclic planar alignment, with four G+G pairs (cW+M type, see Figure below). A G4 structure is formed by stacking of G-tetrads and stabilized by cations at the center of the layers. G4 structures are polymorphic: the four strands can be parallel or anti-parallel, and loops connecting them can be of different types: lateral (edgewise), diagonal, or propeller (double-chain reversal). Moreover, G4 structures can be intra- or intermolecular, and even contain bulges.

From its initial releases, DSSR was able to detect G-tetrads, and listed them in a separate section. As of v1.7.0-2017oct19, DSSR has integrated existing features and created a new module to automatically identify and fully characterize G4 structures. The underlying algorithms have been further refined in v1.7.1-2017nov01, which was tested against all nucleic-acid-containing structures in the PDB.

Characterizations of three representative G4 examples (PDB entries 2m4p, 2hy9, and 5hix) are shown below, illustrating salient features (e.g., different types of loops) automatically extracted by DSSR.


stem#1[#1] layers=3 INTRA-molecular parallel bulged-strands=1
   1 syn=---- WC-->Major area=8.38  rise=3.64 twist=33.34 nts=4 GGGG A.DG3,A.DG8,A.DG12,A.DG16
   2 syn=---- WC-->Major area=10.73 rise=3.23 twist=32.42 nts=4 GGGG A.DG5,A.DG9,A.DG13,A.DG17
   3 syn=---- WC-->Major                                  nts=4 GGGG A.DG6,A.DG10,A.DG14,A.DG18
    strand#1* +1 DNA syn=--- nts=3 GGG A.DG3,A.DG5,A.DG6 bulged-nts=1 T A.DT4
    strand#2  +1 DNA syn=--- nts=3 GGG A.DG8,A.DG9,A.DG10
    strand#3  +1 DNA syn=--- nts=3 GGG A.DG12,A.DG13,A.DG14
    strand#4  +1 DNA syn=--- nts=3 GGG A.DG16,A.DG17,A.DG18
    loop#1 type=propeller strands=[#1,#2] nts=1 T A.DT7
    loop#2 type=propeller strands=[#2,#3] nts=1 T A.DT11
    loop#3 type=propeller strands=[#3,#4] nts=1 T A.DT15


stem#1[#1] layers=3 INTRA-molecular anti-parallel
   1 syn=ss-s Major-->WC area=13.69 rise=3.14 twist=19.08 nts=4 GGGG 1.DG4,1.DG10,1.DG18,1.DG22
   2 syn=--s- WC-->Major area=13.40 rise=3.05 twist=28.05 nts=4 GGGG 1.DG5,1.DG11,1.DG17,1.DG23
   3 syn=--s- WC-->Major                                  nts=4 GGGG 1.DG6,1.DG12,1.DG16,1.DG24
    strand#1  +1 DNA syn=s-- nts=3 GGG 1.DG4,1.DG5,1.DG6
    strand#2  +1 DNA syn=s-- nts=3 GGG 1.DG10,1.DG11,1.DG12
    strand#3  -1 DNA syn=-ss nts=3 GGG 1.DG18,1.DG17,1.DG16
    strand#4  +1 DNA syn=s-- nts=3 GGG 1.DG22,1.DG23,1.DG24
    loop#1 type=propeller strands=[#1,#2] nts=3 TTA 1.DT7,1.DT8,1.DA9
    loop#2 type=lateral   strands=[#2,#3] nts=3 TTA 1.DT13,1.DT14,1.DA15
    loop#3 type=lateral   strands=[#3,#4] nts=3 TTA 1.DT19,1.DT20,1.DA21


stem#1[#1] layers=4 inter-molecular anti-parallel
   1 syn=s--s Major-->WC area=12.93 rise=3.64 twist=16.82 nts=4 GGGG A.DG1,B.DG4,A.DG12,B.DG9
   2 syn=-ss- WC-->Major area=18.96 rise=3.71 twist=35.87 nts=4 GGGG A.DG2,B.DG3,A.DG11,B.DG10
   3 syn=s--s Major-->WC area=15.16 rise=3.64 twist=18.64 nts=4 GGGG A.DG3,B.DG2,A.DG10,B.DG11
   4 syn=-ss- WC-->Major                                  nts=4 GGGG A.DG4,B.DG1,A.DG9,B.DG12
    strand#1  +1 DNA syn=s-s- nts=4 GGGG A.DG1,A.DG2,A.DG3,A.DG4
    strand#2  -1 DNA syn=-s-s nts=4 GGGG B.DG4,B.DG3,B.DG2,B.DG1
    strand#3  -1 DNA syn=-s-s nts=4 GGGG A.DG12,A.DG11,A.DG10,A.DG9
    strand#4  +1 DNA syn=s-s- nts=4 GGGG B.DG9,B.DG10,B.DG11,B.DG12
    loop#1 type=diagonal  strands=[#1,#3] nts=4 TTTT A.DT5,A.DT6,A.DT7,A.DT8
    loop#2 type=diagonal  strands=[#2,#4] nts=4 TTTT B.DT5,B.DT6,B.DT7,B.DT8

Representative G4 structures

The molecular structure of the G-tetrad and two G4 structures in schematic representation. Upper left: atomic structure of the G-tetrad, the building block of G4 structures. Here the green ‘square’ is created by connecting the C1’ atoms of the guanosines, and it is used to simplify the representation of G4 structures of PDB entries 2m4p (lower left) and 5dww (right). Note that the asymmetric unit of 5dww contains four biological units, which are coaxially stacked in two columns.
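
In the three stem listings above, each strand line carries a +1 or -1 direction, and the stem is labeled parallel only when all four directions agree. The sketch below reads those directions from the text and applies that simple rule. The rule is an inference from these three examples rather than a statement of DSSR's full classification logic, and the function names are hypothetical.

```python
import re

# Matches stem strand lines such as '    strand#3  -1 DNA syn=-ss ...'
STRAND_RE = re.compile(r"strand#(\d+)\*?\s+([+-]1)\s")


def strand_directions(text):
    """Return the +1/-1 directions of the strand lines, in order of appearance."""
    return [int(direction) for _, direction in STRAND_RE.findall(text)]


def topology(directions):
    """Label a stem 'parallel' if all strands run the same way."""
    if not directions:
        raise ValueError("no strand directions found")
    return "parallel" if len(set(directions)) == 1 else "anti-parallel"


# Strand lines copied from the second listing above.
EXAMPLE = """\
    strand#1  +1 DNA syn=s-- nts=3 GGG 1.DG4,1.DG5,1.DG6
    strand#2  +1 DNA syn=s-- nts=3 GGG 1.DG10,1.DG11,1.DG12
    strand#3  -1 DNA syn=-ss nts=3 GGG 1.DG18,1.DG17,1.DG16
    strand#4  +1 DNA syn=s-- nts=3 GGG 1.DG22,1.DG23,1.DG24
"""

dirs = strand_directions(EXAMPLE)
print(dirs, topology(dirs))
```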

The DSSR output for PDB entry 5dww is listed below, showing the differences between a G4-helix and a G4-stem.


 Note: a G4-helix is defined by stacking interactions of G4-tetrads, regardless
        of backbone connectivity, and may contain more than one G4-stem.
  helix#1[#2] layers=6 inter-molecular stems=[#1,#2]
   1 syn=---- WC-->Major area=10.64 rise=3.54 twist=28.10 nts=4 GGGG A.DG3,A.DG7,A.DG11,A.DG16
   2 syn=.--- WC-->Major area=11.63 rise=3.65 twist=31.14 nts=4 GGGG A.DG2,A.DG6,A.DG10,A.DG15
   3 syn=---- WC-->Major area=28.36 rise=3.31 twist=-9.78 nts=4 GGGG A.DG1,A.DG5,A.DG9,A.DG14
   4 syn=---- Major-->WC area=11.60 rise=3.75 twist=29.43 nts=4 GGGG C.DG1,C.DG14,C.DG9,C.DG5
   5 syn=---- Major-->WC area=10.35 rise=3.49 twist=28.74 nts=4 GGGG C.DG2,C.DG15,C.DG10,C.DG6
   6 syn=---- Major-->WC                                  nts=4 GGGG C.DG3,C.DG16,C.DG11,C.DG7
    strand#1 DNA syn=-.---- nts=6 GGGGGG A.DG3,A.DG2,A.DG1,C.DG1,C.DG2,C.DG3
    strand#2 DNA syn=------ nts=6 GGGGGG A.DG7,A.DG6,A.DG5,C.DG14,C.DG15,C.DG16
    strand#3 DNA syn=------ nts=6 GGGGGG A.DG11,A.DG10,A.DG9,C.DG9,C.DG10,C.DG11
    strand#4 DNA syn=------ nts=6 GGGGGG A.DG16,A.DG15,A.DG14,C.DG5,C.DG6,C.DG7
List of 4 G4-stems
  Note: a G4-stem is defined as a G4-helix with backbone connectivity.
        Bulges are also allowed along each of the four strands.
  stem#1[#1] layers=3 INTRA-molecular parallel
   1 syn=---- WC-->Major area=11.63 rise=3.65 twist=31.14 nts=4 GGGG A.DG1,A.DG5,A.DG9,A.DG14
   2 syn=.--- WC-->Major area=10.64 rise=3.54 twist=28.10 nts=4 GGGG A.DG2,A.DG6,A.DG10,A.DG15
   3 syn=---- WC-->Major                                  nts=4 GGGG A.DG3,A.DG7,A.DG11,A.DG16
    strand#1  +1 DNA syn=-.- nts=3 GGG A.DG1,A.DG2,A.DG3
    strand#2  +1 DNA syn=--- nts=3 GGG A.DG5,A.DG6,A.DG7
    strand#3  +1 DNA syn=--- nts=3 GGG A.DG9,A.DG10,A.DG11
    strand#4  +1 DNA syn=--- nts=3 GGG A.DG14,A.DG15,A.DG16
    loop#1 type=propeller strands=[#1,#2] nts=1 T A.DT4
    loop#2 type=propeller strands=[#2,#3] nts=1 T A.DT8
    loop#3 type=propeller strands=[#3,#4] nts=2 TT A.DT12,A.DT13
  stem#2[#1] layers=3 INTRA-molecular parallel
   1 syn=---- WC-->Major area=11.60 rise=3.75 twist=29.43 nts=4 GGGG C.DG1,C.DG5,C.DG9,C.DG14
   2 syn=---- WC-->Major area=10.35 rise=3.49 twist=28.74 nts=4 GGGG C.DG2,C.DG6,C.DG10,C.DG15
   3 syn=---- WC-->Major                                  nts=4 GGGG C.DG3,C.DG7,C.DG11,C.DG16
    strand#1  +1 DNA syn=--- nts=3 GGG C.DG1,C.DG2,C.DG3
    strand#2  +1 DNA syn=--- nts=3 GGG C.DG5,C.DG6,C.DG7
    strand#3  +1 DNA syn=--- nts=3 GGG C.DG9,C.DG10,C.DG11
    strand#4  +1 DNA syn=--- nts=3 GGG C.DG14,C.DG15,C.DG16
    loop#1 type=propeller strands=[#1,#2] nts=1 T C.DT4
    loop#2 type=propeller strands=[#2,#3] nts=1 T C.DT8
    loop#3 type=propeller strands=[#3,#4] nts=2 TT C.DT12,C.DT13



Detection of multiplets in DSSR

In addition to base pairs, DSSR also automatically detects higher-order base associations. They are generally termed multiplets, consisting of three or more co-planar bases arranged together via H-bonding interactions. The simplest multiplets are base triplets. For example, the yeast phenylalanine tRNA (PDB entry 1ehz) contains four base triplets, as shown below:

Four base triplets in tRNA 1ehz detected by DSSR

The well-known (types I and II) A-minor motifs are also multiplets of three bases. Similarly, the G-tetrad where four guanine bases associate via Hoogsteen H-bonding to form a square planar structure is also a special multiplet. The G-tetrad is the building block of the G-quadruplexes. As of v1.7.0-2017oct19, DSSR can automatically identify and characterize G-quadruplexes (see the DSSR User Manual).

The DSSR algorithm for detecting multiplets is generally applicable. It can identify as many co-planar bases as available in a given structure. Shown below is an octad, consisting of a G-tetrad in the middle and four Us on the peripheries. The octad is derived from PDB entry 1j8g using atomic coordinates from biological assembly 1 and 3.

Octad detected by DSSR in PDB entry 1j8g



DSSR-Jmol featured in cover image of NAR'17 web-server issue

The DSSR-Jmol paper, titled "DSSR-enhanced visualization of nucleic acid structures in Jmol", has been officially published in the 2017 web-server issue of Nucleic Acids Research (NAR). Notably, the work has been featured in the cover image, as shown below:

Cover image featuring the DSSR-Jmol paper
Caption: 3D interactive visualization of selected RNA structural features enabled by the DSSR-Jmol integration. Clockwise from upper left: Structure of the xpt-pbuX guanine riboswitch in complex with hypoxanthine (PDB id: 4fe5) in ‘base blocks’ representation. The three-way junction loop encompassing the metabolite (in space-filling representation) is color-coded by base identity: A, red; C, yellow; G, green; U, cyan. The loop-loop interaction (a kissing-loop motif) at the top is highlighted in red (upper left corner). Structure of the Thermus thermophilus 30S ribosomal subunit in complex with antibiotics (PDB id: 1fjg) in step diagram. The 16S ribosomal RNA is color-coded in spectrum with the 5′-end in blue and the 3′-end in red (upper middle). Structure of the classic L-shaped yeast phenylalanine tRNA (PDB id: 1ehz) in step diagram, with the three hairpin loops highlighted in red and the [2,1,5,0] four-way junction loop in blue (upper right corner). Structure of the Pistol self-cleaving ribozyme (PDB id: 5ktj), showcasing (in red) the horizontal helix in space-filling representation. The helix is composed of six short stems stabilized via coaxial stacking interactions (bottom).

The DSSR-Jmol integration seamlessly bridges the DSSR command-line analysis tool and the Jmol molecular viewer via the standard JSON interface. Users can now select DSSR-derived RNA structural features (such as base pairs, double helices, and various loops) and visualize them interactively in novel representations in Jmol. Moreover, fine-grained characteristics of these features can be queried via the Jmol SQL for DSSR. The DSSR-Jmol integration fills a gap in RNA structural bioinformatics, and brings RNA visualization to an entirely new level. The web interface is fully functional and easy to use, serving a huge user base of researchers, educators, and students alike.

With the DSSR-Jmol integration featured in the cover image of the 2017 NAR web-server issue, DSSR's visibility should surely increase. Additionally, I've written a new post (on the 3DNA Forum) that provides the scripts and data files used to create the cover image.



SNAP is cited in the DNAproDB paper

I recently came across an article titled DNAproDB: an interactive tool for structural analysis of DNA-protein complexes by Sagendorf et al. in Nucleic Acids Research (NAR). Notably, the DNAproDB tool allows users to search the underlying database by combining features of the DNA, protein, or DNA-protein interactions at the interface. Compared to the well-established NUCPLOT tool which generates only ‘static’ schematic diagrams of protein-nucleic acid interactions, DNAproDB is interactive and more user friendly, with many new features.

It was a pleasant surprise to notice that SNAP was cited in the DNAproDB NAR paper, as follows:

Nucleotide-residue interaction geometry (stacking, pseudo-pairing or other) is determined using SNAP, a new component of the 3DNA program suite (35). SNAP also serves as a fall-back for calculating hydrogen bonds if HBPLUS cannot process the file.

I am glad that SNAP has also been used for identifying H-bonds where HBPLUS fails. The H-bonding detection algorithm, initially implemented in 3DNA (v2.3 and before) and refined in DSSR/SNAP, was originally intended to make the 3DNA software fully independent of third-party tools. I did not expect this feature could one day compete with dedicated H-bond finding tools, such as HBPLUS.

By the way, 3DNA is also cited in the DNAproDB NAR paper, as below:

DNA base pairing, shape parameters and conformation are derived from the 3DNA program suite (29) with a 10.0 Å cut-off for helix breaking.



DSSR-Jmol paper in NAR

I am pleased to announce the (advance online, May 3, 2017) publication of a new paper titled "DSSR-enhanced visualization of nucleic acid structures in Jmol" in Nucleic Acids Research (NAR). Co-authored by Robert Hanson (Jmol) and me (DSSR), the article will appear in the July 2017 web-server issue of NAR. Here are the key links related to the paper:

The DSSR-Jmol integration project was initiated in October 2013 when I approached Bob at a meeting organized by RCSB PDB at Rutgers. Thereafter, we met only once in July 2014 in Paris. Over the years, we have mostly communicated via email, occasionally facilitated by Skype. Our work bridges the DSSR command-line analyzing tool and the Jmol molecular viewer together via a simple JSON interface and a powerful query language. Users can now select DSSR-derived RNA structural features (such as base pairs, double helices, and various loops) as easily as they can select protein alpha-helices and beta-strands. Moreover, fine-grained characteristics of these features can be queried via Jmol SQL for DSSR (see examples below). Notably, the novel representation styles (step diagram and base blocks) and coloring schemes bring RNA visualization to an entirely new level (see Figure 3 of the paper).

load =1ehz/dssr   # load yeast phenylalanine tRNA to Jmol with DSSR annotation
SELECT hairpins   # select the three hairpin loops
SELECT junctions  # select the four-way junction loop
select within(dssr, "nts WHERE is_modified")  # select modified nucleotides (14 total)
SELECT within(dssr, "pairs WHERE name != 'WC'")  # select non-Watson-Crick pairs
SELECT within(dssr, "pairs WHERE name = 'WC' OR name = 'Wobble'")  # select canonical pairs
Select within(dssr, "pairs WHERE name != 'WC' AND name != 'Wobble'")  # select non-canonical pairs
SELECT within(dssr, "pairs WHERE LW = 'tSW'")  # select pairs of type tSW per Leontis-Westhof

The DSSR-Jmol integration fills a gap in RNA structural bioinformatics, serving a huge user base of researchers, educators, and students alike. Its functionality is freely accessible either via the Jmol application or the JSmol-based website. By adhering to web standards, the website is fully functional in all modern browsers on various computers/operating systems (including handheld devices, such as tablets and smartphones). The web interface is simple and intuitive, and new users can get started easily. It also allows power users to take full advantage of Jmol scripting via a command-line console.

This work also provides an example for integrating DSSR-derived features into other molecular graphics programs or bioinformatics pipelines involving nucleic acid structures. By design, DSSR is a stand-alone, command-line program written in ANSI C. The binary executables are only ~1MB in size, and self-contained. With zero dependencies and no setup or configuration, it is trivial to get DSSR up and running. DSSR uncovers a wide range of RNA/DNA structural features in a consistent, easily accessible framework. It possesses a much richer set of functionalities for nucleic acid structural analysis (see the DSSR User Manual) than any other existing tool I am aware of. Moreover, the program is efficient and robust, making it an ideal component to be integrated into other pipelines, especially via the standard and structured JSON interface.
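As a rough illustration of the JSON route, the Python sketch below mimics Jmol-style queries such as pairs WHERE name != 'WC' on a parsed DSSR JSON document. The helper names (run_dssr_json, select) are hypothetical, and the miniature sample dictionary is made up for demonstration rather than taken from real DSSR output. The sketch also assumes that x3dna-dssr is on the PATH and writes the JSON document to standard output when --json is given.

```python
import json
import subprocess


def run_dssr_json(pdb_file):
    """Run DSSR with --json and return the parsed result.

    Assumption: the x3dna-dssr executable is on the PATH and the JSON
    document is written to standard output.
    """
    result = subprocess.run(
        ["x3dna-dssr", f"-i={pdb_file}", "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def select(dssr, feature, where=lambda item: True):
    """Return the items of a feature list (e.g. "pairs") that satisfy `where`."""
    return [item for item in dssr.get(feature, []) if where(item)]


# Made-up miniature document, for demonstration only (not real DSSR output).
sample = {
    "pairs": [
        {"nt1": "A.G1", "nt2": "A.C72", "name": "WC"},
        {"nt1": "A.G4", "nt2": "A.U69", "name": "Wobble"},
        {"nt1": "A.U8", "nt2": "A.A14", "name": "rHoogsteen"},
    ]
}

# Counterparts of: pairs WHERE name != 'WC'
#              and: pairs WHERE name = 'WC' OR name = 'Wobble'
non_wc = select(sample, "pairs", lambda p: p["name"] != "WC")
canonical = select(sample, "pairs", lambda p: p["name"] in ("WC", "Wobble"))
print(len(non_wc), len(canonical))
```

With a real structure, one would replace the sample dictionary with the result of run_dssr_json("1ehz.pdb") and check the actual key names in the DSSR output before relying on them.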

Collaborating with Bob has been a truly exciting experience. The NAR-web publication represents a gratifying intermediate result along an on-going journey. Hopefully, others (maybe some of you) can join us in pushing forward the field of RNA structural bioinformatics.



DSSR in the structural analysis of an E.coli initiator tRNAfMet A1-U72 variant

While browsing the latest issue (May 2017) of the RNA journal, I came across the paper titled The structure of an E. coli tRNAfMet A1–U72 variant shows an unusual conformation of the A1–U72 base pair by Monestier et al. Reading through the text, I was pleasantly surprised by the two references to DSSR shown below:

An analysis using DSSR (Lu et al. 2015) identifies all the secondary structure elements characteristic of the classical cloverleaf secondary structure as well as usual tertiary interactions that stabilize the L-shaped tertiary fold of the molecule.

As a consequence, the opening parameter (Lu et al. 2015) of the A1–U72 base pair becomes unusually high (153.42°). The NH2 group of A1 points toward the minor groove of the acceptor helix. An interaction between the N1 of A1 and the O2 of U72 (d = 3.0 Å) is observed which requires protonation of the N1 atom of A1.

The PDB id for the deposited structure is 5l4o. Running DSSR on this structure is straightforward: x3dna-dssr -i=5l4o.pdb --more. As with the classic yeast phenylalanine tRNA (PDB id: 1ehz), DSSR identifies two helices, three hairpin loops, one [2,1,5,0] four-way junction loop, among other features.

With regard to the unusual A1-U72 pair highlighted in the title of the paper, DSSR provides the following information. Note the * in the unconventional N1*O2 H-bond.

   1 A.A1           A.U72          A+U --          n/a       tWW  tW+W
       [-14.4(...) ~C3'-endo lambda=32.9] [-172.4(anti) ~C3'-endo lambda=65.0]
       d(C1'-C1')=10.80 d(N1-N9)=9.19 d(C6-C8)=10.68 tor(C1'-N1-N9-C1')=173.6
       H-bonds[1]: "N1*O2(carbonyl)[2.99]"
       interBase-angle=6  Simple-bpParams: Shear=3.53 Stretch=1.71 Buckle=2.0 Propeller=-6.0
       bp-pars: [-0.32   3.91    0.01    6.32    -0.26   153.56]
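
The six numbers on the bp-pars line can be pulled out of the text output with a few lines of Python. The sketch below assumes they follow the usual 3DNA order (Shear, Stretch, Stagger, Buckle, Propeller, Opening); the function name parse_bp_pars is hypothetical and not part of DSSR.

```python
import re

# Assumed order of the six base-pair parameters (the usual 3DNA convention).
BP_PARAM_NAMES = ("Shear", "Stretch", "Stagger", "Buckle", "Propeller", "Opening")


def parse_bp_pars(line):
    """Extract the six values from a 'bp-pars: [...]' line as a dict."""
    match = re.search(r"bp-pars:\s*\[([^\]]*)\]", line)
    if match is None:
        raise ValueError("no bp-pars field found")
    values = [float(v) for v in match.group(1).split()]
    if len(values) != len(BP_PARAM_NAMES):
        raise ValueError("expected six base-pair parameters")
    return dict(zip(BP_PARAM_NAMES, values))


line = "       bp-pars: [-0.32   3.91    0.01    6.32    -0.26   153.56]"
params = parse_bp_pars(line)
print(params["Opening"])
```

Under that assumption, the last value is the opening angle that the paper singles out as unusually high.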

This citation is yet another example of DSSR’s adoption by experimental biologists. I expect to see more such uses of DSSR in the coming years.



Weird PDB entries

Recently, while analyzing a representative set of RNA structures from the PDB, I came across three weird entries. They are documented below, primarily for my own record.

  • 5els — “Structure of the KH domain of T-STAR in complex with AAAUAA RNA”. There are two alternative conformations for the six-nt AAAUAA RNA component, labeled A and B, respectively. Normally, the A/B alternative coordinates for each atom are put directly next to each other, and assigned the same chain id, as in 1msy for the phosphate group of G2669 on chain A. In 5els, however, the two alternative conformations (A/B) are separated into two chains: chain H for A, and chain I for B.
  • 1vql — “The structure of the transition state analogue ‘DCSN’ bound to the large ribosomal subunit of Haloarcula marismortui”. The three-nt fragment DA179—C180—C181 on chain 4 is in the 3’—>5’ direction.
  • 4r3i — “The crystal structure of m(6)A RNA with the YTHDC1 YTH domain”. The mmCIF file has a model number of 0, instead of 1 (as in other cases I am aware of).



Highlights of recent developments of 3DNA/DSSR

Dear 3DNA Forum subscribers,

Here are some highlights of recent developments of 3DNA/DSSR:

Note: If you have difficulty accessing the 3DNA homepage, as may be the case from mainland China (as far as I know), please visit its duplicate site. This newsletter is written in Markdown, with a translated HTML version posted on the 3DNA homepage.

3DNA v2.3

  • The C source code is now available. Since the programs are written in strict ANSI C, 3DNA can be compiled (as is) on any computer/operating system with a C (or C++) compiler. For user convenience, three binary distributions (with source code under the src/ subdirectory) are provided for Windows, Linux, and Mac OS X. The distributed Windows version works in native Windows (7 and up, via the cmd command-line interface, or ConEMU), MinGW/Msys (Msys2), and Cygwin, in either 32 or 64-bit.

  • A new set of ‘simple’ base-pair and step parameters was introduced to give ‘intuitive’ numerical values for non-Watson-Crick base pairs and associated steps. See the short communication titled Characterization of base pair geometry in the January 2016 issue of Computational Crystallography Newsletter (CCN).

  • The fiber program includes a new option, --pauling, for easy generation of Pauling & Corey triplex models of DNA/RNA with arbitrary base sequence. See my blogpost titled Pauling’s triplex model of nucleic acids is available in 3DNA.

  • Thomas Holder (PyMOL Principal Developer at Schrödinger, Inc.) has built a PyMOL wrapper to 3DNA fiber models. Now generating standard, regular DNA/RNA models in PyMOL is straightforward — thanks, Thomas!

DSSR (Dissecting the Spatial Structure of RNA)

  • Selected features of DSSR have been incorporated into Jmol (in collaboration with Robert Hanson, Jmol Principal Developer), and PyMOL (in collaboration with Thomas Holder). In the Jmol application (via the Console window), one can now, for example, load =1ehz/dssr and then select hairpins; color red to see where the three hairpin loops are in 3D. The Jmol-DSSR web interface makes DSSR-enhanced visualization of nucleic acid structures in Jmol readily accessible to a broad user base, and has been employed in classes for educational purposes. A sample image of DSSR-derived cartoon-block representation via PyMOL is available for PDB entry 5dww, which has a G-quadruplex-duplex interface.

  • Since the publication of the Nucleic Acids Research paper in 2015, DSSR has been continuously refined and expanded, with a total of 36 new releases (from v1.2.8 to v1.6.4) as of this writing. Notably, the --json option provides DSSR-derived parameters in the simple, structured, and standard JSON format that can be easily parsed. This JSON output format is the (preferred) way for the outside world to interface with DSSR, and the Jmol-DSSR integration is built upon it. The --nmr option allows for batch processing of MODEL/ENDMDL-delineated NMR ensembles or trajectories of molecular dynamics (MD) simulations. Did you know that scripts and data files for reproducing the reported results are available in the DSSR-NAR paper section on the 3DNA Forum?

  • The User Manual is now 88 pages long, yet it covers only the most common use cases of what DSSR has to offer. Miss a feature that you would like to have? Maybe it is already there or can be easily implemented in DSSR. Simply ask (on the 3DNA Forum), and I’ll try my best to help.

SNAP (Structures of Nucleic Acid-Protein complexes)

  • SNAP aims to consolidate, refine, and significantly extend commonly used functionalities for DNA/RNA-protein structural analysis in one easy-to-use program. Currently in beta testing, SNAP is already fully functional, with features for characterizing the protein-nucleic acid interface and identifying amino acid-base pairing and stacking interactions.

A note for 3DNA/DSSR users in mainland China: It’s a pleasure to see the ~100 registrations on the 3DNA Forum with emails ending in .cn, etc., mostly from recent years. I’m planning a trip to China in 2017, and I’d be happy to meet some of you for academic exchanges and possible collaborations (学术交流、合作). If you’re interested, let’s get in touch!

Best regards,


Dr. Xiang-Jun Lu (律祥俊)



Pauling's triplex model of nucleic acids is available in 3DNA

In 1953, Pauling and Corey published an influential paper, titled A proposed structure for the nucleic acids, in Proc. Natl. Acad. Sci. (PNAS). Key features of the proposed model are summarized in their Letter to Nature, Structure of the Nucleic Acids, published in Nature on February 21, 1953.

We have formulated a structure for the nucleic acids which is compatible with the main features of the X-ray diagram and with the general principles of molecular structure, and which accounts satisfactorily for some of the chemical properties of the substances. The structure involves three intertwined helical polynucleotide chains. Each chain, which is formed by phosphate di-ester groups and linking β-D-ribofuranose or β-D-deoxyribofuranose residues with 3′, 5′ linkages, has approximately twenty-four nucleotide residues in seven turns of the helix. The helixes have the sense of a right-handed screw. The phosphate groups are closely packed about the axis of the molecule, with the pentose residues surrounding them, and the purine and pyrimidine groups projecting radially, their planes being approximately perpendicular to the molecular axis. The operation that converts one residue to the next residue in the polynucleotide chain is rotation by about 105° and translation by 3.4 Å.

This triplex model of nucleic acids, with phosphates in the center and bases on the outside, turned out to be fundamentally flawed. Yet, it played a significant role by prompting Watson and Crick in their discovery of the DNA double helix structure. Although I had long been aware of the Pauling triplex model, I had not read the original Pauling & Corey PNAS paper. Not surprisingly, I did not know what the triplex structure really looks like, other than some general ideas.

In a recent trip to Rutgers, Dr. Wilma Olson and I discussed the applications of fiber models collected in 3DNA. She drew my attention to the Pauling triplex model, and showed me Table 1 of the PNAS paper (see below), where the atomic coordinates for a nucleic acid repeating unit are listed.

Atomic coordinates of the Pauling triplex

The cylindrical format is the same as that for the fiber models in 3DNA. It thus seems fitting to add this historically significant triplex model to the collection. Googling revealed many interesting historical notes and comments, e.g. The Pauling-Corey Structure of DNA, and a short video Linus Pauling’s triple DNA helix model, 3D animation with basic narration. However, I failed to find a program that I could use to generate such a triplex model with a generic base sequence. I decided to add the fiber --pauling option so users can easily create such a triplex model in 3D, just as they do for a classic A- or B-DNA duplex. This process has turned out to be very educational (detailed below), and the end result should be of general interest.

3D image of the repeating unit (cytosine) in Pauling triplex

  • The left 3D image shows the nomenclature of atoms used by Pauling & Corey (see Table 1 above), which is dramatically different from current conventions. As an example, it should be the N1 atom of cytosine (a pyrimidine base), not N3, that is connected to the sugar C1′ atom in current nomenclature. The corrections apply not only to base atoms, but also to the sugar and phosphate groups. The revised atom labeling (as used in the PDB) is illustrated in the 3D image on the right.
  • Table 1 corresponds to the ribose sugar since it contains an O2′ atom (see also the figure above). The triplex model constructed would be RNA, but can be ‘converted’ to DNA by simply removing the O2′ atom (see below).
  • Only the atomic coordinates for cytosine are listed in Table 1. The 3DNA mutate_bases program came in handy to get the corresponding atomic coordinates for A, G, T, and U. This expansion allows for the generation of Pauling’s triplex models with an arbitrary combination of the five common bases (A, C, G, T, and U).
  • With the new fiber --pauling option, users can now conveniently generate a Pauling triplex RNA/DNA model as shown below. Note that the single-dash variant -pauling also works, and that the additional -dna option gives the DNA deoxyribose sugar. The PDB file (Pauling-triplex-mixed.pdb) with mixed DNA sequences can be downloaded, and the corresponding 3D image in top and side views is shown in the following figure.
        fiber -pauling triplex-C10C10C10.pdb        # default: 10 Cs per strand
        fiber -pauling -seq=AAA triplex-A3A3A3.pdb  # 3 As per strand
        fiber -pauling -seq=AAAA:CCCC:GGGG Pauling-triplex-A4C4G4.pdb
        fiber -pauling -seq=ACGGUU,UUGGAC,GGAACC  Pauling-triplex-mixed.pdb
        fiber --pauling-dna -seq=ACGGTT,TTGGAC,GGAACC  Pauling-triplex-DNA.pdb

Sample Pauling DNA triplex generated with 3DNA

  • With 3DNA’s find_pair/analyze pair of programs, one can get the structural parameters corresponding to the Pauling triplex model. Not surprisingly, the repeating dinucleotide along each strand has a twist of 105°, and a rise of 3.4 Å. Notably, the sugar has a C2′-endo conformation.
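
The numbers quoted from Pauling & Corey fit together: about twenty-four residues in seven turns gives 7 × 360° / 24 = 105° per residue. The Python sketch below checks that arithmetic and applies the helical operation (rotation by the twist, translation by the rise) to one atom given in cylindrical coordinates (r, φ, z). It is a minimal sketch with hypothetical function names and a made-up atom position, not a value from Table 1, and it assumes that a positive rotation about +z corresponds to the right-handed sense. It is not the code used by fiber.

```python
import math


def twist_per_residue(residues, turns):
    """Rotation (degrees) between successive residues of a regular helix."""
    return 360.0 * turns / residues


def build_strand(r, phi_deg, z, n_residues, twist=105.0, rise=3.4):
    """Repeat one atom along a helix and return Cartesian (x, y, z) tuples.

    The atom starts at cylindrical coordinates (r, phi_deg, z); each step
    rotates it by `twist` degrees about the z-axis and shifts it by `rise`.
    """
    coords = []
    for i in range(n_residues):
        phi = math.radians(phi_deg + i * twist)
        coords.append((r * math.cos(phi), r * math.sin(phi), z + i * rise))
    return coords


print(twist_per_residue(24, 7))

# Made-up atom position, for illustration only.
strand = build_strand(2.0, 30.0, 0.0, 3)
for x, y, z in strand:
    print(f"{x:8.3f}{y:8.3f}{z:8.3f}")
```

Every copy of the atom stays at the same distance from the helix axis, and only its angle and height change from one residue to the next.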



3DNA fiber models

3DNA contains 55 fiber models compiled from literature, plus a derived RNA model (as of v2.1). To the best of my knowledge, this is the most comprehensive collection of regular DNA/RNA models. Please see Table 4 of the 2003 3DNA NAR paper for detailed structural features of these models and references.

The 55 models are based on the following works:

  • Chandrasekaran & Arnott (from #1 to #43) — the most well-known set of fiber models
  • Alexeev et al. (#44-#45)
  • van Dam & Levitt (#46-#47)
  • Premilat & Albiser (#48-#55)

The utility program fiber makes all these fiber models available through a simple, consistent interface, and produces coordinate files in either PDB or PDBML format. Of those models, some can be built with an arbitrary sequence of A, C, G and T (e.g., A-/B-/C-DNA from calf thymus), while others are of fixed sequences (e.g., Z-DNA with GC repeats). The sequence can be specified either on the command line or in a plain text file, in either lower, UPPER, or MixED cases.

Once 3DNA is properly installed, the command-line interface is the most versatile and convenient way to generate, e.g., a regular double-stranded DNA (mostly, B-DNA) of arbitrary sequence. The command-help message (generated with fiber -h) is as below:

        fiber - generate 55 fiber models based on Arnott and other's work
        fiber [OPTION] PDBFILE
        generate 55 fiber models based on the repeating unit from Arnott's
        work, including the canonical A-, B-, C- and Z-DNA, triplex, etc
        -xml     output structure coordinates in PDBML format
        -num     a structure identification number in the range (1-55)
        -m, -l   brief description of the 55 fiber structures
        -a, -1   A-DNA model (calf thymus)
        -b, -4   B-DNA (calf thymus, default)
        -c, -47  C-DNA (BII-type nucleotides)
        -d, -48  D(A)-DNA  ploy d(AT) : ploy d(AT) (right-handed)
        -z, -15  Z-DNA poly d(GC) : poly d(GC)
        -rna     for RNA with arbitrary base sequence
        -seq=string specifying an arbitrary base sequence
        -single  output a single-stranded structure
        -h       this help message (any non-recognized options will do)
        An structural identification number (symbol)
        fiber fiber-BDNA.pdb
            # fiber -4 fiber-BDNA.pdb
            # fiber -b fiber-BDNA.pdb
        fiber -a fiber-ADNA.pdb
        fiber -seq=AAAGGUUU -rna fiber-RNA.pdb
        fiber -seq=AAAGGUUU -rna -single fiber-ssRNA.pdb
        PDB file
        analyze, anyhelix, find_pair
        3DNA v2.3-2016sept06, created and maintained by Xiang-Jun Lu (PhD)

Please post questions/comments on the 3DNA Forum:

Moreover, the w3DNA, 3D-DART web-interfaces, and the PyMOL wrapper make it easy to generate a regular DNA (or RNA) model, especially for occasional users or for educational purposes.
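
For scripted use, the command line can also be assembled programmatically. The Python sketch below builds a fiber argument list from the options listed in the help message above (-seq, -rna, -single, and a numeric model id such as -4). The function names are hypothetical, and run_fiber assumes that the fiber executable is on the PATH with 3DNA properly set up.

```python
import subprocess


def fiber_command(pdb_file, seq=None, num=None, rna=False, single=False):
    """Assemble the argument list for a 3DNA fiber call (nothing is run)."""
    args = ["fiber"]
    if num is not None:
        if not 1 <= num <= 55:
            raise ValueError("structure id must be in the range 1-55")
        args.append(f"-{num}")        # e.g. -4 for B-DNA (calf thymus)
    if seq is not None:
        args.append(f"-seq={seq}")
    if rna:
        args.append("-rna")
    if single:
        args.append("-single")
    args.append(pdb_file)
    return args


def run_fiber(pdb_file, **options):
    """Run fiber; assumes the executable is on the PATH."""
    subprocess.run(fiber_command(pdb_file, **options), check=True)


# Same calls as in the examples of the help message.
print(fiber_command("fiber-BDNA.pdb", num=4))
print(fiber_command("fiber-RNA.pdb", seq="AAAGGUUU", rna=True))
print(fiber_command("fiber-ssRNA.pdb", seq="AAAGGUUU", rna=True, single=True))
```

Keeping the command as a list, rather than a single shell string, avoids quoting problems when sequences or file names come from user input.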

In principle, nothing is worth showing off with regard to 3DNA’s fiber model generation functionality. Nevertheless, this handy tool serves as a clear example of the differences between a “proof of concept” and a pragmatic software application. I initially decided to work on this tool simply for my own convenience. At that time, I had access to A-DNA and B-DNA fiber model generators, each as a separate program. Moreover, the constructed models did not comply with the PDB format in atom naming, among other subtleties.

I started with the Chandrasekaran & Arnott fiber models, for which I had a copy of the data files. However, there were many details to work out, typos to correct, etc. to put them in a consistent framework. For other models, I had to read each original publication and type the raw atomic cylindrical coordinates into the computer. Again, quite a few inconsistencies popped up between the different publications, which span decades.

Overall, it was quite a tedious undertaking, requiring great attention to detail. I am glad that I did it: I learned so much from the process and, more importantly, others can benefit from my effort. As I put it in the 3DNA Nature Protocols paper (BOX 6 | FIBER-DIFFRACTION MODELS),

In preparing this set of fiber models, we have taken great care to ensure the accuracy and consistency of the models. For completeness and user verification, 3DNA includes, in addition to 3DNA-processed files, the original coordinates collected from the literature.

For those who want to understand what’s going on under the hood, there is no better way than to try to reproduce the process using, e.g., fiber B-DNA as an example.

From the very beginning, I had expected the 3DNA fiber functionality to serve as a handy tool for building a regular DNA duplex of chosen sequence. Over the years, the fiber program has gradually attracted attention from the community. The recent PyMOL wrapper by Thomas Holder is a clear sign of its increased popularity, and has prompted me to write this post, adapted largely from the one titled Fiber models in 3DNA make it easy to build regular DNA helices (dated Friday, October 9, 2009).

See also PyMOL wrapper to 3DNA fiber models


Given below is the content of the README file for fiber models in 3DNA:

1. The repeating units of each fiber structure are mostly based on the
   work of Chandrasekaran & Arnott (from #1 to #43). More recent fiber
   models are based on Alexeev et al. (#44-#45), van Dam & Levitt (#46
   -#47) and Premilat & Albiser (#48-#55).

2. Clean up of each residue
   a. currently ignore hydrogen atoms [can be easily added]
   b. change ME/C7 group of thymine to C5M
   c. re-assign O3' atom to be attached with C3'
   d. change distance unit from nm to A [most of the entries]
   e. re-ordering atoms according to the NDB convention

3. Fix up of problem structures.
   a. str#8 has no N9 atom for guanine
   b. str#10 is not available from the disk, manually input
   c. str#14 C5M atom was named C5 for Thymine, resulting two C5 atoms
   d. str#17 has wrong assignment of O3' atom on Guanine
   e. str#33 has wrong C6 position in U3
   f. str#37 to #str41 were typed in manually following Arnott's
        new list as given in "Oxford Handbook of Nucleic Acid Structure"
        edited by S. Neidle (Oxford Press, 1999)
   g. str#38 coordinates for N6(A) and N3(T) are WRONG as given in the
        original literature
   h. str#39 and #40 have the same O3' coordinates for the 2nd strand

4. str#44 & 45 have fixed strand II residues (T)

5. str#46 & 47 have +z-axis upwards (based on BI.pdb & BII.pdb)

6. str#48 to 55 have +z-axis upwards

List of 55 fiber structures

id#  Twist   Rise        Structure description
    (dgrees)  (A)
 1   32.7   2.548  A-DNA  (calf thymus; generic sequence: A, C, G and T)
 2   65.5   5.095  A-DNA  poly d(ABr5U) : poly d(ABr5U)
 3    0.0  28.030  A-DNA  (calf thymus) poly d(A1T2C3G4G5A6A7T8G9G10T11) :
                                        poly d(A1C2C3A4T5T6C7C8G9A10T11)
 4   36.0   3.375  B-DNA  (calf thymus; generic sequence: A, C, G and T)
 5   72.0   6.720  B-DNA  poly d(CG) : poly d(CG)
 6  180.0  16.864  B-DNA  (calf thymus) poly d(C1C2C3C4C5) : poly d(G6G7G8G9G10)
 7   38.6   3.310  C-DNA  (calf thymus; generic sequence: A, C, G and T)
 8   40.0   3.312  C-DNA  poly d(GGT) : poly d(ACC)
 9  120.0   9.937  C-DNA  poly d(G1G2T3) : poly d(A4C5C6)
10   80.0   6.467  C-DNA  poly d(AG) : poly d(CT)
11   80.0   6.467  C-DNA  poly d(A1G2) : poly d(C3T4)
12   45.0   3.013  D-DNA  poly d(AAT) : poly d(ATT)
13   90.0   6.125  D-DNA  poly d(CI) : poly d(CI)
14  -90.0  18.500  D-DNA  poly d(A1T2A3T4A5T6) : poly d(A1T2A3T4A5T6)
15  -60.0   7.250  Z-DNA  poly d(GC) : poly d(GC)
16  -51.4   7.571  Z-DNA  poly d(As4T) : poly d(As4T)
17    0.0  10.200  L-DNA  (calf thymus) poly d(GC) : poly d(GC)
18   36.0   3.230  B'-DNA alpha poly d(A) : poly d(T) (H-DNA)
19   36.0   3.233  B'-DNA beta2 poly d(A) : poly d(T) (H-DNA  beta)
20   32.7   2.812  A-RNA  poly (A) : poly (U)
21   30.0   3.000  A'-RNA poly (I) : poly (C)
22   32.7   2.560  Hybrid poly (A) : poly d(T)
23   32.0   2.780  Hybrid poly d(G) : poly (C)
24   36.0   3.130  Hybrid poly d(I) : poly (C)
25   32.7   3.060  Hybrid poly d(A) : poly (U)
26   36.0   3.010  10-fold poly (X) : poly (X)
27   32.7   2.518  11-fold poly (X) : poly (X)
28   32.7   2.596  Poly (s2U) : poly (s2U) (symmetric base-pair)
29   32.7   2.596  Poly (s2U) : poly (s2U) (asymmetric base-pair)
30   32.7   3.160  Poly d(C) : poly d(I) : poly d(C)
31   30.0   3.260  Poly d(T) : poly d(A) : poly d(T)
32   32.7   3.040  Poly (U) : poly (A) : poly(U) (11-fold)
33   30.0   3.040  Poly (U) : poly (A) : poly(U) (12-fold)
34   30.0   3.290  Poly (I) : poly (A) : poly(I)
35   31.3   3.410  Poly (I) : poly (I) : poly(I) : poly(I)
36   60.0   3.155  Poly (C) or poly (mC) or poly (eC)
37   36.0   3.200  B'-DNA beta2  Poly d(A) : poly d(U)
38   36.0   3.240  B'-DNA beta1  Poly d(A) : poly d(T)
39   72.0   6.480  B'-DNA beta2  Poly d(AI) : poly d(CT)
40   72.0   6.460  B'-DNA beta1  Poly d(AI) : poly d(CT)
41  144.0  13.540  B'-DNA  Poly d(AATT) : poly d(AATT)
42   32.7   3.040  Poly(U) : poly d(A) : poly(U) [cf. #32]
43   36.0   3.200  Beta Poly d(A) : Poly d(U) [cf. #37]
44   36.0   3.233  Poly d(A) : poly d(T) (Ca salt)
45   36.0   3.233  Poly d(A) : poly d(T) (Na salt)
46   36.0   3.38   B-DNA (BI-type nucleotides; generic sequence: A, C, G and T)
47   40.0   3.32   C-DNA (BII-type nucleotides; generic sequence: A, C, G and T)
48   87.8   6.02   D(A)-DNA  ploy d(AT) : ploy d(AT) (right-handed)
49   60.0   7.20   S-DNA  ploy d(CG) : poly d(CG) (C_BG_A, right-handed)
50   60.0   7.20   S-DNA  ploy d(GC) : poly d(GC) (C_AG_B, right-handed)
51   31.6   3.22   B*-DNA  poly d(A) : poly d(T)
52   90.0   6.06   D(B)-DNA  poly d(AT) : poly d(AT) [cf. #48]
53  -38.7   3.29   C-DNA (generic sequence: A, C, G and T) (depreciated)
54   32.73  2.56   A-DNA (generic sequence: A, C, G and T) [cf. #1]
55   36.0   3.39   B-DNA (generic sequence: A, C, G and T) [cf. #4]
List 1-41 based on Struther Arnott: ``Polynucleotide secondary structures:
     an historical perspective'', pp. 1-38 in ``Oxford Handbook of Nucleic
     Acid Structure'' edited by Stephen Neidle (Oxford Press, 1999).

     #42 and #43 are from Chandrasekaran & Arnott: "The Structures of DNA
     and RNA Helices in Oriented Fibers", pp 31-170 in "Landolt-Bornstein
     Numerical Data and Functional Relationships in Science and Technology"
     edited by W. Saenger (Springer-Verlag, 1990).

#44-#45 based on Alexeev et al., ``The structure of poly(dA) . poly(dT)
     as revealed by an X-ray fiber diffraction''. J. Biomol. Str. Dyn, 4,
     pp. 989-1011, 1987.

#46-#47 based on van Dam & Levitt, ``BII nucleotides in the B and C forms
     of natural-sequence polymeric DNA: a new model for the C form of DNA''.
     J. Mol. Biol., 304, pp. 541-561, 2000.

#48-#55 based on Premilat & Albiser, ``A new D-DNA form of poly(dA-dT) .
     poly(dA-dT): an A-DNA type structure with reversed Hoogsteen Pairing''.
     Eur. Biophys. J., 30, pp. 404-410, 2001 (and several other publications).
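
As a quick way to read the table, the two numeric columns can be turned into more familiar helix descriptors. The sketch below assumes that the second and third columns are the twist (in degrees) and rise (in Å) per repeating unit; the function name helix_geometry is made up for this illustration and is not part of 3DNA.

```python
def helix_geometry(twist_deg, rise_angstrom):
    """Return (repeating units per turn, pitch in angstroms) for a regular helix."""
    # A negative twist (left-handed model) gives the same magnitude per turn.
    units_per_turn = 360.0 / abs(twist_deg)
    pitch = units_per_turn * rise_angstrom
    return units_per_turn, pitch


# (twist, rise) pairs copied from the list above
models = {
    "#46 B-DNA": (36.0, 3.38),
    "#47 C-DNA": (40.0, 3.32),
    "#54 A-DNA": (32.73, 2.56),
}

for label, (twist, rise) in models.items():
    units, pitch = helix_geometry(twist, rise)
    print(f"{label}: {units:.1f} units per turn, pitch {pitch:.1f} A")
```

For models whose repeating unit spans more than one nucleotide (e.g., #39 with a twist of 72.0°), the result counts repeating units, not nucleotides.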



PyMOL wrapper to 3DNA fiber models

Recently, I heard from Thomas Holder, the PyMOL Principal Developer (Schrödinger, Inc.), that he had written a wrapper to the 3DNA fiber command. This PyMOL wrapper is implemented as part of his versatile PSICO library (see the PyMOL Wiki page Psico for details), and exposes the 55 fiber models based on the work of Arnott and others to the wider PyMOL user community. Moreover, the wrapper can be accessed directly from PyMOL (without installing PSICO), as shown below with an example:

PyMOL> run

The resulting fiber model is the default B-form DNA of calf thymus, with twist of 36.0° and rise of 3.375 Å (see figure below). Note that the case of the base sequence does not matter, so fiber ctagcg or fiber CTAgcg will give the same result.

The 3DNA fiber tool in PyMOL

Running PyMOL> help fiber gives the following detailed usage information, which should be sufficient to get started with the fiber tool in PyMOL.

PyMOL> help fiber


    Run X3DNA's "fiber" tool.

    For the list of structure identification numbers, see for example:


    fiber seq [, num [, name [, rna [, single ]]]]


    seq = str: single letter code sequence or number of repeats for
    repeat models.

    num = int: structure identification number {default: 4}

    name = str: name of object to create {default: random unused name}

    rna = 0/1: 0=DNA, 1=RNA {default: 0}

    single = 0/1: 0=double stranded, 1=single stranded {default: 0}


    # environment (this could go into ~/.pymolrc or ~/.bashrc)
    os.environ["X3DNA"] = "/opt/x3dna-v2.3"

    # B or A DNA from sequence
    fiber CTAGCG
    fiber CTAGCG, 1, ADNA

    # double or single stranded RNA from sequence
    fiber AAAGGU, name=dsRNA, rna=1
    fiber AAAGGU, name=ssRNA, rna=1, single=1

    # poly-GC Z-DNA repeat model with 10 repeats
    fiber 10, 15 

Thanks to Thomas for making another connection between PyMOL and 3DNA/DSSR. The other one is the DSSR plugin for PyMOL, which creates “block”-shaped cartoons for nucleic acid bases and base pairs.

See also 3DNA fiber models



3DNA C source code is available

As of release v2.3-2016sept06, the C source code of the 3DNA software package is available. The code can be found in the $X3DNA/src folder of the distributed tarballs for Linux, Mac OS X, and Windows. Since 3DNA is written in pure ANSI C, it can be compiled without changes on any platform with a modern C compiler.

The original codebase of 3DNA was written around the year 2000. Up until v2.3, the infrastructure of 3DNA has remained stable for 16 years. During that time, 3DNA has been widely adopted in other bioinformatics pipelines and cited over 1,500 times. Over the years, I’ve received quite a few requests for the 3DNA source code. However, owing to various complicating factors (including software licensing), the crucial C programs of 3DNA had been distributed only as executables. Now, the C code of 3DNA is finally open source!

As before, users need to register on the 3DNA Forum to download the software. The download page also includes x3dna-v2.0.tar.gz that accompanied the 2008 Nature Protocols paper, and x3dna-v1.5.tar.gz that corresponded to the 2003 Nucleic Acids Research paper. Other than minor revisions to pass strict gcc compiler options, the v1.5 and v2.0 codebases are kept as they were. 3DNA is backward-compatible as far as the key base-pair parameters are concerned. Moreover, between v1.5 and v2.0, the command-line interface stays the same. The two previous versions are released for historical reasons. For example, one may notice some obvious “similarities” between 3DNA v1.5 and RNAView.

The development of DSSR and SNAP will push 3DNA into a brand new version (v3), which contains significant changes in functionality and interface, and is no longer compatible with previous versions. I intend to keep 3DNA v2.3 in a ‘maintenance’ mode: no new features are planned, but bug reports and user questions will be promptly addressed on the 3DNA Forum, as always. Making 3DNA open source should help further promote its adoption and adaptation in structural bioinformatics of nucleic acids.

There are numerous types of software licenses, but none of them seems to be a good fit for my purpose. As a result, I’ve come up with a permissive “citation-ware” license with contents as below:

3DNA is a suite of software programs for the analysis,
rebuilding and visualization of 3-Dimensional Nucleic Acid
structures. Permission to use, copy, modify, and distribute
this suite for any purpose, with or without fee, is hereby
granted, and subject to the following conditions:

At least one of the 3DNA papers must be cited, including the
following two primary ones:

   1. Lu, X. J., & Olson, W. K. (2003). "3DNA: a software
      package for the analysis, rebuilding and visualization
      of three‐dimensional nucleic acid structures." Nucleic
      Acids Research, 31(17), 5108-5121.

   2. Lu, X. J., & Olson, W. K. (2008). "3DNA: a versatile,
      integrated software system for the analysis,
      rebuilding and visualization of three-dimensional
      nucleic-acid structures." Nature Protocols, 3(7),


Any 3DNA-related questions, comments, and suggestions are
welcome and should be directed to the open 3DNA Forum



DSSR is used in the URS DataBase

Recently, I came across the article URS DataBase: universe of RNA structures and their motifs by Baulin et al. in Database, an online journal of biological databases and curation. I am glad to see that DSSR is used in the URSDB, as quoted below.

In the “Input data” subsection of “Materials and methods”:

RNA-containing structures were extracted from the PDB in mmCIF format; each file was divided into models. The base pairs (both canonical and non-canonical) and dinucleotide steps were annotated using the DSSR program from 3DNA toolkit (26). We also exploited detailed information provided by DSSR on given elements such as geometric parameters, types according to different classifications and various details on base conformations.

Moreover, under “Future development”, the authors said:

We plan to perform a comparative analysis of programs that annotate base pairs in RNA-containing PDB files. We will consider the four most popular programs, FR3D (35), MC-Annotate (36), RNAView (37) and DSSR (26). According to the analysis the annotation of the base pairs will be refined. In addition, we plan to include in the database annotations of base-phosphate, base-ribose and base stacking contacts and to implement search of such data.

It is gratifying to see DSSR listed as one of “the four most popular programs” for annotating RNA base pairs. It’d also be interesting to see how DSSR compares with FR3D, MC-Annotate, and RNAView from the user’s perspective.



DSSR in the validation of NMR RNA structures

With great interest, I read the article titled Improving NMR Structures of RNA by Bermejo et al. As is well known, solution NMR structures of RNA normally exhibit more steric clashes and conformational ambiguities than their X-ray crystal counterparts. The paper introduces an improved force field, RNA-ff1, for structure elucidation with Xplor-NIH. By adopting realistic atom radii and a new statistical torsional potential, the RNA-ff1 force field significantly improves covalent geometry and MolProbity validation scores (in steric contacts and backbone conformation) in the seven tested NMR datasets.

I am glad to see that DSSR is mentioned in the Section titled Analysis of Known Structural Motifs:

… The program DSSR (Lu et al., 2015) (part of the 3DNA software suite [Lu and Olson, 2003, 2008]) was used to evaluate the stacking configuration of successive base pairs (i.e., ‘‘steps’’) within the helical stems of the systems in the present calculations. The most interesting trends are observed for the base-pair step parameters slide (Figure 4K) and rise (Figure 4L), which respectively measure an in-plane dislocation and the vertical displacement of a step relative to a local mid-step reference frame (Lu and Olson, 2003; for analysis of all step parameters, see Figure S1). Relative to A-form parameters in high-resolution X-ray structures (Olson et al., 2001) (Figures 4K and 4L, dashed lines), the average slide of all but one of the original NMR models (PDB: 1O15) is small in absolute value (Figure 4K). … Moreover, four out of the seven original PDB models display an average rise considerably larger than the expected 3.32 Å (the van der Waals separation distance between bases, not to be confused with the helical rise, measured relative to the helical axis, expected to be 2.83 Å for A-form [Olson et al., 2001]).

As an example, the single stem of PDB: 2KOC’s representative structure, assumed to be an A-form helix (Nozinovic et al., 2010), displays a particularly large separation between base pairs C3–G12 and A4–U11 (rise: 4.33 Å) that is visually evident when compared with that of the RNA-ff1 representative model (rise: 3.33 Å ) (Figure 6A). Indeed, this base-pair step defies conformational classification by DSSR in the PDB: 2KOC structure, while it is assigned as A-form (along with the rest of the stem) in the RNA-ff1 structure.

Throughout the text, the term “stem” or “helical stem” is used consistently, in line with the nomenclature adopted by DSSR. It is worth noting that DSSR also derives a complete set of backbone conformational parameters, including the assignment of sugar-phosphate backbone suites. The backbone parameters constitute only a small portion of what DSSR has to offer, and they are written to the auxiliary file dssr-torsions.txt by default.



The DSSR --block-color option

Upon user requests, I’ve recently introduced the --block-color option to DSSR, available as of v1.5.2-2016apr02. As its name implies, the --block-color option facilitates user customization of the PyMOL-rendered colors of the rectangular base blocks or their edges (e.g., the minor-groove edge) directly from the command line. A simple example goes like this: --block-color='A blue; T red', which colors A blue and T red. As detailed below, the new option is very flexible with regard to the specification of colors, bases, or edges to highlight. Before that, a little background is in order.

Background info

The DSSR cartoon-block representation follows the color convention of the original 3DNA blocview script, where A is red; C is yellow; G is green; T is blue; and U is cyan. If I remember correctly, the blocview coloring was based on the scheme adopted by the Nucleic Acid Database (NDB). To allow for some flexibility, 3DNA includes a config file named $X3DNA/config/raster3d.par where users can change the RGB values of the corresponding bases. However, I do not know if any user has ever bothered to play around with the configuration file for customized base colors.

Over the years, blocview-generated images have become popular, due to their simplicity and (maybe more importantly) their endorsement by the NDB and PDB for nucleic acid structures. Via NDB, the blocview-generated images have also been used in RNA FRABASE 2.0 and RNA Structure Atlas. Nevertheless, the blocview script has several dependencies: MolScript for protein or DNA/RNA backbone ribbons, render from Raster3D for rendering, and ImageMagick for image processing. Moreover, the blocview script used by NDB/PDB is (likely to be) based on 3DNA v1.5, the last version before I left Rutgers in 2002.

Over the years, 3DNA has been continuously refined, with significant changes introduced in v2.0 around 2008 to accompany the Nature Protocols paper. Currently at v2.3, the codebase for 3DNA version 2 is in maintenance mode: the software will still be supported and identified bugs fixed, but no new features are planned. 3DNA version 3, as represented by DSSR and SNAP, is the way to go.

DSSR has no third-party dependencies

While creating DSSR, I made it a design goal that the program be fully self-contained, without any third-party dependencies. Connections to other tools are clearly delineated via text files. If anything goes wrong, one can easily identify where the problem is. Experience over the past few years has unambiguously proved the effectiveness of this zero-dependency approach. Short of being distributed with the operating system itself, DSSR is about as easy to get up and running as software can be. Moreover, DSSR can be easily integrated into other pipelines, including Jmol and PyMOL, among many other bioinformatics tools.

For the cartoon-block representation, DSSR produces .r3d files that can be loaded into PyMOL, mixed and matched with other visualization styles PyMOL has to offer. No more direct dependencies on MolScript, Raster3D, and ImageMagick as is the case for blocview. It is also worth mentioning that DSSR does not need PyMOL to run. DSSR and PyMOL are connected via .r3d files, a process which can be streamlined with the Dssr_block PyMOL plugin.

DSSR releases before v1.5.2-2016apr02 have the color coding of base blocks fixed within the source code, following the default style of blocview. Over the past few months, I’ve received at least two explicit requests on customizing the default colors of DSSR-generated base blocks. The --block-color option has been introduced for this purpose.

Details of the --block-color option

The general format of the option is as follows:

--block-color='id color [; id2 color2 ...]'
  • id can be A, C, G, T, U, or a degenerate IUPAC code, including R, Y, N etc. See IUPAC nucleotide code for details.
  • id can also be minor, major, upper, bottom, wc-edge to specify one of the six faces of a 3D rectangular block. See Fig.1D of the DSSR paper for details.

Fig.1D (DSSR 2015 NAR paper)

  • id can further be GC, AT, GU, pair, and variants thereof, to specify the colors of the corresponding long base-pair rectangular blocks.
  • color can be a common name (144 total), as specified in the RGB Color website. For example, red, magenta, light gray etc.
  • color can also be a single number in the range [0, 1] or [0, 255] to specify a shade of gray. DSSR repeats the number to form an RGB triple of three identical values.
  • color can further be a set of three space-delimited numbers to specify the RGB triple. Again, the number can be in [0, 1] or [0, 255]. Moreover, the three numbers can be put in square brackets. For example --block-color='A 0 1 1' and --block-color='A [0 1 1]' specify adenine to be colored with RGB triple [0 1 1] (aqua/cyan, corresponding to --block-color='A cyan').
  • More than one identity (bases) can be specified, separated by ; (,, :, or | also works). Note: within the PyMOL dssr_block plugin, only | or : can be used as a separator: comma (,) or semicolon (;) cannot be used as a separator within a PyMOL command argument (thanks to Thomas Holder for drawing this point to my attention).
  • Case does not matter when specifying id or color. So either ‘A’ or ‘a’, and ‘blue’ or ‘Blue’ or ‘BLUE’ can be used to make adenine blue: --block-color='a blue'.
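
The rules above can be sketched as a small parser. This is only an illustration in Python, not DSSR's actual implementation: the function names, the tiny color table (with approximate RGB values), and the rule that any component above 1 switches the whole specification to the 0-255 scale are assumptions made here.

```python
import re

# Tiny illustrative table with approximate values; not DSSR's full list of names.
NAMED_COLORS = {
    "red": (1.0, 0.0, 0.0),
    "blue": (0.0, 0.0, 1.0),
    "cyan": (0.0, 1.0, 1.0),
    "white": (1.0, 1.0, 1.0),
    "purple": (0.5, 0.0, 0.5),
    "dim gray": (0.41, 0.41, 0.41),
}


def normalize(numbers):
    """Scale numbers given in [0, 1] or [0, 255] onto [0, 1]."""
    if any(n < 0.0 or n > 255.0 for n in numbers):
        raise ValueError(f"color component out of range: {numbers}")
    # Assumption: a value above 1 means the 0-255 scale is being used.
    scale = 255.0 if max(numbers) > 1.0 else 1.0
    return tuple(n / scale for n in numbers)


def parse_color(text):
    """Turn a color name, one gray level, or an RGB triple into (r, g, b)."""
    tokens = text.replace("[", " ").replace("]", " ").split()
    try:
        numbers = [float(token) for token in tokens]
    except ValueError:
        name = " ".join(tokens).lower()  # case does not matter
        if name not in NAMED_COLORS:
            raise ValueError(f"unknown color name: {name!r}")
        return NAMED_COLORS[name]
    if len(numbers) == 1:
        return normalize(numbers * 3)  # single number: shade of gray
    if len(numbers) == 3:
        return normalize(numbers)
    raise ValueError(f"expected a name, 1 number, or 3 numbers: {text!r}")


def parse_block_color(spec):
    """Split 'id color; id2 color2' into a dict of lower-cased id -> RGB."""
    result = {}
    for item in re.split(r"[;,:|]", spec):
        parts = item.split(None, 1)
        if not parts:
            continue  # tolerate empty items such as a trailing separator
        color = parts[1] if len(parts) > 1 else ""
        result[parts[0].lower()] = parse_color(color)
    return result


print(parse_block_color("A blue; T red"))
print(parse_block_color("T purple | upper 1"))
```

Splitting on all four separators in one place also covers the PyMOL case, where only | or : can be used.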

Some example usages

While the above description may appear quite complicated, the actual usage of the --block-color option is very straightforward. As always, the case is best made with concrete examples, as shown below using the classic Dickerson B-DNA dodecamer 355d.

# all bases in blue
x3dna-dssr -i=355d.pdb --cartoon-block=orient --block-color='N blue' -o=355d-all-blue.pml
# all WC-pairs in red, with the minor-groove edge in 'dim gray'
x3dna-dssr -i=355d.pdb --cartoon-block=orient --block-color='wc-pair red; minor dim gray' -o=355d-pair-minor.pml
# thymine (T) in purple, and the upper (+z) face in white
# see Figure below, which shows the two bases in WC-pairs are anti-parallel
x3dna-dssr -i=355d.pdb --cartoon-block=orient --block-color='T purple; upper 1' -o=355d-T-upper.pml

T colored purple, +z (upper) faces white
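
When such commands are generated from a script, the quoting of the --block-color value is the usual stumbling block, since it contains spaces and semicolons. A minimal sketch, using only the options shown in the examples above (the helper name dssr_block_command is hypothetical), is to assemble the arguments as a list so that no shell quoting is needed:

```python
import shlex


def dssr_block_command(pdb_file, block_color, output_pml, exe="x3dna-dssr"):
    """Assemble the argument list for a cartoon-block run with custom colors."""
    return [
        exe,
        f"-i={pdb_file}",
        "--cartoon-block=orient",
        f"--block-color={block_color}",  # one argument, spaces and ';' included
        f"-o={output_pml}",
    ]


argv = dssr_block_command("355d.pdb", "T purple; upper 1", "355d-T-upper.pml")

# Show a shell-safe rendering of the command for copy and paste.
print(shlex.join(argv))

# To actually run it (assuming x3dna-dssr is on the PATH):
#   import subprocess
#   subprocess.run(argv, check=True)
```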



Cartoon-block representation of quadruplex-duplex interface

Recently I read the article titled Structural Insights into the Quadruplex−Duplex 3′ Interface Formed from a Telomeric Repeat: A Potential Molecular Target by Krauss et al. I quickly ran DSSR on the corresponding PDB entry 5dww. Not surprisingly, DSSR can automatically identify the reported key structural features (see output file 5dww.out for details), including the TAT triplet at the quadruplex−duplex junction, and the three G-quartets. Note that the result is based on biological assembly 1 in PDB file 5dww.pdb1 since the asymmetric unit contains four such molecules.

List of 4 multiplets
   1 nts=3 TAT 1:A.DT17,1:A.DA19,1:B.DT7
   2 nts=4 GGGG 1:A.DG1,1:A.DG5,1:A.DG9,1:A.DG14
   3 nts=4 GGGG 1:A.DG2,1:A.DG6,1:A.DG10,1:A.DG15
   4 nts=4 GGGG 1:A.DG3,1:A.DG7,1:A.DG11,1:A.DG16

As its title suggests, however, this blog post is about the cartoon-block representations. Four styles of such schematics are shown below, which can all be easily generated using DSSR/PyMOL.

Cartoon-block representations of 5dww in four styles: default style; with base-pair blocks; minor-groove highlighted; top-face highlighted.

The cartoon-block representations possess unique features not seen elsewhere. With the help of the dssr_block plugin in PyMOL, they are extremely easy to generate. Such schematics are likely to become popular in illustrations of nucleic acid structures.



Integrating DSSR into Jmol and PyMOL

Over the past couple of years, one of the most significant achievements of DSSR has been its integration into Jmol and PyMOL, two widely used molecular graphics programs. Neither project had been ‘planned’, and I am honored to have had the opportunity to collaborate directly with Bob Hanson (Jmol) and Thomas Holder (PyMOL). The integrations make salient features of DSSR readily accessible to the Jmol and PyMOL user communities. Moreover, Jmol and PyMOL take different approaches to interoperating with DSSR, and so far they have employed separate features that the program has to offer.

Key features of DSSR

DSSR was implemented in strict ANSI C as a self-contained command-line program. The binaries for common operating systems (Mac OS X, Linux and Windows) are tiny (<1MB), and without runtime dependencies on third-party libraries. DSSR also comes with an extensive PDF user manual.

Since its initial release in early 2013, DSSR has been continuously refined/expanded based on user feedback and my improved knowledge of RNA structures. User questions are always promptly addressed on the public 3DNA Forum. Over the years, DSSR has gradually established itself as an accountable software product.

The small size, zero configuration, extensive features, and robust performance make DSSR ideal to be integrated into other bioinformatics tools.

DSSR and Jmol

From the very beginning, Jmol has been employing a web service at Columbia University, where all DSSR analyses take place. In addition to the sample DSSR-Jmol web interface, DSSR is also directly accessible from the console (see Fig. 1 below). Jmol includes a sophisticated SQL-like syntax to drill down into the various DSSR-derived structural features. Search for ‘DSSR’ in the Jmol/JSmol interactive scripting documentation for details.

Fig. 1 DSSR is available from the Jmol/JSmol console via scripting.

The initial version of the integration (Jmol v14.2) was facilitated by the DSSR --jmol option to produce a Jmol-specific (e.g., residue id [C]2658:A) plain text output. However, ad hoc text files are a rigid and fragile way for programs to communicate. As DSSR evolved, changes to existing features or newly added functionality were known to break the established DSSR-Jmol interface. Having to write extra code to maintain the same old --jmol output did not feel right.

JSON (JavaScript Object Notation) came to the rescue! The current DSSR-Jmol integration (Jmol v14.4) takes advantage of JSON, a standard, lightweight data-interchange format. Since JSON is structured, parsing its contents is straightforward. DSSR and Jmol can evolve independently, as always, but they no longer need to worry about stepping on each other’s toes.
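
To illustrate why structured output is easier to consume than ad hoc text, here is a minimal Python sketch that filters a JSON document in the spirit of the Jmol queries. The sample data is hand-written for this example and is not real DSSR output: only the field names name and num_stems are taken from the Jmol query examples, while nt1, nt2, index, and all values are hypothetical.

```python
import json

# Hand-written stand-in for a (much larger) JSON file produced with --json.
sample = """
{
  "pairs": [
    {"nt1": "A.G1", "nt2": "A.C72", "name": "WC"},
    {"nt1": "A.U8", "nt2": "A.A14", "name": "Hoogsteen"},
    {"nt1": "A.G4", "nt2": "A.U69", "name": "Wobble"}
  ],
  "junctions": [
    {"index": 1, "num_stems": 4}
  ]
}
"""


def select(records, **conditions):
    """Keep the records whose fields equal all the given values."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in conditions.items())
    ]


data = json.loads(sample)

hoogsteen = select(data["pairs"], name="Hoogsteen")         # name = 'Hoogsteen'
non_wc = [p for p in data["pairs"] if p["name"] != "WC"]    # name != 'WC'
four_way = select(data["junctions"], num_stems=4)           # num_stems = 4

print(len(hoogsteen), len(non_wc), len(four_way))
```

Because each record is addressed by key, new fields can be added to the output without breaking a consumer written this way.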

Overall, Jmol has incorporated the most fundamental analysis features of DSSR. The Jmol SQL mini-language is very powerful for selecting arbitrary DSSR parameters. Background information about this collaboration can be found in the blog post Jmol and DSSR.

DSSR and PyMOL


So far, the DSSR-PyMOL integration has focused on visualization, i.e., the cartoon-block schematic representations of DNA/RNA structures. Moreover, instead of relying on a remote DSSR web-service as for Jmol, the PyMOL dssr_block command calls a locally installed DSSR executable for the job. As illustrated in the blogpost DSSR base blocks in PyMOL, interactively, the ‘dssr_block’ command makes it trivial to incorporate the highly effective rectangular blocks into PyMOL.

From early on, 3DNA has included the blocview script (first written in Perl, later converted to Ruby) to generate schematic images in the ‘best view’, by combining block representations of bases with backbone ribbons of proteins or nucleic acids. The script is essentially glue code, calling MolScript, Raster3D, ImageMagick, and several 3DNA utility programs to perform various tasks. With these dependencies, it’s a bit involved to set up blocview. Nevertheless, the resultant images are simple and revealing, and are still being used by NDB and RCSB PDB (among others) today.

DSSR does not depend on MolScript, Raster3D, or any other program to generate .r3d output of rectangular blocks. The schematic blocks can be directly fed into PyMOL, combined with other representations, and ray-traced for high-resolution images. The integration of DSSR into PyMOL by the dssr_block command is likely to prompt an even wider adoption of the cartoon-block representation. In this regard, it is well worth noting the news item “dssr_block is a wrapper for DSSR (3dna) and creates block-shaped nucleic acid cartoons” on the main page of PyMOLWiki (see Fig. 2 below). It will certainly bring this neat feature to the attention of many PyMOL users.

dssr_block news item on PyMOLWiki
Fig. 2 Screenshot of the PyMOLWiki main page (2016-01-27) with ‘dssr_block’ in the news. A sample cartoon-block image of 355d is inserted as an example.

Integration of DSSR analysis results into PyMOL is underway, using the same JSON output. Before long, PyMOL users should be able to have access to the numerous DNA/RNA structural features derived by DSSR as in Jmol, along with the cartoon-block images enabled by dssr_block. Background information about DSSR-PyMOL can be found in blog post Open invitation on writing a DSSR plugin for PyMOL.


  • The DSSR-Jmol and DSSR-PyMOL integrations are two salient examples of what can be achieved via direct collaboration of dedicated scientists with complementary expertise. In addition to benefiting the involved projects in particular and the (structural biology) community at large, such collaborations make technical and scientific advances more likely.
  • Both projects are still ongoing, with continued refinements of existing functionality and additions of new features. As an example, it is desirable and likely that Jmol would allow local access to DSSR for efficiency and data privacy.
  • JSON is the way to go for connecting DSSR to the outside world. Period. The obsolete --jmol option will be removed from the next release of DSSR (v1.5). The default plain text output is useful for easy comprehension and will still be maintained. But do not count on its exact format for computer parsing: occasional changes to existing items are likely, and new features are bound to be added.
  • If you’d like to incorporate DSSR into your pipeline and need some customizations of its output, please let me know. It’s always easier to set things right at the source than to fix them downstream. Where practical, I’ll try to implement your requested features, quickly. Working together, we can and will build a better world.



Characterization of base-pair geometry

This post is a recap of the recently introduced ‘simple’ base-pair (bp) parameters (Fig. 1) useful for describing non-Watson-Crick pairs, and the highly effective cartoon-block representations of nucleic acid structures. Both features are readily available from 3DNA/DSSR, as detailed here using four examples of representative DNA/RNA structures (Fig. 2). Links to related blog posts are provided at the end.

Note added on Feb. 2, 2016: in fact, this post had been intended to supplement a short communication titled Characterization of base-pair geometry that Dr. Wilma Olson and I recently contributed to the January 2016 issue of Computational Crystallography Newsletter (CCN). That’s why the URL of this post is ‘’ instead of what one would expect from the title. The data files, scripts, and images linked herein should give interested users a thorough understanding of the ‘simple’ base-pair parameters. If you have problems reproducing our reported results, please do not hesitate to let me know (publicly). You are welcome to either leave comments on this post or ask any related questions on the 3DNA Forum.

Six rigid-body parameters

Schematic diagram of six rigid-body base-pair parameters

Fig. 1: Schematic diagrams of the six rigid-body parameters commonly used for the characterization of base-pair geometry.

Cartoon-block representations

Cartoon-block representations

Fig. 2: DSSR-introduced cartoon-block representations of DNA and RNA structures that combine PyMOL cartoon schematics with color-coded rectangular base blocks: A, red; C, yellow; G, green; T, blue; and U, cyan. (A) The Dickerson B-DNA dodecamer solved at 1.4-Å resolution [PDB id: 355d (Shui et al., 1998)], with significant negative Propeller. (B) The Z-DNA dodecamer [PDB id: 4ocb (Luo et al., 2014)], with virtually co-planar C–G pairs at the ends, and noticeable Buckle in the middle. (C) The GUAA tetraloop mutant of the sarcin/ricin domain from E. coli 23S rRNA [PDB id: 1msy (Correll et al., 2003)], with large Buckle in the A+C pair, and base-stacking interactions of UAA in the GUAA tetraloop (upper-right corner). (D) The parallel double-stranded poly(A) RNA helix [PDB id: 4jrd (Safaee et al., 2013)], with up to +14° Propeller. The simple, informative cartoon-block representations facilitate understanding of the base interactions in small to mid-sized nucleic acid structures like these. The base identity, pairing geometry, and stacking interactions are obvious.

Scripts and data files (Lu-CCN-examples.tar.gz)

find_pair 355d.pdb | analyze   # 355d.out
x3dna-dssr -i=355d.pdb -more -o=355d-dssr.out
x3dna-dssr -i=355d.pdb --cartoon-block -o=355d.pml

find_pair 4jrd.pdb | analyze   # 4jrd.out
x3dna-dssr -i=4jrd.pdb -more -o=4jrd-dssr.out
x3dna-dssr -i=4jrd.pdb --cartoon-block -o=4jrd.pml

find_pair 1msy.pdb | analyze   # 1msy.out
x3dna-dssr -i=1msy.pdb -more -o=1msy-dssr.out
x3dna-dssr -i=1msy.pdb --cartoon-block -o=1msy.pml

find_pair --symm 4ocb.pdb1 | analyze --symm  # 4ocb.out
x3dna-dssr -i=4ocb.pdb1 --symm -more -o=4ocb-dssr.out
x3dna-dssr -i=4ocb.pdb1 --symm --cartoon-block -o=4ocb.pml

Please note the following points:

  • The above examples are based on 3DNA v2.3-2016jan20 and DSSR v1.4.8-2016jan16.
  • All data files (including PyMOL ray-traced PNG images used in Fig. 2) are packed into a tarball named Lu-CCN-examples.tar.gz for download.
  • For PDB entry 4ocb, the biological unit (with suffix .pdb1) is used to get a complete duplex structure. The --symm option must be specified.
  • PDB files are used in the above illustration. In fact, the corresponding mmCIF files (.cif) also work just fine.
  • The DSSR-derived .pml files can be fed into PyMOL for rendering. In addition to the directly generated *.pml files (e.g., 355d.pml), the PyMOL transformed version (i.e., orient; turn z, -90) are also included, with names *-orient.pml (e.g., 355d-orient.pml). The PNG images (as shown in Fig. 2) are ray-traced using these reoriented pml files for the most extended vertical view.
  • The ‘simple’ base-pair parameters for 4jrd are shown below.
This structure contains 10 non-Watson-Crick (with leading *) base pair(s)
Simple base-pair parameters based on RC8--YC6 vectors
      bp        Shear    Stretch   Stagger    Buckle  Propeller  Opening  angle
*    1 A+A      -7.96      0.41     -0.03    -13.64     -4.06   -179.47   14.2
*    2 A+A      -7.86      0.38     -0.33    -10.20     -3.53   -179.34   10.8
*    3 A+A      -7.96      0.43      0.02    -10.15      5.23    179.91   11.4
*    4 A+A      -7.95      0.50      0.10     -9.24      8.04    179.15   12.2
*    5 A+A      -7.95      0.46      0.08     -7.36     10.12   -179.98   12.5
*    6 A+A      -7.97      0.60      0.06     -5.15     12.87   -176.75   13.9
*    7 A+A      -7.88      0.66     -0.02     -7.82     11.89   -179.55   14.2
*    8 A+A      -7.91      0.56     -0.05     -7.03     13.68    179.22   15.4
*    9 A+A      -7.94      0.47     -0.03     -3.78     13.76   -179.24   14.3
*   10 A+A      -7.92      0.42      0.10     -3.03      4.34   -178.91    5.3
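
A listing like this is easy to summarize with a few lines of code. The Python sketch below parses the ten rows above and reports the mean Shear and the largest Propeller; the function name and the column handling are assumptions based solely on the layout shown here, not a documented format.

```python
COLUMNS = ("Shear", "Stretch", "Stagger", "Buckle", "Propeller", "Opening", "angle")

# Rows copied verbatim from the listing above.
listing = """\
*    1 A+A      -7.96      0.41     -0.03    -13.64     -4.06   -179.47   14.2
*    2 A+A      -7.86      0.38     -0.33    -10.20     -3.53   -179.34   10.8
*    3 A+A      -7.96      0.43      0.02    -10.15      5.23    179.91   11.4
*    4 A+A      -7.95      0.50      0.10     -9.24      8.04    179.15   12.2
*    5 A+A      -7.95      0.46      0.08     -7.36     10.12   -179.98   12.5
*    6 A+A      -7.97      0.60      0.06     -5.15     12.87   -176.75   13.9
*    7 A+A      -7.88      0.66     -0.02     -7.82     11.89   -179.55   14.2
*    8 A+A      -7.91      0.56     -0.05     -7.03     13.68    179.22   15.4
*    9 A+A      -7.94      0.47     -0.03     -3.78     13.76   -179.24   14.3
*   10 A+A      -7.92      0.42      0.10     -3.03      4.34   -178.91    5.3
"""


def parse_simple_params(text):
    """Parse rows of the form '[*] index bp  seven numbers' into dicts."""
    rows = []
    for line in text.splitlines():
        # Drop the leading '*' that marks a non-Watson-Crick pair.
        fields = line.lstrip("* ").split()
        if len(fields) != 2 + len(COLUMNS) or not fields[0].isdigit():
            continue  # skip headers and anything that is not a data row
        values = dict(zip(COLUMNS, map(float, fields[2:])))
        rows.append({"index": int(fields[0]), "bp": fields[1], **values})
    return rows


rows = parse_simple_params(listing)
mean_shear = sum(row["Shear"] for row in rows) / len(rows)
max_propeller = max(row["Propeller"] for row in rows)
print(f"{len(rows)} pairs, mean Shear {mean_shear:.2f}, max Propeller {max_propeller:.2f}")
```

The largest Propeller value found this way (13.76°) matches the “up to +14° Propeller” noted for 4jrd in the caption of Fig. 2.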

Related posts



DSSR base blocks in PyMOL, interactively

In early 2015, Thomas Holder (the PyMOL Principal Developer at Schrödinger) and I agreed to work together on connecting DSSR to PyMOL. Moreover, we called for the community’s involvement in writing a DSSR plugin for PyMOL and received a few enthusiastic replies. Over the past few months, significant progress has been made in DSSR, including an article titled DSSR: an integrated software tool for dissecting the spatial structure of RNA published in Nucleic Acids Research (NAR) and a more streamlined DSSR-Jmol integration based on the --json output.

From the very beginning, Thomas and I had envisioned that the DSSR-PyMOL integration would include two components: one is to bring DSSR-derived RNA/DNA structural features into PyMOL (similar to the DSSR-Jmol interface, functionality-wise), and the other is to render DSSR’s simple yet informative base-rectangular representations with PyMOL. While the ‘analysis’ component is a work in progress, the ‘visualization’ part is ready for the community to take advantage of.

Thomas has written a Python script named When the script is run in PyMOL, it adds the “dssr_block” command. The script is less than 100 lines including documentation, with the real code taking no more than half of the total line number. The detailed documentation section (with two examples), when condensed, is as follows:

    Create a nucleic acid cartoon with DSSR
    dssr_block [selection [, state [, block_file [, block_depth [, name [, exe]]]]]]
    selection = str: atom selection {default: all}
    state = int: object state (0 for all states) {default: -1, current state}
    block_file = face|edge|wc|equal|minor|gray {default: face}
    block_depth = float: thickness of rectangular blocks {default: 0.5}
    name = str: name of new CGO object {default: dssr_block##}
    exe = str: path to "x3dna-dssr" executable {default: x3dna-dssr}
    fetch 1ehz, async=0
    as cartoon
    set cartoon_ladder_radius, 0.1
    set cartoon_ladder_color, gray
    set cartoon_nucleic_acid_mode, 1
    # multi-state
    fetch 2n2d, async=0
    dssr_block 2n2d, 0
    set all_states

Download the script into a folder (directory) of your choice. Within PyMOL command window, type:

run  # to make the 'dssr_block' command available
help dssr_block    # to get the help message, with contents shown above

Running the documented commands for 1ehz (plus an additional orient command for the best view) produces the cartoon-block image shown in Fig. 1 below.

DSSR block image for tRNA (1ehz)
Fig. 1: Cartoon-block image generated by for PDB entry 1ehz (yeast phenylalanine tRNA)

For the NMR ensemble 2n2d, the corresponding image (after running orient) is illustrated in Fig. 2 as follows:

DSSR block image for an NMR ensemble (2n2d)
Fig. 2: Cartoon-block image generated by for PDB entry 2n2d (an NMR ensemble).

In addition to the default settings, DSSR offers quite a few variations for the size and coloring of rectangular blocks, as demonstrated in Fig. 3. The main settings are controlled through the block_file option in PyMOL (note the underscore), corresponding to DSSR --block-file (or --block_file). The corresponding PyMOL commands are also listed for your reference. You can easily play around with the various styles interactively in PyMOL by toggling objects (dssr_block##) on or off. Enjoy!

DSSR block image for the Dickerson B-DNA dodecamer (355d)
Fig. 3: Cartoon-block image generated by for PDB entry 355d (the Dickerson B-DNA dodecamer).

Fig. 3 is created with the following PyMOL commands:

fetch 355d, async=0
bg_color white

as cartoon
turn z, -90
turn y, 180

set cartoon_ladder_mode, 1
set cartoon_ladder_radius, 0.1
set cartoon_ladder_color, black

set cartoon_tube_radius, 0.5
set cartoon_nucleic_acid_mode, 1
set cartoon_color, gold

dssr_block 355d                  # default base blocks in solid color
dssr_block block_file=edge       # rectangular blocks in wireframe (black)
dssr_block block_file=face+edge  # solid color with outline
dssr_block block_file=equal      # base blocks in equal size
dssr_block block_file=minor      # with minor-groove edge colored black
dssr_block block_file=wc         # Watson-Crick base pairs in long bp blocks
dssr_block block_file=wc-minor   # Watson-Crick pairs + minor-groove edge
dssr_block block_file=gray       # rectangular blocks all in gray
dssr_block block_depth=1.8       # with increased thickness


  • The script described here is the original version Thomas communicated to me. The current version of this script and related topics can be found on the Dssr block PyMOLWiki page.
  • For this script to work, DSSR needs to be installed and x3dna-dssr must be in the PATH.
  • In PyMOL, set cartoon_nucleic_acid_mode, 1 employs C3′ instead of the default P (‘mode 0’) for the smooth backbone trace. Since 5′ terminal phosphate groups are normally not available from X-ray crystal structures (e.g., 355d), ‘mode 1’ is used to avoid orphan base blocks from the backbone trace.
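The PATH requirement noted above can be checked before running the script. The following is a minimal sketch using only the Python standard library; the helper name find_dssr is hypothetical and is not part of Thomas's script.

```python
import shutil

def find_dssr(exe="x3dna-dssr"):
    """Return the full path of the DSSR executable, or None if it is not on PATH.

    'x3dna-dssr' is the default executable name documented for dssr_block;
    find_dssr itself is a hypothetical helper for illustration only.
    """
    return shutil.which(exe)

path = find_dssr()
if path is None:
    print("x3dna-dssr was not found on PATH; dssr_block cannot call DSSR")
else:
    print("DSSR executable:", path)
```

If the executable lives elsewhere, its full path can be passed through the exe option of dssr_block instead.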



3DNA Forum is spam free

As of today (2016-01-16), the number of registrations on the 3DNA Forum has reached 2,562. Moreover, all the members (as far as I can tell) are legitimate since the Forum has remained spam free. From the very beginning, ensuring a high information-to-noise ratio has been a top priority. The goal has been achieved by taking the following measures:

State the rules clearly in the “Registration Agreement”

This forum is dedicated to topics generally related to the 3DNA suite of software programs for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures. To make the 3DNA forum a more pleasant virtual community for all of us to learn from and contribute to, please be considerate and practice good netiquette (

I strive to make the forum spam free. Specifically, posts that are not 3DNA related in the broad sense are taken as spams, and are strictly forbidden. You are solely responsible for the content of your posts. We reserve the right to remove any post deemed as inappropriate, deactivate the account and ban the IP address of any abuser of the forum, WITHOUT NOTICE.

When posting on the Forum, please abide by the following rules: …

In a nutshell, you are welcome to participate and should not hesitate to ask questions, but remember to play nice and preferably share what you’ve learned! Please note that we do not tolerate spamming or off-topic trolling of any form.

Take advantage of anti-spam software

In addition to verifying email addresses and checking for blacklisted IP addresses, the topic-specific questions have been very effective. Three examples of such questions are shown below:

What does the 'A' in 3DNA stand for? (hint: 4-char long)
How many standard bases does RNA have (hint: 1-digit number)
What is the value of the expression (3.1498 * 0 + 168)?

Overall, I do not like CAPTCHA; I’ve found the highly distorted images on some websites especially troublesome. For the first few years (to ~2014), the 3DNA Forum did not contain a captcha image on the registration page. Later on, however, I noticed quite a few spam registrations/posts. In addition to quickly cleaning them up manually, I refined the topic-specific questions, and turned on the visual verification image at level “Medium - Overlapping colored letters, with noise/lines”. Experience over the past couple of years has demonstrated the effectiveness of the combined strategy. As shown in the screen capture below, as of this writing, 177,562 spammers have been blocked by the anti-spam software!

Summary of anti-spamming on the 3DNA Forum

Verify and approve ‘suspect’ accounts quickly

The above-mentioned anti-spamming measures have blocked virtually all the “bad guys”, so I do not need to waste time fighting them. I receive an email notification for each successful registration. The vast majority of registrants can then immediately access the member-only download section or post questions on the 3DNA Forum after registration. A significant portion (~1 out of 6) of the registrations, however, are flagged as suspicious and need my action. The email message for such cases reads like this:

‘xxxx’ has just signed up as a new member of your forum. Click the link below to view their profile. …
Before this member can begin posting they must first have their account approved. Click the link below to go to the approval screen. …

Wherever I have access to the Internet (including after hours with an iPad Air 2), I’ve always been quick in verifying and (mostly) approving these registrations.

Overall, since its creation in December 2011, the Forum has received significant attention in the field of DNA/RNA structural bioinformatics. As the community begins to appreciate and take full advantage of what DSSR and SNAP have to offer, I have no doubt the Forum will gain even wider recognition.



Ask reproducible questions, publicly

In recent years, reproducibility of ‘scientific’ publications has become quite a topic. See a recent essay Five selfish reasons to work reproducibly by Markowetz in Genome Biology (2015, 16:274). There are numerous reasons why reproducibility could become an issue at all in science. What I have continuously strived for in my scientific career, however, is to ensure that my published results are reproducible. As a concrete example, I created a dedicated section titled DSSR-NAR paper on the 3DNA Forum that provides full details (scripts and data files) so that any interested parties can rigorously reproduce the results reported in the DSSR Nucleic Acids Research (NAR) paper.

In my support of 3DNA for over a decade, the #1 issue I have experienced is undoubtedly vague (non-reproducible) questions. For example, I have recently been asked via email why the 3DNA find_pair/analyze programs miss “some basepair … even though it is in the pdb file”. Without access to the PDB file to reproduce the problem, however, I cannot provide a concrete answer. In an effort to prevent ambiguous questions, I made the following explicit point in the “Registration Agreement” of the 3DNA Forum (no. 2 on the list):

Be specific with your questions; provide a minimal, reproducible example if possible; use attachments where appropriate.

The #2 issue is receiving 3DNA-related questions privately instead of on the intended public 3DNA Forum. I turned off “personal messaging” on the Forum a long time ago, yet I keep receiving questions via email. In several locations on the 3DNA Forum, I have made this ‘public-question’ policy crystal clear:

Ask your questions in the public 3DNA forum instead of sending xiangjun emails or personal messages. (no. 1 on the ‘Registration Agreement’)

Please be aware that for the benefit of the 3DNA-user community at large, I do not provide private email/personal message support; the forum has been created specifically for open discussions of all 3DNA-related issues. In other words, any 3DNA-associated questions are welcome and should be directed here. Presumably I’ve made the message simple and clear enough to get across without further explanation. (in ‘Site announcements » Download instructions’ and ‘Downloads » 3DNA download’)

In response to the many 3DNA-related questions that still keep coming via email, I created the following entry of Canned Responses in gmail:

Thanks for your interest in using 3DNA. Please be aware that for the benefit of the 3DNA-user community at large, I do not provide private email support; the 3DNA Forum ( has been created specifically for open discussions of all 3DNA-related issues. In other words, any 3DNA-associated questions are welcome and should be directed there. I monitor the forum regularly and respond to posts promptly.

I look forward to seeing you on the 3DNA Forum (

Overall, I’ve learned from experience that addressing reproducible questions publicly serves the 3DNA community best. Users can register with a personal (free) email address, and post simulated data to illustrate the problem at hand. Moreover, questions on the Forum have always received quick responses. Over time, the Forum has served as an archive that everyone can benefit from.



Details on the simple base-pair parameters

With the foundation laid by the previous two posts on Fitting of base reference frame and Automatic identification of nucleotides, we can now get into the details on how the ‘simple’ base-pair (bp) parameters are derived. To make the point clear, I am using two concrete examples from the yeast phenylalanine tRNA (PDB id: 1ehz): the first pair is 2MG10+G45, of type M+N (shortened to g+G) in 3DNA/DSSR; and the second example is a Watson-Crick pair U6–A67, of type M–N (shortened to U–A).

Pair 2MG10+G45 (g+G, of type M+N, see Fig. 1)


Base reference frames

2MG10+G45 pair in tRNA (1ehz)
Fig. 1: Base pair 2MG10+G45 (g+G) of type M+N in yeast phenylalanine tRNA 1ehz

In the original coordinate system (as in 1ehz.pdb downloaded from the RCSB PDB), the base-reference frames for 2MG10 and G45 are:

# base reference frame of 2MG10
  "rsmd": 0.018218,
  "origin": [65.696016, 45.134944, 18.125044],  # o1
  "x_axis": [0.690346, 0.713907, -0.117302],    # x1
  "y_axis": [-0.706849, 0.700116, 0.101003],    # y1
  "z_axis": [0.154232, 0.013188, 0.987947]      # z1
# base reference frame of G45
  "rsmd": 0.025865,
  "origin": [70.584399, 50.526567, 17.229626],  # o2
  "x_axis": [0.818521, 0.49914, -0.284399],     # x2
  "y_axis": [-0.574112, 0.728382, -0.373973],   # y2
  "z_axis": [0.020486, 0.469381, 0.882758]      # z2

The base-pair reference frame

Since dot(z1, z2) = 0.88 (positive), this pair is of type M+N in 3DNA/DSSR. The ‘mean’ z-axis of the pair is the average of z1 and z2, which is z = [0.090069, 0.248769, 0.964366] (normalized). This is the z-axis of the bp frame, as in 3DNA/DSSR.

The ‘long’ axis employs RC8 (purines) and YC6 (pyrimidines) base atoms. Here 2MG10 and G45 are both purines, so the following two C8 atoms are used:

# C8 atoms of 2MG10 and G45 in 1ehz
HETATM  208  C8  2MG A  10      62.199  48.621  18.635  1.00 40.38           C
ATOM    987  C8    G A  45      67.772  54.149  15.386  1.00 40.45           C

The vector from C8 of G45 to C8 of 2MG10 is:

y0 = [62.199  48.621  18.635] - [67.772  54.149  15.386]
   = [-5.573  -5.528   3.249]

Normally, y0 and the z-axis are not orthogonal. Here they form an angle of ~81º. The component of y0 orthogonal to the z-axis, when normalized, is the y-axis:

y = [-0.676751, -0.695120, 0.242520]

The x-axis is defined by the right-hand rule:

x = [-0.730682, 0.674479, -0.105746]

Overall, the orthonormal x-, y- and z-axes of the pair defined thus far are:

x = [-0.730682, 0.674479, -0.105746]
y = [-0.676751, -0.695120, 0.242520]
z = [0.090069, 0.248769, 0.964366]
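The construction of the bp frame described above can be reproduced with a few lines of vector algebra. The sketch below is plain Python written for illustration (it is not the 3DNA/DSSR source code); the helper names dot, cross and unit are assumptions, and the input numbers are the z-axes and C8 coordinates listed above.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def unit(v):
    n = math.sqrt(dot(v, v))
    return [c / n for c in v]

# z-axes of the base frames of 2MG10 (z1) and G45 (z2)
z1 = [0.154232, 0.013188, 0.987947]
z2 = [0.020486, 0.469381, 0.882758]
# C8 atoms of 2MG10 and G45 in 1ehz
c8_2mg10 = [62.199, 48.621, 18.635]
c8_g45 = [67.772, 54.149, 15.386]

# z-axis: normalized average of z1 and z2 (M+N type, dot(z1, z2) > 0)
z = unit([p + q for p, q in zip(z1, z2)])

# y0: vector from C8 of G45 to C8 of 2MG10
y0 = [p - q for p, q in zip(c8_2mg10, c8_g45)]
angle_y0_z = math.degrees(math.acos(dot(y0, z) / math.sqrt(dot(y0, y0))))

# y-axis: component of y0 orthogonal to z, normalized
proj = dot(y0, z)
y = unit([p - proj * q for p, q in zip(y0, z)])

# x-axis: completes the right-handed frame
x = cross(y, z)

for name, axis in (("x", x), ("y", y), ("z", z)):
    print(name, "=", [round(c, 6) for c in axis])
print("angle between y0 and z: %.1f degrees" % angle_y0_z)
```

The printed axes can be compared directly with the x, y and z vectors listed above.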

Derivation of the six ‘simple’ base-pair parameters (Fig. 2)

Schematic diagram of six rigid-body base-pair parameters
Fig. 2: Schematic diagram of six rigid-body base-pair parameters

Propeller is the ‘torsion’ angle of z2 to z1 with reference to the y-axis, and is calculated using the method detailed in the blog post How to calculate torsion angle?. Here Propeller is: -24.24º. Similarly, Buckle is defined as the ‘torsion’ angle of z2 to z1 with reference to the x-axis, and is -14.81º. Opening is defined as the angle from y2 to y1 with reference to the z-axis, and is: 13.32º.
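As an illustration of the three angular definitions, the sketch below computes a signed angle between two vectors about a reference axis, which is one way to express the ‘torsion’ described here. It is a plain-Python sketch, not the 3DNA/DSSR implementation; the function name signed_angle is an assumption, and the inputs are the base-frame and bp-frame axes listed earlier for 2MG10+G45.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def signed_angle(a, b, axis):
    """Angle in degrees from a to b, measured about the unit vector `axis`.

    Both vectors are first projected onto the plane perpendicular to `axis`;
    the sign follows the right-hand rule about `axis`.
    """
    ap = [p - dot(a, axis) * q for p, q in zip(a, axis)]
    bp = [p - dot(b, axis) * q for p, q in zip(b, axis)]
    return math.degrees(math.atan2(dot(cross(ap, bp), axis), dot(ap, bp)))

# Base-frame axes of 2MG10 (index 1) and G45 (index 2)
y1 = [-0.706849, 0.700116, 0.101003]
z1 = [0.154232, 0.013188, 0.987947]
y2 = [-0.574112, 0.728382, -0.373973]
z2 = [0.020486, 0.469381, 0.882758]
# Axes of the bp frame
x = [-0.730682, 0.674479, -0.105746]
y = [-0.676751, -0.695120, 0.242520]
z = [0.090069, 0.248769, 0.964366]

propeller = signed_angle(z2, z1, y)  # z2 to z1, about the y-axis
buckle = signed_angle(z2, z1, x)     # z2 to z1, about the x-axis
opening = signed_angle(y2, y1, z)    # y2 to y1, about the z-axis

print("Propeller = %.2f, Buckle = %.2f, Opening = %.2f" % (propeller, buckle, opening))
```

The three values can be checked against the Propeller, Buckle and Opening quoted in the paragraph above.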

The corresponding translational parameters are simply the projections of the o2 to o1 vector onto the x-, y- and z-axes, respectively. Here, they have values:

d = o1 - o2 = [-4.888383, -5.391623, 0.895418]
Shear = dot(d, x) = -0.16
Stretch = dot(d, y) = 7.27
Stagger = dot(d, z) = -0.92
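The same three projections, written out as a short Python sketch for readers who want to verify the numbers. The origins and bp-frame axes are those listed above; the variable names are chosen here for illustration only.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Origins of the base frames of 2MG10 (o1) and G45 (o2)
o1 = [65.696016, 45.134944, 18.125044]
o2 = [70.584399, 50.526567, 17.229626]
# Axes of the bp frame
x = [-0.730682, 0.674479, -0.105746]
y = [-0.676751, -0.695120, 0.242520]
z = [0.090069, 0.248769, 0.964366]

# Vector from o2 to o1, then its projections onto the bp axes
d = [p - q for p, q in zip(o1, o2)]
shear = dot(d, x)
stretch = dot(d, y)
stagger = dot(d, z)

print("Shear = %.2f, Stretch = %.2f, Stagger = %.2f" % (shear, stretch, stagger))
```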

‘Corrections’ of Buckle and Propeller

Base-pair non-planarity is due to the following three parameters: Buckle, Propeller, and Stagger. In particular, Buckle and Propeller cause the two bases to be non-parallel, the most noticeable characteristic of a pair. These two angular parameters are well-documented in the literature, even among the canonical Watson-Crick base pairs. In 3DNA/DSSR, the angle between the two base normal vectors (in range [0, 90º]) is related to Buckle and Propeller with the formula:

interBase-angle = sqrt(Buckle^2+Propeller^2)

For the 2MG10+G45 pair, the angle between z1 and z2 is 28.18º, and sqrt(Buckle^2+Propeller^2) = 28.405º. So the following ‘corrections’ are made:

Buckle = -14.81 * 28.18 / 28.405 = -14.69
Propeller = -24.24 * 28.18 / 28.405 = -24.05
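The rescaling can be sketched in Python as follows. This is an illustrative sketch rather than the 3DNA/DSSR code: the variable names are assumptions, and the inputs are z1 and z2 from the base frames plus the uncorrected Buckle and Propeller quoted above.

```python
import math

z1 = [0.154232, 0.013188, 0.987947]  # base normal of 2MG10
z2 = [0.020486, 0.469381, 0.882758]  # base normal of G45
buckle, propeller = -14.81, -24.24   # uncorrected values, in degrees

# Angle between the two base normals, folded into the range [0, 90]
cos_angle = sum(p * q for p, q in zip(z1, z2))
inter_base = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
if inter_base > 90.0:
    inter_base = 180.0 - inter_base

# Scale both angles so that sqrt(Buckle^2 + Propeller^2) equals the inter-base angle
# (assumes Buckle and Propeller are not both zero)
scale = inter_base / math.hypot(buckle, propeller)
buckle_c = buckle * scale
propeller_c = propeller * scale

print("inter-base angle = %.2f" % inter_base)
print("Buckle = %.2f, Propeller = %.2f" % (buckle_c, propeller_c))
```

After scaling, the two corrected angles satisfy the interBase-angle relation exactly, which is the point of the ‘correction’.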

Overall, the ‘corrections’ have only a small influence on the numerical values of the reported Buckle and Propeller parameters. It is ‘sensible’ that the ‘simple’ parameters have the property interBase-angle = sqrt(Buckle^2+Propeller^2), just as the original 3DNA/DSSR bp parameters.

Now, the six ‘simple’ bp parameters for 2MG10+G45, as reported by the 3DNA analyze program (as of v2.3-2016jan01), are:

Simple base-pair parameters based on YC6-RC8 vectors
      bp        Shear    Stretch   Stagger    Buckle  Propeller  Opening  angle
*    1 g+G      -0.16      7.27     -0.92    -14.69    -24.05     13.32   28.2

The corresponding local bp parameters as originally reported by 3DNA/DSSR are as follows. Note the significant differences in Shear vs. Stretch, and Buckle vs. Propeller in the two sets of bp parameters. On the other hand, Stagger is identical and Opening should be quite close, by definition. Due to the similarity in Stagger and Opening, DSSR only reports four ‘simple’ parameters (i.e., Shear, Stretch, Buckle, and Propeller).

Local base-pair parameters
     bp        Shear    Stretch   Stagger    Buckle  Propeller  Opening
    1 g+G      -7.21     -0.97     -0.92     25.58    -11.83     13.07

Base-pair U6–A67 (Watson-Crick U–A, of type M–N, see Fig. 3)


U6–A67 pair in tRNA (1ehz)
Fig. 3: Base pair U6–A67 (U–A) of type M–N in yeast phenylalanine tRNA 1ehz

Base reference frames

In the original coordinate system (as in 1ehz.pdb downloaded from the RCSB PDB), the base-reference frames for U6 and A67 are:

# base reference frame of U6 (white in Fig. 3)
  "rsmd": 0.010835,
  "origin": [60.441988, 48.83479, 41.242523],  # o1
  "x_axis": [0.28491, 0.503019, 0.815965],     # x1
  "y_axis": [0.887155, -0.460753, -0.025726],  # y1
  "z_axis": [0.363018, 0.731217, -0.577529]    # z1
# base reference frame of A67 (colored yellow in Fig. 3)
  "rsmd": 0.01992,
  "origin": [60.578326, 48.823104, 41.154211], # o2
  "x_axis": [0.034097, 0.205538, 0.978055],    # x2
  "y_axis": [-0.90687, 0.417653, -0.056155],   # y2
  "z_axis": [-0.420029, -0.885054, 0.200637]   # z2

The base-pair reference frame

Since dot(z1, z2) = -0.92 (negative), this pair is of type M–N in 3DNA/DSSR. The y- and z-axes of A67 are thus reversed (corresponding to a 180º rotation around the x-axis) to align z2 with z1.

# base reference frame of A67, with y- and z-axes reversed
  "origin": [60.578326, 48.823104, 41.154211], # o2
  "x_axis": [0.034097, 0.205538, 0.978055],    # x2
  "y_axis": [0.90687, -0.417653, 0.056155],    # y2 -- reversed
  "z_axis": [0.420029, 0.885054, -0.200637]    # z2 -- reversed
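The type decision and the frame reversal can be sketched as below. The dictionary layout mirrors the frame listings above; the function names pair_type and flip_about_x are hypothetical, and the code uses an ASCII hyphen in "M-N" for convenience.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def pair_type(frame1, frame2):
    """Return 'M+N' if the two base normals point the same way, else 'M-N'."""
    return "M+N" if dot(frame1["z_axis"], frame2["z_axis"]) > 0 else "M-N"

def flip_about_x(frame):
    """Rotate a frame by 180 degrees about its own x-axis (reverse y and z)."""
    flipped = dict(frame)
    flipped["y_axis"] = [-c for c in frame["y_axis"]]
    flipped["z_axis"] = [-c for c in frame["z_axis"]]
    return flipped

u6 = {
    "origin": [60.441988, 48.83479, 41.242523],
    "x_axis": [0.28491, 0.503019, 0.815965],
    "y_axis": [0.887155, -0.460753, -0.025726],
    "z_axis": [0.363018, 0.731217, -0.577529],
}
a67 = {
    "origin": [60.578326, 48.823104, 41.154211],
    "x_axis": [0.034097, 0.205538, 0.978055],
    "y_axis": [-0.90687, 0.417653, -0.056155],
    "z_axis": [-0.420029, -0.885054, 0.200637],
}

kind = pair_type(u6, a67)
# For an M-N pair, reverse y and z of the second base so that z2 aligns with z1
a67_aligned = flip_about_x(a67) if kind == "M-N" else a67

print(kind, "dot(z1, z2) = %.2f" % dot(u6["z_axis"], a67["z_axis"]))
print("y2 reversed:", a67_aligned["y_axis"])
print("z2 reversed:", a67_aligned["z_axis"])
```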

Thereafter, the procedure is similar to the one for the M+N type above. Note here U6 is a pyrimidine, so its C6 atom is used. The final results are:

# C6 atom of U6 and C8 atom A67 in 1ehz
ATOM    132  C6    U A   6      64.926  46.497  41.084  1.00 35.72           C  
ATOM   1457  C8    A A  67      56.129  50.866  40.893  1.00 40.04           C  
y0 = [64.926  46.497  41.084] - [56.129  50.866  40.893]
   = [8.797  -4.369   0.191]
x = [0.160777, 0.363836, 0.917482]
y = [0.902274, -0.430972, 0.012793]
z = [0.400064, 0.825764, -0.397570]

The six ‘simple’ and original base-pair parameters

Simple base-pair parameters based on YC6-RC8 vectors
      bp        Shear    Stretch   Stagger    Buckle  Propeller  Opening  angle
     1 U-A       0.06     -0.13     -0.08     -0.59    -23.71      5.39   23.7
# ------------
Local base-pair parameters
     bp        Shear    Stretch   Stagger    Buckle  Propeller  Opening
    1 U-A       0.06     -0.13     -0.08     -0.63    -23.71      5.50

As can be seen, for Watson-Crick pairs, the ‘simple’ and the original bp parameters are very similar.

Special notes on the ‘simple’ base-pair parameters

  • For the most common Watson-Crick pairs, the newly introduced ‘simple’ bp parameters match those of the original 3DNA/DSSR parameters very well (as shown by the U6–A67 pair). For non-canonical pairs, significant differences in Shear, Stretch, Buckle and Propeller are expected (as illustrated by the 2MG10+G45 pair). The differences come from the divergent definitions of the bp reference frame, which is distinct for each type of non-canonical pair.
  • Only the original 3DNA/DSSR six bp parameters can be used for exact reconstruction (with the 3DNA rebuild program) of the corresponding bp geometry. The ‘simple’ bp parameters are for description only, and they could be more intuitive than the original 3DNA/DSSR counterparts. They complement, but by no means replace, the classic “local” bp parameters. The term ‘simple’ is used to distinguish the new bp parameters from the original ones, which are closely related yet quite different.
  • As detailed for the 2MG10+G45 pair, several ad hoc decisions are made in deriving the ‘simple’ bp parameters. For example, instead of using RC8–YC6 to define the y-axis, one can also use RN9–YN1 (as done by Richardson). Each such choice will lead to (slightly) different numerical values, depending on the type of the non-canonical pair. In some cases, Buckle and Propeller could differ by several degrees. Since RC8 and YC6 atoms lie near the ‘center’ of purines and pyrimidines, they are used to define the y-axis (by default). DSSR has provisions for selecting RN9–YN1, as well as a couple of other choices, for the definition of the y-axis.
  • When the M+N pair is counted as N+M, Shear, Stretch, Buckle, and Propeller remain the same, but Stagger and Opening reverse their signs. For example, here are the results of 2MG10+G45 vs. G45+2MG10:
# 2MG10+G45
Simple base-pair parameters based on YC6-RC8 vectors
      bp        Shear    Stretch   Stagger    Buckle  Propeller  Opening  angle
*    1 g+G      -0.16      7.27     -0.92    -14.69    -24.05     13.32   28.2
# Reverse the order: treated as G45+2MG10
Simple base-pair parameters based on YC6-RC8 vectors
      bp        Shear    Stretch   Stagger    Buckle  Propeller  Opening  angle
*    1 G+g      -0.16      7.27      0.92    -14.69    -24.05    -13.32   28.2
  • When the M–N pair is counted as N–M, Stretch, Stagger, Propeller, and Opening remain the same, but Shear and Buckle reverse their signs. For example, here are the results of U6–A67 vs. A67–U6:
# U6–A67
Simple base-pair parameters based on YC6-RC8 vectors
      bp        Shear    Stretch   Stagger    Buckle  Propeller  Opening  angle
     1 U-A       0.06     -0.13     -0.08     -0.59    -23.71      5.39   23.7
# Reverse the order: treated as A67–U6
Simple base-pair parameters based on YC6-RC8 vectors
      bp        Shear    Stretch   Stagger    Buckle  Propeller  Opening  angle
     1 A-U      -0.06     -0.13     -0.08      0.59    -23.71      5.39   23.7
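The two sign rules for reversing the order of a pair can be summarized in a small lookup table. The sketch below is an illustration only; the names SIGN_FLIPS and reverse_pair_order are hypothetical, and the parameter values are copied from the tables in this post.

```python
# Parameters whose sign reverses when the order of the two bases is swapped
SIGN_FLIPS = {
    "M+N": ("Stagger", "Opening"),
    "M-N": ("Shear", "Buckle"),
}

def reverse_pair_order(params, pair_type):
    """Return the 'simple' bp parameters for the pair taken in reverse order."""
    flips = SIGN_FLIPS[pair_type]
    return {k: (-v if k in flips else v) for k, v in params.items()}

# 2MG10+G45 (type M+N) and U6-A67 (type M-N)
g_G = {"Shear": -0.16, "Stretch": 7.27, "Stagger": -0.92,
       "Buckle": -14.69, "Propeller": -24.05, "Opening": 13.32}
u_A = {"Shear": 0.06, "Stretch": -0.13, "Stagger": -0.08,
       "Buckle": -0.59, "Propeller": -23.71, "Opening": 5.39}

print(reverse_pair_order(g_G, "M+N"))  # G45+2MG10
print(reverse_pair_order(u_A, "M-N"))  # A67-U6
```

Applying the function twice returns the original values, as expected for a reversal of order.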





Fitting of base reference frame

Once a nucleotide (nt) is identified, and matched to A (C, G, T, U) for the standard case or a (c, g, t, u) for a modified one, 3DNA/DSSR performs a least-squares fitting procedure to locate the base reference frame in three-dimensional space. The basic idea is very simple and widely applicable. The algorithm constitutes one of the key components of 3DNA/DSSR. As always, the details can be most effectively illustrated with a worked example. Using G1 in the yeast phenylalanine tRNA (PDB id: 1ehz) as an example, the atomic coordinates of its nine base-ring atoms are:

# G1, nine base-ring atoms for ls-fitting
ATOM     14  N9    G A   1      51.628  45.992  53.798  1.00 93.67           N  
ATOM     15  C8    G A   1      51.064  46.007  52.547  1.00 92.60           C  
ATOM     16  N7    G A   1      51.379  44.966  51.831  1.00 91.19           N  
ATOM     17  C5    G A   1      52.197  44.218  52.658  1.00 91.47           C  
ATOM     18  C6    G A   1      52.848  42.992  52.425  1.00 90.68           C  
ATOM     20  N1    G A   1      53.588  42.588  53.534  1.00 90.71           N  
ATOM     21  C2    G A   1      53.685  43.282  54.716  1.00 91.21           C  
ATOM     23  N3    G A   1      53.077  44.429  54.946  1.00 91.92           N  
ATOM     24  C4    G A   1      52.356  44.836  53.879  1.00 92.62           C  

The corresponding nine base-ring atoms of G in its standard base reference frame are listed below. See Table 1 of the report A Standard Reference Frame for the Description of Nucleic Acid Base-pair Geometry, and file Atomic_G.pdb distributed with 3DNA ($X3DNA/config/Atomic_G.pdb). In DSSR, the content has been integrated into the source code to make the program self-contained.

# G in standard base reference frame
ATOM      2  N9    G A   1      -1.289   4.551   0.000
ATOM      3  C8    G A   1       0.023   4.962   0.000
ATOM      4  N7    G A   1       0.870   3.969   0.000
ATOM      5  C5    G A   1       0.071   2.833   0.000
ATOM      6  C6    G A   1       0.424   1.460   0.000
ATOM      8  N1    G A   1      -0.700   0.641   0.000
ATOM      9  C2    G A   1      -1.999   1.087   0.000
ATOM     11  N3    G A   1      -2.342   2.364   0.001
ATOM     12  C4    G A   1      -1.265   3.177   0.000

A least-squares fitting of the standard onto the experimental set of base-ring atoms defines the base reference frame (Fig. 1). The information is available via the following commands:

# find_pair -s 1ehz.pdb # in file 'ref_frames.dat'
...     1 G   # A:...1_:[..G]G
   53.7571    41.8678    52.9303  # origin
   -0.2589    -0.2496    -0.9331  # x-axis
   -0.5430     0.8365    -0.0731  # y-axis
    0.7988     0.4878    -0.3521  # z-axis
# --------
# x3dna-dssr -i=1ehz.pdb --json | jq .nts[0].frame
  rsmd: 0.008,
  origin: [53.757, 41.868, 52.93],
  x_axis: [-0.259, -0.25, -0.933],
  y_axis: [-0.543, 0.837, -0.073],
  z_axis: [0.799, 0.488, -0.352]
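For readers who want to see what such a fit involves, here is a minimal pure-Python sketch of a least-squares superposition of the standard G ring onto G1, using a quaternion-based closed-form solution with power iteration for the dominant eigenvector. This is an assumption about one workable approach, not a copy of the 3DNA/DSSR source; the function names ls_fit and centroid are hypothetical.

```python
import math

# Nine base-ring atoms: standard G (Atomic_G.pdb) and G1 of 1ehz, in matching order
# N9, C8, N7, C5, C6, N1, C2, N3, C4
STD_G = [(-1.289, 4.551, 0.000), (0.023, 4.962, 0.000), (0.870, 3.969, 0.000),
         (0.071, 2.833, 0.000), (0.424, 1.460, 0.000), (-0.700, 0.641, 0.000),
         (-1.999, 1.087, 0.000), (-2.342, 2.364, 0.001), (-1.265, 3.177, 0.000)]
G1 = [(51.628, 45.992, 53.798), (51.064, 46.007, 52.547), (51.379, 44.966, 51.831),
      (52.197, 44.218, 52.658), (52.848, 42.992, 52.425), (53.588, 42.588, 53.534),
      (53.685, 43.282, 54.716), (53.077, 44.429, 54.946), (52.356, 44.836, 53.879)]

def centroid(pts):
    return [sum(p[i] for p in pts) / len(pts) for i in range(3)]

def ls_fit(std, exp):
    """Fit std onto exp; return (origin, R, rmsd) such that exp ~ R * std + origin."""
    cs, ce = centroid(std), centroid(exp)
    s = [[p[i] - cs[i] for i in range(3)] for p in std]
    e = [[p[i] - ce[i] for i in range(3)] for p in exp]
    # S[a][b] = sum over atoms of (std coordinate a) * (exp coordinate b)
    S = [[sum(u[a] * v[b] for u, v in zip(s, e)) for b in range(3)] for a in range(3)]
    (Sxx, Sxy, Sxz), (Syx, Syy, Syz), (Szx, Szy, Szz) = S
    N = [[Sxx + Syy + Szz, Syz - Szy, Szx - Sxz, Sxy - Syx],
         [Syz - Szy, Sxx - Syy - Szz, Sxy + Syx, Szx + Sxz],
         [Szx - Sxz, Sxy + Syx, -Sxx + Syy - Szz, Syz + Szy],
         [Sxy - Syx, Szx + Sxz, Syz + Szy, -Sxx - Syy + Szz]]
    # Shift the spectrum to be non-negative, then iterate towards the top eigenvector
    shift = max(sum(abs(v) for v in row) for row in N)
    q = [1.0, 0.5, 0.25, 0.125]
    for _ in range(1000):
        q = [sum(N[i][j] * q[j] for j in range(4)) + shift * q[i] for i in range(4)]
        norm = math.sqrt(sum(c * c for c in q))
        q = [c / norm for c in q]
    q0, qx, qy, qz = q
    # Rotation matrix corresponding to the unit quaternion
    R = [[q0*q0 + qx*qx - qy*qy - qz*qz, 2*(qx*qy - q0*qz), 2*(qx*qz + q0*qy)],
         [2*(qy*qx + q0*qz), q0*q0 - qx*qx + qy*qy - qz*qz, 2*(qy*qz - q0*qx)],
         [2*(qz*qx - q0*qy), 2*(qz*qy + q0*qx), q0*q0 - qx*qx - qy*qy + qz*qz]]
    origin = [ce[i] - sum(R[i][j] * cs[j] for j in range(3)) for i in range(3)]
    sq = 0.0
    for p, t in zip(std, exp):
        fitted = [origin[i] + sum(R[i][j] * p[j] for j in range(3)) for i in range(3)]
        sq += sum((fitted[i] - t[i]) ** 2 for i in range(3))
    return origin, R, math.sqrt(sq / len(std))

origin, R, rmsd = ls_fit(STD_G, G1)
# The frame axes are the images of the standard unit axes, i.e. the columns of R
x_axis, y_axis, z_axis = ([R[i][k] for i in range(3)] for k in range(3))
print("rmsd   %.3f" % rmsd)
print("origin", [round(c, 4) for c in origin])
for name, axis in (("x-axis", x_axis), ("y-axis", y_axis), ("z-axis", z_axis)):
    print(name, [round(c, 4) for c in axis])
```

The origin is where the fit places the standard frame's (0, 0, 0), and the output can be compared with the find_pair and DSSR listings above.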

G1 in yeast tRNA
Fig. 1: G1 in tRNA 1ehz, with base reference frame attached

Please note the following subtle points:

  • The standard base (Atomic_G.pdb) is already set in its reference frame: the z-coordinates are virtually zero, the y-coordinates are positive, and the atoms along the minor-groove edge have negative x-coordinates, as can be visualized clearly from the attached coordinate frame. In 3DNA, the five standard bases are stored in files Atomic_[ACGTU].pdb, and the corresponding modified ones are in Atomic_[acgtu].pdb. For simplicity, Atomic_A.pdb and Atomic_a.pdb are the same by default, as are the other four cases.
  • The translation and rotation of the least-squares fitting process define the experimental base reference frame (for G1 in the above example), and its three axes are orthonormal by definition.
  • By design, the base rings of Atomic_A.pdb and Atomic_G.pdb match each other closely (see below), as do those of the pyrimidine bases. The least-squares fitted root-mean-square deviation (rmsd) of the nine base-ring atoms between standard A and G is only 0.04 Å. Fitting the standard A (instead of G) onto G1 of 1ehz leads to a base reference frame that is essentially indistinguishable from the one above (see below). This feature shows that any ambiguity in assigning modified purines to A or G, or pyrimidines to C, T, or U causes no notable differences in 3DNA/DSSR results.
Comparison of base-ring atomic coordinates in standard G and A
          Atomic_G.pdb                         Atomic_A.pdb
N9 G   -1.289   4.551   0.000   |   N9  A   -1.291   4.498   0.000
C8 G    0.023   4.962   0.000   |   C8  A    0.024   4.897   0.000
N7 G    0.870   3.969   0.000   |   N7  A    0.877   3.902   0.000
C5 G    0.071   2.833   0.000   |   C5  A    0.071   2.771   0.000
C6 G    0.424   1.460   0.000   |   C6  A    0.369   1.398   0.000
N1 G   -0.700   0.641   0.000   |   N1  A   -0.668   0.532   0.000
C2 G   -1.999   1.087   0.000   |   C2  A   -1.912   1.023   0.000
N3 G   -2.342   2.364   0.001   |   N3  A   -2.320   2.290   0.000
C4 G   -1.265   3.177   0.000   |   C4  A   -1.267   3.124   0.000
Comparison of G1 (1ehz) base reference frame derived using standard G or A
             Atomic_G.pdb                |             Atomic_A.pdb
 53.7571    41.8678    52.9303  # origin | 53.7286    41.9276    52.9482  # origin
 -0.2589    -0.2496    -0.9331  # x-axis | -0.2562    -0.2540    -0.9327  # x-axis
 -0.5430     0.8365    -0.0731  # y-axis | -0.5444     0.8352    -0.0780  # y-axis
  0.7988     0.4878    -0.3521  # z-axis |  0.7988     0.4878    -0.3522  # z-axis




Automatic identification of nucleotides

Any analysis of nucleic acid structures starts with the identification of nucleotides (nts), the basic building units. As per the PDB convention, each nt (like any other ligand) is specified by a three-letter identifier. For example, the four standard RNA nts are ..A, ..C, ..G, and ..U, respectively. The four corresponding standard DNA nts are .DA, .DC, .DG, and .DT, respectively. Note that here, for visualization purposes, each space is represented by a dot (.). In practice, the following codes for the five standard DNA/RNA nts (ADE, CYT, GUA, THY, and URA) are also commonly encountered, among other variants.

On top of the standard nts, there are numerous modified ones, each assigned a unique three-letter code. In the classic yeast phenylalanine tRNA (PDB id: 1ehz), 14 out of the 76 nts are modified, as shown in Fig. 1 below.

Modified nucleotides in yeast tRNA
Fig. 1: Modified nucleotides in yeast phenylalanine tRNA 1ehz

It is challenging to maintain a comprehensive and up-to-date list of the ever-increasing number of nts encountered in the PDB and molecular dynamics (MD) simulation packages (e.g., AMBER, GROMACS, and CHARMM). Thus, as of today, some well-known DNA/RNA structural bioinformatics tools can handle only standard nts or a limited list of modified ones.

From early on in the development of 3DNA, I observed that all recognized nts have a core six-membered ring, with atoms named N1,C2,N3,C4,C5,C6 consecutively (see Fig. 2 below). Purines have three additional atoms, named N7,C8,N9. So it is feasible to automatically identify nts, and classify them as pyrimidines and purines, based on the common core skeleton shared by all of them. Moreover, the ‘skeleton’ is not affected by any possible tautomeric or protonation state.

Common names of core base atoms
Fig. 2: Identification of nts in 3DNA/DSSR based on atomic names and planar geometry
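The name-based part of this idea can be sketched in a few lines. Note that this sketch looks at atom names only and requires all six core atoms, whereas the actual 3DNA/DSSR algorithm also checks the geometry; the function name classify_by_names and the example atom lists are hypothetical.

```python
CORE_RING = {"N1", "C2", "N3", "C4", "C5", "C6"}
PURINE_EXTRA = {"N7", "C8", "N9"}

def classify_by_names(atom_names):
    """Classify a residue from its atom names alone.

    Returns 'purine', 'pyrimidine', or None if the six core ring atoms
    are not all present. Geometry is deliberately ignored in this sketch.
    """
    names = set(atom_names)
    if not CORE_RING <= names:
        return None
    return "purine" if PURINE_EXTRA <= names else "pyrimidine"

# Illustrative (abbreviated) atom-name lists
guanine_like = ["P", "C1'", "N9", "C8", "N7", "C5", "C6", "O6", "N1", "C2", "N2", "N3", "C4"]
uracil_like = ["P", "C1'", "N1", "C2", "O2", "N3", "C4", "O4", "C5", "C6"]
primed_ring = ["N1'", "C2'", "N3'", "C4'", "C5'", "C6'"]  # primed names do not match

print(classify_by_names(guanine_like))
print(classify_by_names(uracil_like))
print(classify_by_names(primed_ring))
```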

Early versions of 3DNA employed only three atoms (N1, C2 and C6) and three distances to identify an nt. Purines were further discriminated by the N9 atom and the N1–N9 distance. While developing DSSR, I revised the nt-identification algorithm by using a least-squares fitting procedure that makes use of all available base-ring atoms instead of selected ones. The same new algorithm has also been incorporated into the 3DNA programs (find_pair, analyze, etc.), as of v2.2.

As always, the idea can be best illustrated with a worked example. Guanine in its standard base reference frame, with the following list of nine ring atoms coordinates, is chosen for the least-squares fitting. See file Atomic_G.pdb in the 3DNA distribution, and also Table 1 of the report A Standard Reference Frame for the Description of Nucleic Acid Base-pair Geometry.

ATOM      2  N9    G A   1      -1.289   4.551   0.000
ATOM      3  C8    G A   1       0.023   4.962   0.000
ATOM      4  N7    G A   1       0.870   3.969   0.000
ATOM      5  C5    G A   1       0.071   2.833   0.000
ATOM      6  C6    G A   1       0.424   1.460   0.000
ATOM      8  N1    G A   1      -0.700   0.641   0.000
ATOM      9  C2    G A   1      -1.999   1.087   0.000
ATOM     11  N3    G A   1      -2.342   2.364   0.001
ATOM     12  C4    G A   1      -1.265   3.177   0.000

With the ls-fitting procedure, any three ring atoms suffice. There is no longer a need to select specific atoms explicitly, as was done previously (N1, C2, C6 and N9), which allows for possible modifications of these atoms.

Using four nts (G1, 2MG10, H2U16, and PSU39, see Fig. 1 above) of 1ehz as examples, the following list gives the atomic coordinates of base-ring atoms, and the root-mean-square deviations (rmsd) of the least-squares fit. Of course, when performing least-squares fitting, the names of corresponding atoms must match (note the different ordering of atoms for H2U and PSU in the list vs the above standard G reference).

#G1, rmsd=0.008
ATOM     14  N9    G A   1      51.628  45.992  53.798  1.00 93.67           N  
ATOM     15  C8    G A   1      51.064  46.007  52.547  1.00 92.60           C  
ATOM     16  N7    G A   1      51.379  44.966  51.831  1.00 91.19           N  
ATOM     17  C5    G A   1      52.197  44.218  52.658  1.00 91.47           C  
ATOM     18  C6    G A   1      52.848  42.992  52.425  1.00 90.68           C  
ATOM     20  N1    G A   1      53.588  42.588  53.534  1.00 90.71           N  
ATOM     21  C2    G A   1      53.685  43.282  54.716  1.00 91.21           C  
ATOM     23  N3    G A   1      53.077  44.429  54.946  1.00 91.92           N  
ATOM     24  C4    G A   1      52.356  44.836  53.879  1.00 92.62           C  
#2MG10, rmsd=0.018
HETATM  207  N9  2MG A  10      61.581  47.402  18.752  1.00 42.14           N  
HETATM  208  C8  2MG A  10      62.199  48.621  18.635  1.00 40.38           C  
HETATM  209  N7  2MG A  10      63.494  48.534  18.422  1.00 40.70           N  
HETATM  210  C5  2MG A  10      63.745  47.167  18.395  1.00 43.82           C  
HETATM  211  C6  2MG A  10      64.965  46.449  18.205  1.00 43.45           C  
HETATM  213  N1  2MG A  10      64.767  45.086  18.293  1.00 44.71           N  
HETATM  214  C2  2MG A  10      63.541  44.482  18.486  1.00 47.21           C  
HETATM  217  N3  2MG A  10      62.411  45.125  18.614  1.00 45.85           N  
HETATM  218  C4  2MG A  10      62.574  46.451  18.582  1.00 43.27           C  
#H2U16, rmsd=0.188
HETATM  336  N1  H2U A  16      77.347  53.323  34.582  1.00 91.19           N  
HETATM  337  C2  H2U A  16      76.119  52.865  34.160  1.00 92.39           C  
HETATM  339  N3  H2U A  16      75.123  52.894  35.107  1.00 93.28           N  
HETATM  340  C4  H2U A  16      75.289  52.711  36.458  1.00 93.34           C  
HETATM  342  C5  H2U A  16      76.696  52.479  36.909  1.00 93.77           C  
HETATM  343  C6  H2U A  16      77.717  53.238  36.039  1.00 93.22           C  
#PSU39, rmsd=0.004
HETATM  845  N1  PSU A  39      74.080  36.066   5.459  1.00 75.82           N  
HETATM  846  C2  PSU A  39      74.415  36.835   4.354  1.00 75.59           C  
HETATM  847  N3  PSU A  39      75.735  36.769   3.984  1.00 76.29           N  
HETATM  848  C4  PSU A  39      76.728  36.038   4.591  1.00 77.28           C  
HETATM  849  C5  PSU A  39      76.307  35.280   5.732  1.00 77.93           C  
HETATM  850  C6  PSU A  39      75.025  35.316   6.112  1.00 76.07           C  

As noted in the DSSR paper, the rmsd is normally <0.1 Å since base rings are rigid. To account for experimental error and special non-planar cases, such as H2U in 1ehz, the default rmsd cutoff is set to 0.28 Å.
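To make the fitting step concrete, here is a minimal Python sketch of a least-squares superposition rmsd. It is an illustration under stated assumptions, not DSSR's actual code: it uses Horn's quaternion formulation with a plain power iteration (chosen only to keep the example dependency-free), and because the standard G reference coordinates are not repeated in this post, it fits the 2MG10 ring onto the G1 ring from the list above instead. The names ls_fit_rmsd and centered are hypothetical.

```python
import math

# Base-ring atoms in the order N9, C8, N7, C5, C6, N1, C2, N3, C4,
# copied from the 1ehz listing above (G1 and 2MG10).
G1 = [
    (51.628, 45.992, 53.798), (51.064, 46.007, 52.547), (51.379, 44.966, 51.831),
    (52.197, 44.218, 52.658), (52.848, 42.992, 52.425), (53.588, 42.588, 53.534),
    (53.685, 43.282, 54.716), (53.077, 44.429, 54.946), (52.356, 44.836, 53.879),
]
MG10 = [
    (61.581, 47.402, 18.752), (62.199, 48.621, 18.635), (63.494, 48.534, 18.422),
    (63.745, 47.167, 18.395), (64.965, 46.449, 18.205), (64.767, 45.086, 18.293),
    (63.541, 44.482, 18.486), (62.411, 45.125, 18.614), (62.574, 46.451, 18.582),
]


def centered(points):
    """Translate (x, y, z) tuples so that their centroid sits at the origin."""
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    return [tuple(p[k] - c[k] for k in range(3)) for p in points]


def ls_fit_rmsd(ref, mob, iterations=500):
    """Rmsd after optimal rigid-body superposition of mob onto ref.

    Atoms must appear in the same order in both lists (matched by name).
    """
    if len(ref) != len(mob) or not ref:
        raise ValueError("need two non-empty, equally sized atom lists")
    a, b = centered(ref), centered(mob)
    # 3x3 correlation matrix between the two centered coordinate sets
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = [
        [sum(p[i] * q[j] for p, q in zip(a, b)) for j in range(3)] for i in range(3)
    ]
    # Horn's symmetric 4x4 matrix; its largest eigenvalue gives the best fit
    nmat = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    g = sum(x * x for p in a + b for x in p)  # summed squared size of both sets
    if g == 0.0:
        return 0.0
    shift = g / 2.0  # nmat + shift*I is then positive semi-definite
    v = [1.0, 0.5, 0.25, 0.125]
    for _ in range(iterations):  # power iteration for the dominant eigenvector
        w = [sum(nmat[i][j] * v[j] for j in range(4)) + shift * v[i] for i in range(4)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[i] * nmat[i][j] * v[j] for i in range(4) for j in range(4))
    return math.sqrt(max(0.0, g - 2.0 * lam) / len(a))


rmsd = ls_fit_rmsd(G1, MG10)
print(f"G1 vs 2MG10 ring rmsd = {rmsd:.3f} A; below the 0.28 A cutoff: {rmsd < 0.28}")
```

The same function applies to the six-atom pyrimidine rings, provided the two atom lists are given in the same N1, C2, N3, C4, C5, C6 order.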

With the above detailed algorithm, DSSR (and the 3DNA find_pair/analyze programs) can automatically identify virtually all ‘recognizable’ nts in the PDB. A survey performed in June 2015 detected 630 different types of modified nucleotides in the PDB.

It is worth noting the following points:

  • The choice of standard G instead of A as the reference base has no impact on the results. As a matter of fact, the rmsd between G and A is only 0.04 Å. Note also the generous default cutoff of 0.28 Å.
  • The method obviously depends on proper naming of the ring atoms. Specifically, the base ring atoms must be named N1,C2,N3,C4,C5,C6 consecutively, with purines having three additional atoms named N7,C8,N9. Thus, under this scheme, TPP (thiamine diphosphate) would not be recognized as a nt by default, simply because of the extra prime (′) of atoms in the six-membered ring. In nucleic acid structures, the prime symbol is normally associated with atoms of the sugar moiety (e.g., the C5′ atom).

Molecular image of TPP (thiamine diphosphate)
Fig. 3: TPP (thiamine diphosphate) would not be recognized as a nt.

  • On the other hand, nt cofactors in an otherwise ‘pure’ protein structure will also be recognized. One example is the two AMP (adenosine monophosphate) ligands in PDB entry 12as. This extra identification of nts does no harm in such cases. As shown in the analysis of the SAM-I riboswitch in the DSSR paper, taking the SAM ligand as a nt in base triplet recognition is a neat feature.
  • Once a nucleotide has been identified and classified as a purine or a pyrimidine, exocyclic atoms can be used for further assignment: O6 or N2 distinguishes guanine from adenine, N4 separates cytosine from thymine and uracil, and C7 (or C5M, the methyl group) differentiates thymine from uracil. For some modified nts, the distinctions within purines or pyrimidines may not be that obvious. For example, inosine may be taken as a modified guanine or adenine. However, this ambiguity has no significant effect on the calculated base-pair parameters.
  • In DSSR and 3DNA, each identified nt is assigned a one-letter shorthand code: the standard ..A, .DA, and ADE (among a few other common variations) are shortened to upper-case A, and similarly for C, G, T, and U. Modified nts, on the other hand, are shortened to their corresponding lower-case symbol. For example, modified guanines such as 2MG and M2G in the yeast phenylalanine tRNA (see Fig. 1 above) are assigned g. So in 3DNA/DSSR output, the upper and lower cases of bases (e.g., nts=3 gCG A.2MG10,A.C25,A.G45) convey special meanings.
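The naming rules in the points above can be summarized in a short Python sketch. It is a simplified illustration only: it checks atom names but skips the least-squares rmsd test, the function name one_letter_code is hypothetical, and the set of standard residue names is a small assumed subset rather than the list DSSR actually uses.

```python
RING6 = {"N1", "C2", "N3", "C4", "C5", "C6"}
PURINE_EXTRA = {"N7", "C8", "N9"}
# Assumed, incomplete subset of standard residue names (for illustration only).
STANDARD_NAMES = {"A", "DA", "ADE", "C", "DC", "CYT", "G", "DG", "GUA",
                  "T", "DT", "THY", "U", "URA"}


def one_letter_code(resname, atom_names):
    """Return a one-letter base code, or None if the ring atoms are incomplete."""
    atoms = set(atom_names)
    if not RING6 <= atoms:          # e.g. TPP, whose ring atoms carry a prime
        return None
    if PURINE_EXTRA <= atoms:       # purine: O6 or N2 marks guanine
        base = "G" if atoms & {"O6", "N2"} else "A"
    elif "N4" in atoms:             # pyrimidine with N4: cytosine
        base = "C"
    elif atoms & {"C7", "C5M"}:     # methyl group: thymine
        base = "T"
    else:
        base = "U"
    # Upper case for standard residues, lower case for modified ones.
    return base if resname.strip() in STANDARD_NAMES else base.lower()


purine = ["N9", "C8", "N7", "C5", "C6", "N1", "C2", "N3", "C4"]
pyrimidine = ["N1", "C2", "N3", "C4", "C5", "C6"]
triplet = [("2MG", purine + ["O6", "N2"]),
           ("C", pyrimidine + ["O2", "N4"]),
           ("G", purine + ["O6", "N2"])]
print("".join(one_letter_code(name, atoms) for name, atoms in triplet))  # gCG
```

With these inputs the sketch reproduces the gCG shorthand quoted above for A.2MG10, A.C25, and A.G45.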




'Simple' parameters for non-Watson-Crick base pairs

As of v2.3-2016jan01, the 3DNA analyze program outputs a list of new ‘simple’ base-pair and step parameters, by default. Shown below is a sample output for PDB entry 1xvk. This echinomycin-(GCGTACGC)2 complex has a single DNA strand as the asymmetric unit. 3DNA needs the biological unit (1xvk.pdb1) to analyze the duplex (with the -symm option). This structure contains two Hoogsteen base pairs, and has come up on the 3DNA Forum because of its zero or negative Rise values. Note that the ‘simple’ Rise values are all positive; for the middle (#4) TA/TA step, it is now 3.09 Å instead of 0.

# find_pair -symm 1xvk.pdb1 1xvk.bps
# analyze -symm 1xvk.bps
#   OR by combining the above two commands:
# find_pair -symm 1xvk.pdb1 | analyze -symm
# The output is in file '1xvk.out'
This structure contains 4 non-Watson-Crick (with leading *) base pair(s)
Simple base-pair parameters based on RC8--YC6 vectors
      bp        Shear    Stretch   Stagger    Buckle  Propeller  Opening
*    1 G+C      -3.07      1.55     -0.35     -6.98      0.29     67.33
     2 C-G       0.27     -0.17      0.35    -22.34      3.33     -2.80
     3 G-C      -0.39     -0.17      0.41     22.91      1.81     -2.73
*    4 T+A      -3.29      1.56      0.31     -8.03      1.59    -70.46
*    5 A+T      -3.29      1.56     -0.31     -8.03      1.59     70.46
     6 C-G       0.39     -0.17      0.41    -22.91      1.81     -2.72
     7 G-C      -0.27     -0.17      0.35     22.34      3.32     -2.80
*    8 C+G      -3.07      1.55      0.35     -6.98      0.30    -67.33
       ave.     -1.59      0.69      0.19     -3.75      1.75     -1.38
       s.d.      1.72      0.92      0.32     17.57      1.15     52.11
Simple base-pair step parameters based on consecutive C1'-C1' vectors
      step       Shift     Slide      Rise      Tilt      Roll     Twist
*    1 GC/GC     -0.55      0.39      7.41      6.40     -4.22     23.36
     2 CG/CG     -0.05      0.87      2.44     -0.55      3.94     -0.81
*    3 GT/AC      0.38      0.47      7.23     -8.62      3.75     25.70
*    4 TA/TA     -0.00      4.73      3.09     -0.00      7.49     25.67
*    5 AC/GT     -0.38      0.47      7.23      8.62      3.75     25.70
     6 CG/CG      0.05      0.87      2.44      0.55      3.94     -0.82
*    7 GC/GC      0.55      0.39      7.41     -6.40     -4.22     23.36
        ave.     -0.00      1.17      5.32     -0.00      2.06     17.45
        s.d.      0.39      1.59      2.50      6.21      4.49     12.52
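As a quick arithmetic check on the listing above, the ave. and s.d. lines for Rise can be recomputed from the seven step values. The sketch assumes, since the output does not say so, that s.d. is the sample standard deviation (n - 1 in the denominator).

```python
import statistics

# 'Simple' Rise values for steps 1-7 of 1xvk, copied from the listing above
rise = [7.41, 2.44, 7.23, 3.09, 7.23, 2.44, 7.41]

ave = statistics.mean(rise)
sd = statistics.stdev(rise)  # sample standard deviation (n - 1 denominator)
print(f"ave. {ave:.2f}  s.d. {sd:.2f}")  # ave. 5.32  s.d. 2.50
```

The population formula (statistics.pstdev) gives a smaller value that does not match the listing, which supports the sample-s.d. reading.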

The simple parameters are ‘intuitive’ for non-Watson-Crick base pairs and associated base-pair steps, where the existing standard-reference-frame-based 3DNA parameters may look weird. Note that these simple parameters are for structural description only, not to be fed into the ‘rebuild’ program. Overall, they complement the rigorous characterization of base-pair geometry, as demonstrated by the original analyze/rebuild pair of programs in 3DNA.

In short, the ‘simple’ base-pair parameters employ the YC6--RC8 vector as the y-axis whereas the ‘simple’ step parameters use consecutive C1'-C1' vectors. As before, the z-axis is the average of two base normals, taking into account the M–N vs M+N base-pair classification. In essence, the ‘simple’ parameters make geometrical sense by introducing an ad hoc base-pair reference frame in each case. More details will be provided in a series of blog posts shortly.

Overall, this new section of ‘simple’ parameters should be taken as experimental. The output can be turned off by specifying the analyze -simple=false command-line option explicitly. As always, I greatly appreciate your feedback.



Identification of multiplets in DSSR

In DSSR (and find_pair -p from the original 3DNA suite), a multiplet is defined as “three or more bases associated in a coplanar geometry via a network of hydrogen-bonding interactions. Multiplets are identified through inter-connected base pairs, filtered by pair-wise stacking interactions and vertical separations to ensure overall coplanarity.”

DSSR detects multiplets automatically, and outputs a corresponding MODEL/ENDMDL delineated PDB file (dssr-multiplets.pdb by default) where each multiplet is laid in the most extended view in terms of base planes. The DSSR Nucleic Acids Research (NAR) paper contains four examples (in supplemental Figures 1, 3, 4, and 7) to illustrate this functionality. Please refer to Reproducing results published in the DSSR-NAR paper on the 3DNA Forum for details.
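The "inter-connected base pairs" part of the definition can be read as a connected-components problem on a graph whose edges are base pairs. The Python sketch below illustrates only that step, borrowing the six base pairs of the 4u27 hexaplet analyzed later in this post; the stacking and vertical-separation filters are left out, and the function name connected_groups is hypothetical rather than anything from DSSR.

```python
from collections import defaultdict

# The six base pairs DSSR lists for the 4u27 hexaplet (short id strings)
PAIRS = [("G404", "A498"), ("G404", "A499"), ("U438", "A496"),
         ("U439", "A496"), ("U439", "A498"), ("A496", "A498")]


def connected_groups(pairs, min_size=3):
    """Group nts linked through base pairs; keep groups with >= min_size members."""
    neighbors = defaultdict(set)
    for nt1, nt2 in pairs:
        neighbors[nt1].add(nt2)
        neighbors[nt2].add(nt1)
    seen, groups = set(), []
    for start in neighbors:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:                      # depth-first walk over paired nts
            nt = stack.pop()
            if nt in group:
                continue
            group.add(nt)
            stack.extend(neighbors[nt] - group)
        seen |= group
        # Sort by residue number; assumes ids of the form <letter><number>.
        groups.append(sorted(group, key=lambda s: int(s[1:])))
    return [g for g in groups if len(g) >= min_size]


print(connected_groups(PAIRS))
```

For these six pairs the result is a single group of six nts, in the same order (G404, U438, U439, A496, A498, A499) that DSSR reports for the hexaplet.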

Recently, I read the article titled InterRNA: a database of base interactions in RNA structures by Appasamy et al. in NAR. In Figure 2 (linked below) of the paper, the authors showcased a sextuple (hexaplet) identified in the E. coli ribosome (PDB id: 4tpe), along with six base-base H-bonds contained therein.

Hexaplet GUUAAA in 4tpe
Figure 2. Example of the user interface displaying an InterRNA database record.

With interest, I tried to run DSSR on the PDB entry 4tpe. As it turns out, ‘4tpe’ has been merged into 4u27 in mmCIF format. I ran DSSR (v1.4.6-2015dec16) in its default settings on ‘4u27’ and got the following summary of results.

# x3dna-dssr -i=4u27.cif -o=4u27.out
    total number of base pairs: 4822
    total number of multiplets: 680
    total number of helices: 264
    total number of stems: 566
    total number of isolated WC/wobble pairs: 193
    total number of atom-base capping interactions: 615
    total number of hairpin loops: 215
    total number of bulges: 137
    total number of internal loops: 244
    total number of junctions: 108
    total number of non-loop single-stranded segments: 83
    total number of kissing loops: 14
    total number of A-minor (type I and II) motifs: 246
    total number of ribose zippers: 127
    total number of kink turns: 15

Among the 680 DSSR-identified multiplets, two hexaplets (one on chain “AA”, and another on “CA”) match those reported by Appasamy et al., as shown below:

 678 nts=6 GUUAAA 1:AA.G404,1:AA.U438,1:AA.U439,1:AA.A496,1:AA.A498,1:AA.A499
 679 nts=6 GUUAAA 1:CA.G404,1:CA.U438,1:CA.U439,1:CA.A496,1:CA.A498,1:CA.A499

For illustration, the hexaplet #678 is extracted from dssr-multiplets.pdb to file 4u27-hexaplet.pdb (download the coordinates) and shown below. The figure is generated by DSSR and PyMOL, as detailed in Reproducing results published in the DSSR-NAR paper on the 3DNA Forum.

x3dna-dssr -i=4u27-hexaplet.pdb -o=4u27-hexaplet.pml --hbfile-pymol 

Hexaplet GUUAAA in 4u27
DSSR-identified hexaplet GUUAAA in 4u27.

DSSR identifies 6 base pairs in the hexaplet:

# x3dna-dssr -i=4u27-hexaplet.pdb --idstr=short
List of 6 base pairs
      nt1            nt2           bp  name        Saenger    LW  DSSR
   1 G404           A498           G+A --          n/a       tSS  tm+m
   2 G404           A499           G+A --          n/a       cWH  cW+M
   3 U438           A496           U-A rHoogsteen  24-XXIV   tWH  tW-M
   4 U439           A496           U-A --          n/a       cH.  cM-.
   5 U439           A498           U-A WC          20-XX     cWW  cW-W
   6 A496           A498           A+A --          n/a       cWH  cW+M

DSSR detects a total of 9 H-bonds as shown below. In addition to the 6 base-base H-bonds noted by Appasamy et al., DSSR also finds 3 sugar-base H-bonds (#1, #2, and #4, labeled in green) that obviously play a role in stabilizing the higher-order base association.

# x3dna-dssr -i=4u27-hexaplet.pdb --get-hbonds --idstr=short
   11    59  #1     o    3.017 O:N O2'@G404 N3@U439
   11   104  #2     o    2.578 O:N O2'@G404 N1@A498
   18   125  #3     p    3.089 O:N O6@G404 N6@A499
   21    96  #4     o    3.289 N:O N2@G404 O2'@A498
   21   106  #5     p    2.797 N:N N2@G404 N3@A498
   39    78  #6     p    2.944 N:N N3@U438 N7@A496
   61    81  #7     p    3.167 O:N O4@U439 N6@A496
   61   103  #8     p    2.662 O:N O4@U439 N6@A498
   82   103  #9     p    3.152 N:N N1@A496 N6@A498
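The split into base-base and sugar-base H-bonds can be reproduced from the listing above with a few lines of Python. This is a sketch that relies only on the visible layout (whitespace-separated fields, with the two atom@nt items last) and on the convention that primed atom names belong to the sugar; the function name split_hbonds is hypothetical.

```python
HBONDS = """\
   11    59  #1     o    3.017 O:N O2'@G404 N3@U439
   11   104  #2     o    2.578 O:N O2'@G404 N1@A498
   18   125  #3     p    3.089 O:N O6@G404 N6@A499
   21    96  #4     o    3.289 N:O N2@G404 O2'@A498
   21   106  #5     p    2.797 N:N N2@G404 N3@A498
   39    78  #6     p    2.944 N:N N3@U438 N7@A496
   61    81  #7     p    3.167 O:N O4@U439 N6@A496
   61   103  #8     p    2.662 O:N O4@U439 N6@A498
   82   103  #9     p    3.152 N:N N1@A496 N6@A498
"""


def split_hbonds(listing):
    """Return (labels of H-bonds involving a sugar atom, labels of the rest)."""
    with_sugar, base_base = [], []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue
        label = fields[2]
        atoms = [item.split("@")[0] for item in fields[6:8]]
        # A trailing prime (e.g. O2') marks an atom of the sugar moiety.
        if any(name.endswith("'") for name in atoms):
            with_sugar.append(label)
        else:
            base_base.append(label)
    return with_sugar, base_base


sugar, base = split_hbonds(HBONDS)
print(sugar, len(base))  # ['#1', '#2', '#4'] 6
```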



Quality control of DSSR (3DNA) source code

Over the years, I have played with quite a few computer programming languages. ANSI C has become my top choice for ‘serious’ software projects, due to its small size, efficiency, flexibility, and ubiquitous support. Moreover, C is a mature language, with a rich ecosystem. As it turns out, C has also been consistently rated as one of the most popular computer languages (#1 or #2) over the past thirty years.

Needless to say, ANSI C has its own quirks, and it has a steep learning curve. However, once you get over the hurdles, the language serves you. I cannot remember when, but it has been a long while since coding in ANSI C was an issue for me. It is the understanding of scientific questions that takes most of my time, and coding helps greatly in refining my thoughts.

Not surprisingly, ANSI C was chosen as the sole language for DSSR (and SNAP, or 3DNA in general). To ensure the overall quality of the DSSR codebase, I have taken the following steps:

  • The whole project is under git.
  • The ANSI C source code is compiled with strict GCC options for full compliance to the standard:
-ansi -pedantic -W -Wall -Wextra -Wunused -Wshadow -Werror -O3
  • The executable is checked with valgrind for any memory leak:
valgrind --leak-check=full x3dna-dssr -i=1ehz.pdb -o=1ehz.out --quiet
==19624== Memcheck, a memory error detector
==19624== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==19624== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==19624== Command: x3dna-dssr -i=1ehz.pdb -o=1ehz.out --quiet
==19624== HEAP SUMMARY:
==19624==     in use at exit: 0 bytes in 0 blocks
==19624==   total heap usage: 52,829 allocs, 52,829 frees, 92,878,578 bytes allocated
==19624== All heap blocks were freed -- no leaks are possible
==19624== For counts of detected and suppressed errors, rerun with: -v
==19624== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
  • Extensive tests (with the simple diff command) to ensure the program is working as expected.
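The last two steps lend themselves to automation. The Python sketch below shows one possible shape for such checks, not the test harness actually used for DSSR: one helper compares new output with a stored reference in the spirit of diff, and the other reads the "total heap usage" line of a valgrind log. Both function names are hypothetical, and the changed base-pair count in the example is made up.

```python
import difflib
import re


def diff_against_reference(reference_text, new_text):
    """Return unified-diff lines; an empty list means the output is unchanged."""
    return list(difflib.unified_diff(
        reference_text.splitlines(), new_text.splitlines(),
        fromfile="reference", tofile="new", lineterm=""))


def heap_is_balanced(valgrind_log):
    """True if the 'total heap usage' line reports as many frees as allocs."""
    match = re.search(r"total heap usage: ([\d,]+) allocs, ([\d,]+) frees",
                      valgrind_log)
    return bool(match) and match.group(1) == match.group(2)


log_line = ("==19624==   total heap usage: 52,829 allocs, 52,829 frees, "
            "92,878,578 bytes allocated")
print(heap_is_balanced(log_line))

# Made-up example of a regression: the pair count changes between two runs.
for line in diff_against_reference("List of 12 base pairs\n",
                                   "List of 13 base pairs\n"):
    print(line)
```

In practice the two text arguments would be read from a stored reference file and a freshly generated output file, such as the 1ehz.out written by the command shown above.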

The above four measures combined allow me to add new features, refactor the code, and fix bugs, without worrying about accidentally breaking existing functionality. Reading literature (including citations to 3DNA/DSSR) and responding to user feedback on the 3DNA Forum keep me continuously improving DSSR. Some of the recent refinements to DSSR came about this way.



First mention of SNAP in a peer-reviewed journal article

The other day, I came across an article titled Different duplex/quadruplex junctions determine the properties of anti-thrombin aptamers with mixed folding by Krauss et al. published in Nucleic Acids Research (NAR). This NAR article drew my attention via a Google Scholar alert because of its citation to the 2008 3DNA Nature Protocols paper, as shown below (in the Structural analysis section):

3DNA-dssr (41) was used to calculate local and overall geometric parameters of the aptamer. Superpose program from CCP4 package (42) was used to calculate root mean square deviations. Features of the thrombin–RE31 interface were calculated using Cocomaps server (43), whereas contacts between the two molecules, as well as packing interactions between the aptamer and symmetry related thrombin molecules, were found by using 3DNA-snap (41) and Pisa (44) programs. All the results were verified by visual inspection of the structure with WinCoot (39).

Moreover, Table 2 lists Stacking interactions as calculated by 3DNA-DSSR (41) among residues belonging to the duplex, the junction and the quadruplex of RE31, with a note on the definition of base-stacking interactions:

Base-stacking is quantified as the area of the overlapped polygon defined by the two bases of the interacting nucleotides, where the base atoms are projected onto the mean base plane.

To the best of my knowledge, this is the first time SNAP is mentioned in a peer-reviewed journal article. This paper also made good use of DSSR for the analysis of a complicated DNA structure (like RNA), with three non-canonical base pairs at the duplex/quadruplex junction (Figure 3) and extensive stacking interactions (Figure 4).

The duplex/quadruplex junction in RE31 aptamer.
Figure 3. The duplex/quadruplex junction in RE31 aptamer.

Ribbon representation of RE31 highlighting the continuous stacking of bases from the duplex to the quadruplex region.
Figure 4. Ribbon representation of RE31 highlighting the continuous stacking of bases from the duplex to the quadruplex region.

As this paper and those by Paul Paukstelis illustrate, DNA can adopt far more complicated 3D structures, enabled by non-canonical base pairing schemes, than the simple Watson-Crick paired double helices. 3DNA (including DSSR and SNAP) is well suited for the analysis of such extraordinary structures. From a different perspective, following 3DNA citations has become an effective way for me to keep pace with the relevant literature.



First citation to the DSSR NAR paper in JMB

Recently, I noticed via Google Scholar the first citation to the paper DSSR, an integrated software tool for dissecting the spatial structure of RNA, published in Nucleic Acids Research (NAR). The citation is from Srinivas Somarowthu, in a review article titled Progress and current challenges in modeling large RNAs in the Journal of Molecular Biology. The JMB review article is concise and, overall, a nice read.

Specifically, in the section “Model Evaluation and Refinement”, DSSR is listed along with RNAView and MC-Annotate for the characterization of the secondary structure from 3D atomic coordinates, as below:

After building a model, it is essential to evaluate the quality, find any errors and refine the accordingly. First, it is important to make sure that all the base-pairs and the overall secondary structure is maintained correctly in the model. Tools such as RNAview [82], MC-Annotate [83], and DSSR [84] can calculate the secondary structure from a given 3D structure and thereby allow identification of problematic base-pairs. Recently, Antczak et al [85], developed a web server, RNApdbee, which integrates RNAview, MC-Annotate and DSSR, and extracts not only secondary structures but also kissing-loops and pseudoknots from a target tertiary model. Problematic base pairs can be fixed or rebuilt using interactive tools such as S2S/ASSEMBLE [45].

I am glad to see the first citation to the 2015 DSSR paper per se shortly after its publication in NAR. Looking forward, I can only expect more DSSR citations in diverse fields related to RNA structures.



Metallo-base pairs can be identified by DSSR

Recently, I became aware of the metallo-base pairs, such as T-Hg-T (PDB id: 4l24) and C-Ag-C (5ay2) from the work of Kondo et al (Pubmed: 24478025 and 26448329). As of v1.4.3-2015oct23, DSSR can detect such metallo-bps automatically, as shown below:

# x3dna-dssr -i=4l24.pdb -o=4l24.out
List of 12 base pairs
      nt1            nt2           bp  name        Saenger    LW  DSSR
   1 A.DC1          B.DG24         C-G WC          19-XIX    cWW  cW-W
   2 A.DG2          B.DC23         G-C WC          19-XIX    cWW  cW-W
   3 A.DC3          B.DG22         C-G WC          19-XIX    cWW  cW-W
   4 A.DG4          B.DC21         G-C WC          19-XIX    cWW  cW-W
   5 A.DA5          B.DT20         A-T WC          20-XX     cWW  cW-W
   6 A.DT6          B.DT19         T-T Metal       n/a       cWW  cW-W
   7 A.DT7          B.DT18         T-T Metal       n/a       cWW  cW-W
   8 A.DT8          B.DA17         T-A WC          20-XX     cWW  cW-W
   9 A.DC9          B.DG16         C-G WC          19-XIX    cWW  cW-W
  10 A.DG10         B.DC15         G-C WC          19-XIX    cWW  cW-W
  11 A.DC11         B.DG14         C-G WC          19-XIX    cWW  cW-W
  12 A.DG12         B.DC13         G-C WC          19-XIX    cWW  cW-W


# x3dna-dssr -i=5ay2.pdb -o=5ay2.out
List of 24 base pairs
      nt1            nt2           bp  name        Saenger    LW  DSSR
   1 A.G1           B.C12          G-C WC          19-XIX    cWW  cW-W
   2 A.G2           B.C11          G-C WC          19-XIX    cWW  cW-W
   3 A.A3           B.U10          A-U WC          20-XX     cWW  cW-W
   4 A.C4           B.C9           C-C Metal       n/a       cWW  cW-W
   5 A.U5           B.A8           U-A WC          20-XX     cWW  cW-W
   6 A.CBR6         B.G7           c-G WC          19-XIX    cWW  cW-W
   7 A.G7           B.CBR6         G-c WC          19-XIX    cWW  cW-W
   8 A.A8           B.U5           A-U WC          20-XX     cWW  cW-W
   9 A.C9           B.C4           C-C Metal       n/a       cWW  cW-W
  10 A.U10          B.A3           U-A WC          20-XX     cWW  cW-W
  11 A.C11          B.G2           C-G WC          19-XIX    cWW  cW-W
  12 A.C12          B.G1           C-G WC          19-XIX    cWW  cW-W
  13 C.G1           D.C12          G-C WC          19-XIX    cWW  cW-W
  14 C.G2           D.C11          G-C WC          19-XIX    cWW  cW-W
  15 C.A3           D.U10          A-U WC          20-XX     cWW  cW-W
  16 C.C4           D.C9           C-C Metal       n/a       cWW  cW-W
  17 C.U5           D.A8           U-A WC          20-XX     cWW  cW-W
  18 C.CBR6         D.G7           c-G WC          19-XIX    cWW  cW-W
  19 C.G7           D.CBR6         G-c WC          19-XIX    cWW  cW-W
  20 C.A8           D.U5           A-U WC          20-XX     cWW  cW-W
  21 C.C9           D.C4           C-C Metal       n/a       cWW  cW-W
  22 C.U10          D.A3           U-A WC          20-XX     cWW  cW-W
  23 C.C11          D.G2           C-G WC          19-XIX    cWW  cW-W
  24 C.C12          D.G1           C-G WC          19-XIX    cWW  cW-W

Note the name “Metal” for the metallo-bps. Moreover, the corresponding entries in the ‘dssr-pairs.pdb’ file also include the metal ions, as shown below:

Metallo T-Hg-T base pair (PDB id: 4l24)
Metallo C-Ag-C base pair (PDB id: 5ay2)

It is worth noting that in a metallo-bp, the metal ion lies approximately in the bp plane. Moreover, it is in the middle of the two bases, which would otherwise not form a pair in the conventional sense.
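For scripted use, the “Metal” name makes metallo-bps easy to pick out of the base-pair listing. The Python sketch below does so for an excerpt of the 4l24 output above. It assumes the whitespace-separated column layout shown in this post (index, nt1, nt2, bp, name, ...); the function name metallo_pairs is hypothetical, and a structured output format would be a more robust source than the plain-text listing.

```python
# Excerpt (pairs 5-8) of the 4l24 base-pair listing shown above
LISTING = """\
   5 A.DA5          B.DT20         A-T WC          20-XX     cWW  cW-W
   6 A.DT6          B.DT19         T-T Metal       n/a       cWW  cW-W
   7 A.DT7          B.DT18         T-T Metal       n/a       cWW  cW-W
   8 A.DT8          B.DA17         T-A WC          20-XX     cWW  cW-W
"""


def metallo_pairs(listing):
    """Return (nt1, nt2) tuples for rows whose name column reads 'Metal'."""
    hits = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with the serial number; the name is the fifth field.
        if len(fields) >= 5 and fields[0].isdigit() and fields[4] == "Metal":
            hits.append((fields[1], fields[2]))
    return hits


print(metallo_pairs(LISTING))
# [('A.DT6', 'B.DT19'), ('A.DT7', 'B.DT18')]
```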



Citation statistics to 3DNA publications

On October 29, 2015, I performed a survey of citations to the following three 3DNA papers, using the Web of Science. The total number of citations is: NAR03 (787) + NP08 (184) + NAR09 (78) = 1049, spanning a diverse set of 191 journals in biology, chemistry, and material sciences. On the same date, Google Scholar reported 1360 citations for the same three papers.

  1. [NAR03] Lu, Xiang‐Jun, and Wilma K. Olson. “3DNA: a software package for the analysis, rebuilding and visualization of three‐dimensional nucleic acid structures.” Nucleic acids research 31.17 (2003): 5108-5121.
  2. [NP08] Lu, Xiang-Jun, and Wilma K. Olson. “3DNA: a versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures.” Nature protocols 3.7 (2008): 1213-1227.
  3. [NAR09] Zheng, Guohui, Xiang-Jun Lu, and Wilma K. Olson. “Web 3DNA—a web server for the analysis, reconstruction, and visualization of three-dimensional nucleic-acid structures.” Nucleic acids research 37.suppl 2 (2009): W240-W246.

Among the 1049 citations in 191 journals, 694 citations (66%) are from the following 24 journals (~13%). The remaining 355 citations are from 167 other journals, including Cell (5 times), Science (2), Nature (3) and six additional Nature Publishing Group sub-journals (17).

1 Nucleic Acids Res (167)
2 J Phys Chem B (64)
3 Biochemistry (45)
4 J Am Chem Soc (45)
5 J Mol Biol (41)
6 Phys Chem Chem Phys (28)
7 Biophys J (25)
8 J Biol Chem (23)
9 PLoS One (23)
10 Acta Crystallogr D Biol Crystallogr (22)
11 J Chem Theory Comput (22)
12 Proc Natl Acad Sci U S A (22)
13 Bioinformatics (18)
14 Biopolymers (18)
15 J Biomol Struct Dyn (18)
16 J Chem Phys (18)
17 J Phys Chem A (16)
18 Structure (13)
19 RNA (12)
20 Biochem Biophys Res Commun (11)
21 Chem Res Toxicol (11)
22 J Comput Chem (11)
23 J Mol Model (11)
24 Nat Struct Mol Biol (10)
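The totals and percentages quoted above follow directly from the per-paper and per-journal counts, as this small Python check shows (all numbers are copied from this post).

```python
per_paper = {"NAR03": 787, "NP08": 184, "NAR09": 78}
# Citation counts of the 24 journals listed above, in the same order
top24 = [167, 64, 45, 45, 41, 28, 25, 23, 23, 22, 22, 22,
         18, 18, 18, 18, 16, 13, 12, 11, 11, 11, 11, 10]
n_journals = 191

total = sum(per_paper.values())
top = sum(top24)
print(total, top, total - top, n_journals - len(top24))  # 1049 694 355 167
print(f"{100 * top / total:.0f}% of citations, "
      f"{100 * len(top24) / n_journals:.0f}% of journals")  # 66% and 13%
```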

It is worth noting that while the Web of Science citation report is comprehensive, it is certainly not complete. In particular, citations in the online methods section seem not to be covered. For example, two 3DNA citations (on the DSSR program) in “Materials and Methods” (the Supplementary Materials) of two Science articles by the Ramakrishnan lab are missing from the list. Specifically, the Science papers employed DSSR for the characterization of RNA secondary structural features in crystal structures of the large ribosomal subunit and the whole ribosome of human mitochondria.

For those who are interested in knowing the details, click the link for the full reports of 3DNA citations. In the file, the citations are sorted in two ways: by citation numbers per journal, and by journal names.



Jmol and DSSR

From the Jmol mailing list, I noticed Jmol 14.4.0 was released yesterday (October 13, 2015) by Dr. Bob Hanson. Among the development highlights is the following item:

biomolecule annotations including DSSR, RNA3D, EBI sequence domains, and PDB validation data

I am glad to see that DSSR has been integrated into Jmol, one of the most popular molecular graphics visualization programs. To enable easy access to the DSSR functionality from Jmol, I’ve set up two websites with easy-to-remember URLs: and They both point to the same jsmol/ folder extracted from the Jmol distribution.

In retrospect, I first met Bob at the Workshop on the PDBx/mmCIF Data Exchange Format for Structural Biology held at Rutgers University during October 21-22, 2013. I approached him during a lunch break, asking for a possible collaboration on integrating DSSR into Jmol. The name DSSR may have played a role in convincing Bob, since it matches the well-known DSSP program for proteins. In the end, we were both excited about the project, went into the details after the meeting, and continued our conversation the next morning while I drove him to the airport.

Nothing real happened until early April 2014. Once we got started, however, we moved forward rapidly: it took less than three weeks to get the first functional version ready for the community to play with. See Bob’s announcement RNA/DNA Secondary Structure, anyone? on the Jmol mailing list on April 9, 2014. During this process, we communicated extensively via email, up to 30 messages per day, on technical details for better communication between the two programs. The integration works by using Jmol as a front-end, which calls a web service hosted at Columbia University for DSSR analysis. Jmol’s parsing of the DSSR output is facilitated by the dedicated --jmol option.

The above preliminary, yet functional, DSSR-Jmol integration had been in service without infrastructural changes until two months ago. On August 10, 2015, Bob contacted me:

I might make a significant request though. That would be for the server to deliver all this in JSON format. This is really the way to go. It is what people want and it is perfect for Jmol as well.

I’d played around with JSON or SQLite as a structured data exchange format for quite some time, and Bob’s request finally convinced me that JSON is the (better) way to go. That began another round of intensive collaborative work that has switched the exchange format between DSSR and Jmol from plain text output to JSON. From August 10 to September 22, we exchanged over 170 emails, plus Skype. JSON has really simplified the lives of both parties, especially in the long run.

Overall, collaborating with Bob has been truly an enjoyable and rewarding experience. The DSSR-Jmol integration also serves as a concrete example of what can be achieved by two dedicated minds with complementary expertise.



Analyzing DNA/RNA structures with Curves+ and 3DNA

Curves+ and 3DNA are currently the most widely used programs for analyzing nucleic acid structures (predominantly double helices). As noted in my blog post, Curves+ vs 3DNA, these two programs also complement each other in terms of features. It thus makes sense to run both to get a better understanding of the DNA/RNA structures one is interested in.

Indeed, over the past few years, I have seen quite a few articles citing both 3DNA and Curves+. Listed below are three recent examples:

The helical parameters were measured with 3DNA33 and Curves+.34 The local helical parameters are defined with regard to base steps and without regard to a global axis.

Structure analysis. Helix, base and base pair parameters were calculated with 3DNA or curve+ software packages23,24.

The major global difference between the native and mixed backbone structures is that the RNA backbone is compressed or kinked in strands containing the modified linkage (Fig. 3 B and C, by CURVES) (30). … To compare the three RNA structures at a more detailed and local level, we calculated the base pair helical and step parameters for all three structures using the 3DNA software tools (31) (Fig. 4 and Table S2). [In the Results section]

For each snapshot, the structural parameters—including six base pair parameters, six local base pair step parameters, and pseudorotation angles for each nucleotide—were calculated using 3DNA (31). The two terminal base pairs are omitted for the 3DNA analysis, because they unwind frequently in the triple 2′-5′-linked duplex. [In the Materials and Methods section]

Reading through these papers, however, it is not clear to me if the authors took advantage of the find_pair -curves+ option in 3DNA, as detailed in Building a bridge between Curves+ and 3DNA. Hopefully, this post will help draw more attention to this connection between Curves+ and 3DNA.



DSSR --symmetry/--nmr options and MODEL/ENDMDL ensemble

Over the past couple of weeks, I’ve added two more DSSR options, --symmetry and --nmr, that are closely related to an ensemble of MODEL/ENDMDL-delineated structures in PDB files. However, there exist subtle differences between the two cases, and the usage of the same MODEL/ENDMDL ensemble format can be ambiguous to the uninitiated. This blog post aims to clarify the issues, using concrete examples.

The --symmetry option applies to X-ray crystal structures where an asymmetric unit represents only part of the whole biological assembly. In standard PDB format, the asymmetric unit contains instructions to produce crystallographic symmetry-related molecules. Nevertheless, biological assemblies are also provided by the PDB (or NDB), with coordinate files ending with .pdb1 or such. For example, the PDB entry 2d94 has the single-stranded sequence GGGCGCCC in its asymmetric unit (2d94.pdb). It is the biological assembly in file 2d94.pdb1 that contains the DNA double helix.

x3dna-dssr -i=2d94.pdb # no pairs found
x3dna-dssr -i=2d94.pdb1 # still no pairs found
x3dna-dssr -i=2d94.pdb1 --symm # 8 pairs found
x3dna-dssr -i=2d94.pdb --symm # no pairs found

As shown by the above examples, DSSR by default reads only the first model even when given the biological assembly file 2d94.pdb1. It is with --symmetry (abbreviated to --symm) explicitly specified that DSSR takes all models in the input biological assembly file into consideration. The last case also illustrates that DSSR does not generate crystallographic symmetry-related molecules. The --symm option simply informs DSSR to take all models, which already exist in the input file, into consideration.

On the other hand, the --nmr option is for auto-processing an ensemble of structures solved by solution NMR methods (or trajectories of molecular dynamics simulations). The key point here is that each of the MODEL/ENDMDL-delineated structures is independent and thus can be processed separately, even though they are obviously closely related. Using the PDB entry 2n2d as an example, here are some sample usages:

x3dna-dssr -i=2n2d.pdb -o=2n2d-first.out # only the first structure is processed
x3dna-dssr -i=2n2d.pdb --nmr -o=2n2d-all.out # all 10 structures are processed
x3dna-dssr -i=2n2d.pdb --nmr --json -o=2n2d-all.json # ibid., with output in JSON

Note that the NMR file is named 2n2d.pdb, and it contains 10 structures.

Interesting mixes show up when an X-ray biological assembly with multiple MODEL/ENDMDL entries is analyzed with --nmr, or an NMR entry is handled with --symmetry. Here are two such examples:

x3dna-dssr -i=2d94.pdb1 --nmr -o=temp # models 1 and 2 are handled separately
x3dna-dssr -i=2n2d.pdb --symm -o=temp # wrong -- does not make sense!

In summary, the --symmetry option is intended to treat symmetry-related molecules as a whole, as in a biological assembly of X-ray crystal structures. In contrast, the --nmr option aims to automate the analysis of each structure in a MODEL/ENDMDL-delineated ensemble, as in NMR structures or trajectories of MD simulations. The distinction between the two MODEL/ENDMDL usages is most clearly seen via a molecular visualization program: for example, check the figure below for 2d94.pdb1 (left) and 2n2d.pdb (right) when all frames are selected using Jmol.

2d94 (2 models) 2n2d (10 models)
biological assembly of a DNA duplex (2d94) solution structure of a DNA quadruplex (2n2d)



The 3DNA mutate_bases program is cited in Nature

It was a nice surprise to notice the following 3DNA citation in a Nature article, titled Selective small-molecule inhibition of an RNA structural element (doi:10.1038/nature15542). Moreover, the work came from Merck Research Laboratories, reporting a novel selective chemical modulator (ribocil) to repress riboswitch-mediated ribB gene expression and inhibit bacterial cell growth.

Homology modelling. A homology model of the E. coli FMN aptamer was constructed using program mutate_bases53 of the 3DNA package using the F. nucleatum impX riboswitch aptamer X-ray structure as the template and the FMN aptamer alignment of E. coli, F. nucleatum, P. aeruginosa and A. baumannii (Extended Data Fig. 5). All nucleotide insertions in the E. coli sequence were removed in the model (Extended Data Fig. 5). There are 34 base changes among the 111 nucleotides modelled. Base pairing when present remains consistent. Energy minimization at A92 was performed to avoid VDW clashes using Macromodel (Schrodinger, LLC).

In retrospect, the mutate_bases program was created in response to repeated requests from 3DNA users, initially mostly for modeling DNA-protein complexes. The program was first coded as a Perl script, and later rewritten in ANSI C for efficiency. Since v2.1, mutate_bases has become an essential component of 3DNA, on a par with find_pair, analyze, rebuild, fiber, etc. As I noted in the post documenting the program:

Overall, mutate_bases has been designed to solve the in silico base mutation problem in a practical sense: robust and efficient, getting its job done and then out of the way. The program can have many possible applications: in addition to perform base-pair mutations in DNA-protein complexes, it should also prove handy in RNA modeling and in providing initial structures for QM/MM/MD energy calculations, and in DNA/RNA modeling studies.

The Merck Nature paper is the first time ever that the 3DNA mutate_bases program has been put in the spotlight. Hopefully more such applications/citations will appear in the future as the community begins to appreciate the value of this little gem.



Parsing DSSR json output

JSON (JavaScript Object Notation) is a simple human-readable format that expresses data objects in name-value pairs. Over the years, it has surpassed XML to become the preferred data exchange format between applications. As a result, I’ve recently added the --json command-line option to DSSR to make its numerous derived parameters easily accessible.

The DSSR JSON output is contained in a compact one-line text string that may look cryptic to the uninitiated. Yet, with commonly available JSON parsers or libraries, it is straightforward to make sense of the DSSR JSON output. In this blogpost, I illustrate how to parse a DSSR-derived .json file via two command-line tools, jq and Underscore-CLI.

jq — lightweight and flexible command-line JSON processor

According to its website,

jq is like sed for JSON data – you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.

Moreover, like DSSR per se, “jq is written in portable C, and it has zero runtime dependencies.” Prebuilt binaries are available for Linux, OS X and Windows. So it is trivial to get jq up and running. The current stable version is 1.5, released on August 15, 2015.

Using the crystal structure of yeast phenylalanine tRNA (1ehz) as an example, here are some sample usages with DSSR-derived JSON output:

    # Pretty print JSON
x3dna-dssr -i=1ehz.pdb --json | jq .
    # Extract the top-level keys, in insertion order 
x3dna-dssr -i=1ehz.pdb --json | jq keys_unsorted
    # Extract parameters for nucleotides
x3dna-dssr -i=1ehz.pdb --json | jq .nts
    # Extract nucleotide id and its base reference frame
x3dna-dssr -i=1ehz.pdb --json | jq '.nts[] | (.nt_id, .frame)'
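
The same kind of extraction can be done from a scripting language with a JSON library. Here is a minimal Python sketch using only the standard json module. The sample text is hand-made to mimic the shape assumed by the jq filters above (a top-level "nts" list whose items carry "nt_id" and "frame"); a real run would read the text produced by x3dna-dssr --json, which has many more keys.

```python
import json

# Hand-made sample with the same shape as the "nts" entries used above;
# the numbers are the A.G1 frame of 1ehz, other fields are omitted.
sample_json = """
{"nts": [
  {"nt_id": "A.G1",
   "frame": {"origin": [53.757, 41.868, 52.93],
             "x_axis": [-0.259, -0.25, -0.933],
             "y_axis": [-0.543, 0.837, -0.073],
             "z_axis": [0.799, 0.488, -0.352]}}
]}
"""

data = json.loads(sample_json)

# Roughly what `jq keys_unsorted` reports: top-level keys in insertion order.
top_level_keys = list(data)

# Roughly what `jq '.nts[] | (.nt_id, .frame)'` walks through.
frames = {nt["nt_id"]: nt["frame"] for nt in data["nts"]}

print(top_level_keys)
print(frames["A.G1"]["z_axis"])
```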

Underscore-CLI — command-line utility-belt for hacking JSON and Javascript.

Underscore-CLI is built upon Node.js, and can be installed using the npm package manager. It is described as ‘the “swiss-army-knife” tool for processing JSON data – can be used as a simple pretty-printer, or as a full-powered Javascript command-line.’

Following the above examples illustrating jq, here are the corresponding commands for Underscore-CLI:

x3dna-dssr -i=1ehz.pdb --json | underscore print --color
x3dna-dssr -i=1ehz.pdb --json | underscore keys --color
x3dna-dssr -i=1ehz.pdb --json | underscore select .nts --color
x3dna-dssr -i=1ehz.pdb --json | underscore select .nts | underscore select '.nt_id, .frame' --color

jq or Underscore-CLI — which one to use?

As always, it depends. While jq feels more like a standard Unix utility (such as sed, awk, grep, etc.), Underscore-CLI is better integrated into the Javascript language. For simple applications such as parsing DSSR output, either jq or Underscore-CLI is more than sufficient.

I use jq most of the time, but resort to Underscore-CLI for its “smart whitespace”. Here is an example to illustrate the difference between the two:

# z-axis of A.G1 (1ehz) base reference frame
# jq output, split in 5 lines
    "z_axis": [
      0.799,
      0.488,
      -0.352
    ]
# Underscore-CLI, in a more-readable one line
    "z_axis": [0.799, 0.488, -0.352]



Simple base-pair parameters

Recently, I read with great interest an article titled A context-sensitive guide to RNA & DNA base-pair & base-stack geometry by Dr. Jane Richardson, published in CCN (Computational Crystallography Newsletter, 2015, 5, 42—49). Highlighted in the article are Buckle and Propeller twist (see bottom left of the figure below), two of the angular parameters that characterize base-pair (bp) non-planarity. Particularly, I was intrigued by the “Notes on measures and figures” at the end:

Base normals were constructed in Mage (Richardson 2001) and twist torsions and buckle angles were measured from them; propeller-twists were measured as dihedral angles around an axis between N1/9 atoms.

Schematic diagram of six rigid-body base-pair parameters

The Richardson CCN article prompted me to think more about an intuitive description of bp geometry that can be easily grasped by experimentalists, especially X-ray crystallographers or cryo-EM practitioners. Without worrying about model building as with the six rigid-body parameters, it is straightforward to come up with a new set of four ‘simple’ parameters (Shear, Stretch, Buckle and Propeller) with the following characteristics:

  • Each parameter can be positive or negative. For type M–N pairs (as in the canonical cases), Shear and Buckle reverse their signs when the two bases are swapped (i.e. counted as N–M instead of M–N). In all other cases, the signs of the parameters remain unchanged. See the DSSR paper for the definition of M+N vs M–N type of pairs.
  • Intuitive results for non-canonical pairs, even when Opening is ~180º.
  • Consistent definition between Shear/Buckle (x-axis) vs Stretch/Propeller (y-axis).
  • As in 3DNA and DSSR, Buckle^2 + Propeller^2 = interBase_angle^2. Either Buckle or Propeller can render the two base planes of a pair non-parallel. Combined together, they introduce a non-zero inter-base angle. By definition, each parameter should not be larger than the overall inter-base angle.

With the cartoon-block representation introduced in DSSR, base-stacking interactions and bp deformations (especially Buckle and Propeller) are immediately obvious. Two examples are illustrated in the figure below: one is the classic Dickerson B-DNA dodecamer (355d, DSSR output), and the other is the parallel double-stranded helix of poly(A) RNA (4jrd, DSSR output).

Dickerson B-DNA dodecamer (355d) in cartoon-block representation Parallel double-stranded helix of poly(A) RNA (4jrd)
DSSR Output for 355d DSSR Output for 4jrd

A portion of DSSR output for the B-DNA duplex 355d is shown below. Note that the first bp (at the bottom left in the figure above) has a Propeller of –17º (and a Buckle of +7º). As beautifully explained by Calladine et al. in their book Understanding DNA, The Molecule & How It Works, Watson-Crick pairs prefer to have negative Propeller in right-handed DNA double helices to improve same-strand base-stacking interactions. The average value of Propeller in A- and B-DNA crystal structures is around –11º (see Table 3 of the Olson et al. standard base reference frame paper).

     nt1            nt2           bp  name        Saenger    LW  DSSR
   1 A.DC1          B.DG24         C-G WC          19-XIX    cWW  cW-W
       [-105.9(anti) ~C2'-endo lambda=53.5] [-141.3(anti) ~C3'-endo lambda=52.7]
       d(C1'-C1')=10.71 d(N1-N9)=8.96 d(C6-C8)=9.88 tor(C1'-N1-N9-C1')=-21.4
       H-bonds[3]: "O2(carbonyl)-N2(amino)[2.83],N3-N1(imino)[2.90],N4(amino)-O6(carbonyl)[2.98]"
       interBase-angle=19  Simple-bpParams: Shear=0.28 Stretch=-0.13 Buckle=7.3 Propeller=-17.2
       bp-pars: [0.28    -0.14   0.07    6.93    -17.31  -0.61]
   2 A.DG2          B.DC23         G-C WC          19-XIX    cWW  cW-W
       [-85.4(anti) ~C2'-endo lambda=53.4] [-150.3(anti) ~C3'-endo lambda=55.4]
       d(C1'-C1')=10.61 d(N1-N9)=8.92 d(C6-C8)=9.83 tor(C1'-N1-N9-C1')=-21.7
       H-bonds[3]: "O6(carbonyl)-N4(amino)[2.91],N1(imino)-N3[2.88],N2(amino)-O2(carbonyl)[2.88]"
       interBase-angle=17  Simple-bpParams: Shear=-0.24 Stretch=-0.18 Buckle=9.0 Propeller=-14.5
       bp-pars: [-0.24   -0.18   0.49    9.34    -14.30  -2.08]

A portion of DSSR output for the parallel poly(A) RNA duplex 4jrd is shown below. Note that the values of ‘simple’ Propeller are positive for both bps #7 and #8. In contrast, the rigid-body bp parameters have their signs flipped when Opening switches from –179.56º for bp#7 to +179.23º for bp#8. This sign ‘ambiguity’ around 180º Opening could be confusing. Yet, all six bp parameters must be kept as they are for rigorous rebuilding, especially within a larger context than a bp per se. From the very beginning, 3DNA has adopted the convention of keeping angular parameters in the range of [–180º, +180º] instead of [0, 360º], allowing left-handed Z-DNA to have negative twist.

   7 A.A8           B.A7           A+A --          02-II     tHH  tM+M
       [-175.8(anti) ~C3'-endo lambda=10.2] [-172.7(anti) ~C3'-endo lambda=12.6]
       d(C1'-C1')=11.15 d(N1-N9)=8.29 d(C6-C8)=6.31 tor(C1'-N1-N9-C1')=160.1
       H-bonds[4]: "OP2-N6(amino)[2.97],N7-N6(amino)[2.97],N6(amino)-OP2[2.92],N6(amino)-N7[2.91]"
       interBase-angle=14  Simple-bpParams: Shear=-7.88 Stretch=0.66 Buckle=-7.8 Propeller=11.9
       bp-pars: [-6.00   5.15    -0.02   0.63    14.22   -179.56]
   8 A.A9           B.A8           A+A --          02-II     tHH  tM+M
       [-177.4(anti) ~C3'-endo lambda=12.4] [-175.8(anti) ~C3'-endo lambda=10.3]
       d(C1'-C1')=11.01 d(N1-N9)=8.15 d(C6-C8)=6.18 tor(C1'-N1-N9-C1')=158.5
       H-bonds[4]: "OP2-N6(amino)[2.93],N7-N6(amino)[2.88],N6(amino)-OP2[2.97],N6(amino)-N7[2.92]"
       interBase-angle=15  Simple-bpParams: Shear=-7.91 Stretch=0.56 Buckle=-7.0 Propeller=13.7
       bp-pars: [6.11    -5.06   -0.05   -2.26   -15.22  179.23]
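
The relation Buckle^2 + Propeller^2 = interBase_angle^2 listed among the characteristics of the ‘simple’ parameters can be checked against the four base pairs excerpted above. The Python sketch below only re-does the arithmetic on the printed (rounded) values; the labels such as "355d bp1" are made up for this illustration.

```python
import math

# (Buckle, Propeller, interBase-angle) copied from the DSSR excerpts above.
pairs = {
    "355d bp1": (7.3, -17.2, 19),
    "355d bp2": (9.0, -14.5, 17),
    "4jrd bp7": (-7.8, 11.9, 14),
    "4jrd bp8": (-7.0, 13.7, 15),
}

combined = {}
for name, (buckle, propeller, reported) in pairs.items():
    # sqrt(Buckle^2 + Propeller^2), to be compared with the reported angle
    combined[name] = math.hypot(buckle, propeller)
    print(f"{name}: sqrt(B^2 + P^2) = {combined[name]:.1f}, reported {reported}")
```

Within rounding, each combined value agrees with the reported interBase-angle, and neither |Buckle| nor |Propeller| exceeds it.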



Updates on

From early on, the domain and its related sub-domains (e.g., for the forum and the web-interface to DSSR) have been served via shared hosting. By and large, this simple arrangement has worked quite well. Over the years, though, I’ve gradually realized some of its inherent limitations. One is the limited resources available to the 3DNA-related websites. Another is the accessibility issue from countries like China.

To remedy such issues, I’ve recently moved the 3DNA Forum and the web-interface to DSSR to a dedicated web server at Columbia University. Moreover, a duplicate copy of the 3DNA homepage is now hosted at Columbia. The three new websites have been verified to be accessible directly from China.

These updates on not only ensure global accessibility to 3DNA/DSSR, but also allow for more web services to be made available.



Output of reference frames in DSSR JSON output

As of v1.3.3-2015sep03, DSSR outputs the reference frame of any base or base-pair (bp). With an explicit list of such reference frames, one can better understand how the 3DNA/DSSR bp parameters are calculated. Moreover, third-party bioinformatics tools can take advantage of the frames for further exploration of nucleic acid structures, including visualization.

Let’s use the G1–C72 bp (detailed below) in the yeast phenylalanine tRNA (1ehz) as an example:

1 A.G1           A.C72          G-C WC          19-XIX    cWW  cW-W

The standard base reference frame for A.G1 is:

  rsmd: 0.008,
  origin: [53.757, 41.868, 52.93],
  x_axis: [-0.259, -0.25, -0.933],
  y_axis: [-0.543, 0.837, -0.073],
  z_axis: [0.799, 0.488, -0.352]

And the one for A.C72 is:

  rsmd: 0.006,
  origin: [53.779, 42.132, 52.224],
  x_axis: [-0.402, -0.311, -0.861],
  y_axis: [0.451, -0.886, 0.109],
  z_axis: [-0.797, -0.345, 0.497]

The G1–C72 bp reference frame is:

  rsmd: null,
  origin: [53.768, 42, 52.577],
  x_axis: [-0.331, -0.283, -0.9],
  y_axis: [-0.497, 0.863, -0.089],
  z_axis: [0.802, 0.418, -0.427]

The beauty of the DSSR JSON output is that the above information can be extracted on the fly. For example, the following commands extract the above frames:

x3dna-dssr -i=1ehz.pdb --json | jq '.ntParams[] | select(.nt_id=="A.G1") | .frame'
x3dna-dssr -i=1ehz.pdb --json | jq '.ntParams[] | select(.nt_id=="A.C72") | .frame'
x3dna-dssr -i=1ehz.pdb --json --more | jq '.pairs[0].frame'

Note that in JSON, the array is 0-indexed, so the first bp (G1–C72) has an index of 0. In addition to jq, I also used underscore to pretty-print the frames.
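
With the three frames listed above, a couple of simple sanity checks become possible. The Python sketch below (an illustration, not DSSR code) tests that each frame is right-handed, i.e. z = x × y to within the three-decimal rounding, and that the printed bp origin coincides with the midpoint of the two base origins. The dictionary keys and helper names are made up for this example, and only the fields needed here are included.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def max_abs_diff(a, b):
    """Largest component-wise absolute difference between two vectors."""
    return max(abs(p - q) for p, q in zip(a, b))


# Values copied from the frames listed above.
frames = {
    "A.G1": {"origin": (53.757, 41.868, 52.93),
             "x_axis": (-0.259, -0.25, -0.933),
             "y_axis": (-0.543, 0.837, -0.073),
             "z_axis": (0.799, 0.488, -0.352)},
    "A.C72": {"origin": (53.779, 42.132, 52.224),
              "x_axis": (-0.402, -0.311, -0.861),
              "y_axis": (0.451, -0.886, 0.109),
              "z_axis": (-0.797, -0.345, 0.497)},
    "G1-C72": {"origin": (53.768, 42.0, 52.577),
               "x_axis": (-0.331, -0.283, -0.9),
               "y_axis": (-0.497, 0.863, -0.089),
               "z_axis": (0.802, 0.418, -0.427)},
}

# Right-handedness: x cross y should reproduce z up to rounding.
handedness_error = {
    name: max_abs_diff(cross(f["x_axis"], f["y_axis"]), f["z_axis"])
    for name, f in frames.items()
}

# Midpoint of the two base origins versus the printed bp origin.
midpoint = tuple((p + q) / 2
                 for p, q in zip(frames["A.G1"]["origin"],
                                 frames["A.C72"]["origin"]))
origin_error = max_abs_diff(midpoint, frames["G1-C72"]["origin"])

print(handedness_error)
print(midpoint, origin_error)
```

Note that the y- and z-axes of A.C72 point roughly opposite to those of A.G1, as expected for the second base of an antiparallel Watson-Crick pair.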



Quantifying base-pair geometry by six rigid-body parameters

Standard nitrogenous bases in DNA and RNA (A, C, G, T, and U) are aromatic compounds, each with a planar geometry. In the analyses of three-dimensional (3D) nucleic acid structures, the planar bases are normally taken as rigid bodies. The relative geometry of the two bases in a base pair (bp) can then be rigorously quantified by six rigid-body parameters (see figure below). The three translations along the x-, y- and z-axes are termed Shear, Stretch, and Stagger, respectively. The three corresponding rotations are called Buckle, Propeller (twist), and Opening.

Schematic diagram of six rigid-body base-pair parameters

3DNA is unique with its coupled analyze and rebuild programs. The former calculates six bp parameters given 3D atomic coordinates (in PDB or PDBx/mmCIF format), while the latter takes a set of such parameters to generate the corresponding structure. The rigor of the description can be easily verified in two equivalent ways: the close-to-zero root-mean-square deviation (RMSD) between the rebuilt structure and the original coordinates, after a least-squares superposition; or the identical six bp parameters when the rebuilt structure is analyzed.

As is often the case, a concrete example would make the point clear. Here I am using the reverse Hoogsteen (rHoogsteen) bp between U8 and A14 (see image below) in the yeast phenylalanine tRNA (1ehz) as an example. The PDB atomic coordinates of the U8–A14 rHoogsteen pair, excluding backbone atoms except for C1′, are stored in file 1ehz-U8-A14.pdb.

The reverse Hoogsteen U8–A14 base pair in tRNA (1ehz)

find_pair 1ehz-U8-A14.pdb stdout | analyze stdin
    # bp parameters in file '1ehz-U8-A14.out'
    # also generated 'bp_step.par' for rebuilding below
rebuild -atomic bp_step.par 1ehz-U8-A14-3DNA.pdb
    # rmsd is 0.044 Å between '1ehz-U8-A14.pdb' and '1ehz-U8-A14-3DNA.pdb'
find_pair 1ehz-U8-A14-3DNA.pdb stdout | analyze stdin
    # bp parameters of the rebuilt structure in '1ehz-U8-A14-3DNA.out'
rebuild -atomic bp_step.par 1ehz-U8-A14-3DNA-new.pdb
    # rmsd is 0 Å between '1ehz-U8-A14-3DNA.pdb' and '1ehz-U8-A14-3DNA-new.pdb'

Note that the above commands should be performed in order, since the file bp_step.par is overwritten after each analyze run. For your verification, here are the links to the five files:

The 0.044 Å rmsd between the original PDB coordinates in 1ehz-U8-A14.pdb and the 3DNA rebuilt structure in 1ehz-U8-A14-3DNA.pdb is due to the slight non-planarity of experimental bases. The rmsd is 0 between the two rounds of 3DNA rebuilt structures, 1ehz-U8-A14-3DNA.pdb and 1ehz-U8-A14-3DNA-new.pdb, as expected.

The bp parameters in 1ehz-U8-A14.out and 1ehz-U8-A14-3DNA.out are identical, as expected, and they are shown below.

Local base-pair parameters
     bp        Shear    Stretch   Stagger    Buckle  Propeller  Opening
    1 U-A       4.14     -1.91      0.77     -4.62     12.12   -103.09 

Running DSSR on 1ehz-U8-A14.pdb gives the following results. Note that the six bp parameters (last row, prefixed with bp-pars) are exactly the same as in 3DNA, so the two programs are consistent.

# x3dna-dssr -i=1ehz-U8-A14.pdb --more
List of 1 base pair
      nt1            nt2           bp  name        Saenger    LW  DSSR
   1 A.U8           A.A14          U-A rHoogsteen  24-XXIV   tWH  tW-M
       [n/a(n/a) ---- lambda=28.3] [n/a(n/a) ---- lambda=21.5]
       d(C1'-C1')=9.63 d(N1-N9)=7.06 d(C6-C8)=6.00 tor(C1'-N1-N9-C1')=174.4
       H-bonds[2]: "O2(carbonyl)-N6(amino)[3.00],N3(imino)-N7[2.74]"
       interBase-angle=12.97  Simple-bpParams: Shear=4.28 Stretch=1.55 Buckle=-11.8 Propeller=5.4
       bp-pars: [4.14    -1.91   0.77    -4.62   12.12   -103.09]
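
The agreement can also be confirmed programmatically. The Python sketch below parses the two text lines exactly as printed above (the 3DNA row and the DSSR bp-pars row) and compares the six numbers; the slicing assumes the layouts shown in this post and is not a general parser for either program's output.

```python
# Lines copied verbatim from the 3DNA and DSSR outputs shown above.
threedna_row = "    1 U-A       4.14     -1.91      0.77     -4.62     12.12   -103.09"
dssr_line = "       bp-pars: [4.14    -1.91   0.77    -4.62   12.12   -103.09]"

names = ["Shear", "Stretch", "Stagger", "Buckle", "Propeller", "Opening"]

# 3DNA row: skip the serial number and the bp label, keep the six numbers.
threedna_pars = [float(tok) for tok in threedna_row.split()[2:]]

# DSSR row: take whatever sits between the square brackets.
inside = dssr_line[dssr_line.index("[") + 1:dssr_line.index("]")]
dssr_pars = [float(tok) for tok in inside.split()]

for name, a, b in zip(names, threedna_pars, dssr_pars):
    print(f"{name:10s} 3DNA={a:8.2f} DSSR={b:8.2f}")
```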

As mentioned in the recent DSSR paper:

As in 3DNA (6,7), DSSR takes advantage of the six standard base-pair parameters––three translations (Shear, Stretch, Stagger) and three rotations (Buckle, Propeller, Opening)––to quantify the relative spatial position and orientation of any two interacting bases rigorously. Among the six parameters, only Shear, Stretch, and Opening are critical for characterizing different types of pairs. Buckle, Propeller and Stagger, on the other hand, describe the nonplanarity of a given pair (6). By virtue of the definition of the standard base reference frame, Shear, Stretch, and Opening are all close to zero for Watson-Crick pairs. Moreover, every other type of pair has a set of characteristic parameters. For example, the wobble G–U pair is characterized by an average Shear of –2.2 Å, and the Hoogsteen A+U pair is distinguished by a Stretch of approximately –3.5 Å and an Opening of near 66º.

In a follow-up post, I will talk about the “simple” bp parameters (Simple-bpParams in the above DSSR output list) recently introduced into DSSR — stay tuned!



DSSR output in JSON format

As of DSSR v1.3.0-2015aug27, the --json option is available for producing analysis results that are strictly compliant with the JSON data exchange format. The JSON file contains numerous DSSR-derived structural features, including those in the default main output, backbone torsions in dssr-torsions.txt, and a detailed list of hydrogen bonds.

According to the official JSON website:

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language… JSON is a text format that is completely language independent… These properties make JSON an ideal data-interchange language.

Indeed, the JSON output file makes DSSR readily accessible for integration with other bioinformatics tools or normal usages from the command line. Using the classic yeast phenylalanine tRNA 1ehz as an example (1ehz.pdb), let’s go over some simple use-cases. Note that the following examples take advantage of jq, a lightweight and flexible command-line JSON processor.

x3dna-dssr -i=1ehz.pdb --json -o=1ehz-dssr.json
jq . 1ehz-dssr.json  # reformatted for pretty output
x3dna-dssr -i=1ehz.pdb --json | jq .  # the above 2 steps combined

With 1ehz-dssr.json in hand, we can easily extract DSSR-derived structural features of interest:

jq .pairs 1ehz-dssr.json   # list of 34 pairs
jq .multiplets 1ehz-dssr.json  # list of 4 base triplets
jq .hbonds 1ehz-dssr.json  # list of hydrogen bonds
jq .helices 1ehz-dssr.json
jq .stems 1ehz-dssr.json
  # list of nucleotide parameters, including torsion angles and suites
jq .ntParams 1ehz-dssr.json
  # list of 14 modified nucleotides
jq '.ntParams[] | select(.is_modified)' 1ehz-dssr.json
  # select nucleotide id, delta torsion, sugar puckering and cluster of suite name
jq '.ntParams[] | {nt_id, delta, puckering, cluster}' 1ehz-dssr.json
  # same selection as above, but in 'Comma Separated Values' format
jq -r '.ntParams[] | [.nt_id, .delta, .puckering, .cluster] | @csv' 1ehz-dssr.json

Here is the result of running jq (v1.5) to select multiplets:

# jq .multiplets 1ehz-dssr.json
[
  {
    "index": 1,
    "num_nts": 3,
    "nts_short": "UAA",
    "nts_long": "A.U8,A.A14,A.A21"
  },
  {
    "index": 2,
    "num_nts": 3,
    "nts_short": "AUA",
    "nts_long": "A.A9,A.U12,A.A23"
  },
  {
    "index": 3,
    "num_nts": 3,
    "nts_short": "gCG",
    "nts_long": "A.2MG10,A.C25,A.G45"
  },
  {
    "index": 4,
    "num_nts": 3,
    "nts_short": "CGg",
    "nts_long": "A.C13,A.G22,A.7MG46"
  }
]

With the JSON file, DSSR can now be connected with the bioinformatics community in a ‘structured’ way, with a clearly delineated boundary. Now I can enjoy the freedom of refining the default main output format, without worrying too much about breaking third-party parsers. Moreover, I no longer need to write an adapter for each integration of DSSR with other tools. So nice!
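
As a small illustration of such third-party use, the multiplets listed above can be consumed from Python with the standard json and csv modules. This is a sketch under the assumption that the multiplets value is a JSON array of objects with the four keys shown; the data are typed in from the listing above rather than read from 1ehz-dssr.json.

```python
import csv
import io
import json

# The four multiplets of 1ehz, typed in from the jq listing above.
multiplets_json = """[
  {"index": 1, "num_nts": 3, "nts_short": "UAA", "nts_long": "A.U8,A.A14,A.A21"},
  {"index": 2, "num_nts": 3, "nts_short": "AUA", "nts_long": "A.A9,A.U12,A.A23"},
  {"index": 3, "num_nts": 3, "nts_short": "gCG", "nts_long": "A.2MG10,A.C25,A.G45"},
  {"index": 4, "num_nts": 3, "nts_short": "CGg", "nts_long": "A.C13,A.G22,A.7MG46"}
]"""
multiplets = json.loads(multiplets_json)

# Write a CSV table; fields containing commas are quoted automatically.
fields = ["index", "num_nts", "nts_short", "nts_long"]
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(fields)
for m in multiplets:
    writer.writerow([m[key] for key in fields])
csv_text = buffer.getvalue()
print(csv_text, end="")

# Split the comma-separated nucleotide ids into proper lists.
members = {m["index"]: m["nts_long"].split(",") for m in multiplets}
print(members[3])
```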

For your reference, here is the output file 1ehz-dssr.json. It is possible that the identifiers (names) of the JSON output will be refined in the next few iterations. I welcome your comments to make the DSSR-derived JSON better suit your needs.



DSSR has been published, and its results are reproducible

It is a great pleasure to note that a paper titled DSSR, an integrated software tool for dissecting the spatial structure of RNA has recently been published in Nucleic Acids Research (NAR). Co-authored by Harmen Bussemaker, Wilma Olson and me (a team with a unique combination of complementary expertise), this DSSR paper represents another solid piece of work that I feel proud of. In contrast to our previous GpU dinucleotide platform paper focusing on results, and the two major 3DNA papers concentrating on methods, the current NAR article describes significant scientific findings that are enabled by the novel analysis algorithms implemented in the program. Moreover, DSSR introduces an appealing and highly informative “cartoon-block” representation of RNA structures that combines PyMOL cartoon schematics with 3DNA base color-coded rectangular blocks.

The abstract of the paper is quoted below:

Insight into the three-dimensional architecture of RNA is essential for understanding its cellular functions. However, even the classic transfer RNA structure contains features that are overlooked by existing bioinformatics tools. Here we present DSSR (Dissecting the Spatial Structure of RNA), an integrated and automated tool for analyzing and annotating RNA tertiary structures. The software identifies canonical and noncanonical base pairs, including those with modified nucleotides, in any tautomeric or protonation state. DSSR detects higher-order coplanar base associations, termed multiplets. It finds arrays of stacked pairs, classifies them by base-pair identity and backbone connectivity, and distinguishes a stem of covalently connected canonical pairs from a helix of stacked pairs of arbitrary type/linkage. DSSR identifies coaxial stacking of multiple stems within a single helix and lists isolated canonical pairs that lie outside of a stem. The program characterizes ‘closed’ loops of various types (hairpin, bulge, internal, and junction loops) and pseudoknots of arbitrary complexity. Notably, DSSR employs isolated pairs and the ends of stems, whether pseudoknotted or not, to define junction loops. This new, inclusive definition provides a novel perspective on the spatial organization of RNA. Tests on all nucleic acid structures in the Protein Data Bank confirm the efficiency and robustness of the software, and applications to representative RNA molecules illustrate its unique features. DSSR and related materials are freely available at

During the review process, we were delighted that the referees confirmed the claim that we made in the cover letter: “We would also like to emphasize that our reported results are easily verifiable, and we assure rigorous reproducibility of the data and figures described in this article.” Now that the paper has been published, as a follow-up, I’ve made available all the scripts and data files associated with the paper in a new section DSSR-NAR paper on the 3DNA Forum. The DSSR User Manual has also been updated with additional, previously undocumented, auxiliary options.

Overall, it took me more than ten days to create the 19 posts in the DSSR-NAR paper section and to revise the DSSR User Manual, along with other minor refinements for consistency. During the process, I’ve tried to make the scripts and data files self-contained for wide accessibility and easy understanding.

Any interested party should now be able to reproduce the table and figures (including the supplementary data) reported in the article. Moreover, with the additional details given in the post RNA cartoon-block representations with PyMOL and DSSR, one can easily generate similar schematic images as shown below:

yeast phenylalanine tRNA (1ehz) with base blocks yeast phenylalanine tRNA (1ehz) with WC base-pair blocks
1msy: with the minor groove edge (black) of the C-G pair that closes the GUAA tetraloop facing the viewer 27-nt rRNA fragment with GUAA tetraloop (1msy) -- base blocks in outline

I feel confident in claiming that the results reported in our DSSR paper are reproducible. If you have issues related to the paper, please post them on the 3DNA Forum. I strive to respond promptly to any questions asked there.

In summary, DSSR is an integrated computational tool, designed from the bottom up to streamline the analysis of RNA three-dimensional structures. It is built upon my extensive experience in supporting 3DNA, growing knowledge of RNA structures, and refined programming skills. DSSR has a combined set of functionalities well beyond the scope of any known specialized resources. The program may well serve as a cornerstone for RNA structural bioinformatics and will benefit a broad range of possible applications.



Conformation of the sugar ring in nucleic acid structures

The conformation of the five-membered sugar ring in DNA/RNA structures can be characterized using the five corresponding endocyclic torsion angles (shown below).

Sugar torsion angles

v0: C4'-O4'-C1'-C2'
v1: O4'-C1'-C2'-C3'
v2: C1'-C2'-C3'-C4'
v3: C2'-C3'-C4'-O4'
v4: C3'-C4'-O4'-C1'

On account of the five-membered ring constraint, the conformation can be characterized approximately by 5 - 3 = 2 parameters. Using the concept of pseudorotation of the sugar ring, the two parameters are the amplitude (τm) and phase angle (P, in the range of 0° to 360°).

One set of widely used formulas to convert the five torsion angles to the pseudorotation parameters is due to Altona & Sundaralingam (1972): “Conformational Analysis of the Sugar Ring in Nucleosides and Nucleotides. A New Description Using the Concept of Pseudorotation” [J. Am. Chem. Soc., 94(23), pp 8205–8212]. As always, the concept is best illustrated with an example. Here I use the sugar ring of G4 (chain A) of the Dickerson-Drew dodecamer (1bna), with Matlab/Octave code:

# xyz coordinates of the sugar ring: G4 (chain A), 1bna
ATOM     63  C4'  DG A   4      21.393  16.960  18.505  1.00 53.00
ATOM     64  O4'  DG A   4      20.353  17.952  18.496  1.00 38.79
ATOM     65  C3'  DG A   4      21.264  16.229  17.176  1.00 56.72
ATOM     67  C2'  DG A   4      20.793  17.368  16.288  1.00 40.81
ATOM     68  C1'  DG A   4      19.716  17.901  17.218  1.00 30.52

# endocyclic torsion angles:
v0 = -26.7; v1 = 46.3; v2 = -47.1; v3 = 33.4; v4 = -4.4
Pconst = sin(pi/5) + sin(pi/2.5)  # 1.5388
P0 = atan2(v4 + v1 - v3 - v0, 2.0 * v2 * Pconst);  # 2.9034
tm = v2 / cos(P0);  # amplitude: 48.469
P = 180/pi * P0;  # phase angle: 166.35 [P + 360 if P0 < 0]
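
The five endocyclic torsion angles used above can be recomputed directly from the five ATOM records. The following Python sketch applies the standard atan2-based dihedral-angle formula to the listed coordinates; the helper names (sub, cross, dot, torsion) are made up for this illustration, and the code is not taken from 3DNA.

```python
import math

# Sugar-ring coordinates of G4 (chain A, 1bna), copied from the ATOM records.
ring = {
    "C4'": (21.393, 16.960, 18.505),
    "O4'": (20.353, 17.952, 18.496),
    "C3'": (21.264, 16.229, 17.176),
    "C2'": (20.793, 17.368, 16.288),
    "C1'": (19.716, 17.901, 17.218),
}

# Atom quadruplets defining v0..v4, as listed earlier in this post.
definitions = {
    "v0": ("C4'", "O4'", "C1'", "C2'"),
    "v1": ("O4'", "C1'", "C2'", "C3'"),
    "v2": ("C1'", "C2'", "C3'", "C4'"),
    "v3": ("C2'", "C3'", "C4'", "O4'"),
    "v4": ("C3'", "C4'", "O4'", "C1'"),
}


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def torsion(p1, p2, p3, p4):
    """Signed dihedral angle p1-p2-p3-p4 in degrees, in (-180, 180]."""
    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


torsions = {name: torsion(*(ring[atom] for atom in atoms))
            for name, atoms in definitions.items()}
for name, value in torsions.items():
    print(f"{name} = {value:6.1f}")
```

Rounded to one decimal, the output should match the v0 to v4 values used in the Octave snippet.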

The Altona & Sundaralingam (1972) pseudorotation parameters are what have been adopted in 3DNA, following the NewHelix program of Dr. Dickerson. The Curves+ program, on the other hand, uses another (newer) set of formulas due to Westhof & Sundaralingam (1983): “A Method for the Analysis of Puckering Disorder in Five-Membered Rings: The Relative Mobilities of Furanose and Proline Rings and Their Effects on Polynucleotide and Polypeptide Backbone Flexibility” [J. Am. Chem. Soc., 105(4), pp 970–976]. The two sets of formulas, by Altona & Sundaralingam (1972) and Westhof & Sundaralingam (1983), give slightly different numerical values for the two pseudorotation parameters (τm and P).

Since 3DNA and Curves+ are currently two of the most widely used programs for conformational analysis of nucleic acid structures, the subtle differences in pseudorotation parameters may cause confusion for users who use (or are familiar with) both programs. Over the past few years, I have indeed received such questions via email.

With the same G4 (chain A, 1bna) sugar ring, here is the Matlab/Octave script showing how Curves+ calculates the pseudorotation parameters:

# xyz coordinates of sugar ring G4 (chain A, 1bna)

# endocyclic torsion angles, same as above
v0 = -26.7; v1 = 46.3; v2 = -47.1; v3 = 33.4; v4 = -4.4

v = [v2, v3, v4, v0, v1]; # reorder them into vector v[]
A = 0; B = 0;
for i = 1:5
    t = 0.8 * pi * (i - 1);
    A += v(i) * cos(t);
    B += v(i) * sin(t);
end
A *= 0.4;   # -48.476
B *= -0.4;  # 11.516

tm = sqrt(A * A + B * B);  # 49.825

c = A/tm; s = B/tm;
P = atan2(s, c) * 180 / pi;  # 166.64

For this specific example, i.e., the sugar ring of G4 (chain A, 1bna), the pseudorotation parameters as calculated by 3DNA per Altona & Sundaralingam (1972) and Curves+ per Westhof & Sundaralingam (1983) are as follows:

           amplitude        phase angle
3DNA        48.469             166.35
Curves+     49.825             166.64
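
For readers who prefer Python over Matlab/Octave, the two sets of formulas can be put side by side. The sketch below is a direct transcription of the two snippets in this post, using the same rounded torsion angles; the function names altona72 and westhof83 are labels chosen for this illustration, not identifiers from 3DNA or Curves+.

```python
import math

# Endocyclic torsion angles of G4 (chain A, 1bna), as used above.
v0, v1, v2, v3, v4 = -26.7, 46.3, -47.1, 33.4, -4.4


def altona72(v0, v1, v2, v3, v4):
    """Amplitude and phase angle (degrees) per Altona & Sundaralingam (1972)."""
    p_const = math.sin(math.pi / 5) + math.sin(math.pi / 2.5)
    p0 = math.atan2(v4 + v1 - v3 - v0, 2.0 * v2 * p_const)
    tm = v2 / math.cos(p0)
    return tm, math.degrees(p0) % 360.0  # fold the phase into [0, 360)


def westhof83(v0, v1, v2, v3, v4):
    """Amplitude and phase angle (degrees) per Westhof & Sundaralingam (1983)."""
    v = [v2, v3, v4, v0, v1]  # same reordering as in the Octave script
    a = 0.4 * sum(vi * math.cos(0.8 * math.pi * i) for i, vi in enumerate(v))
    b = -0.4 * sum(vi * math.sin(0.8 * math.pi * i) for i, vi in enumerate(v))
    tm = math.hypot(a, b)
    return tm, math.degrees(math.atan2(b, a)) % 360.0


tm72, p72 = altona72(v0, v1, v2, v3, v4)
tm83, p83 = westhof83(v0, v1, v2, v3, v4)
print(f"Altona & Sundaralingam (1972):  tm = {tm72:.3f}  P = {p72:.2f}")
print(f"Westhof & Sundaralingam (1983): tm = {tm83:.3f}  P = {p83:.2f}")
```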

Needless to say, the differences are subtle, and few people will notice/bother at all. For those who do care about such little details, however, hopefully this post will help you understand where the differences actually come from.


For consistency with the 3DNA output, DSSR (by default) also follows the Altona & Sundaralingam (1972) definitions of sugar pseudorotation. Nevertheless, DSSR also contains an undocumented option, --sugar-pucker=westhof83, to output τm and P according to the Westhof & Sundaralingam (1983) definitions.

Each sugar is assigned to one of the following ten puckering modes, by dividing the phase angle (P, in the range of 0° to 360°) into ranges of 36° each.

C3'-endo, C4'-exo,   O4'-endo, C1'-exo,  C2'-endo,
C3'-exo,  C4'-endo,  O4'-exo,  C1'-endo, C2'-exo

For sugars in nucleic acid structures, C3’-endo [0°, 36°) and C2’-endo [144°, 180°) are predominant. The former corresponds to sugars in ‘canonical’ RNA or A-form DNA, and the latter to sugars in standard B-form DNA. In reality, RNA structures as deposited in the PDB could also contain C2′-endo sugars. One significant example is the GpU dinucleotide platform, where the 5′-ribose sugar (G) is in the C2′-endo form and the 3′-sugar (U) in the C3′-endo form; see my blog post titled ‘Is the O2′(G)…O2P H-bond in GpU platforms real?’.
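
The assignment of a puckering mode from the phase angle amounts to picking one of ten 36° bins. Here is a minimal Python sketch of that lookup, assuming the ten modes are taken in the order listed above starting from C3'-endo at [0°, 36°); the names PUCKER_MODES and pucker_mode are made up for this illustration.

```python
# The ten puckering modes, in order of increasing phase angle (36 degrees each).
PUCKER_MODES = [
    "C3'-endo", "C4'-exo", "O4'-endo", "C1'-exo", "C2'-endo",
    "C3'-exo", "C4'-endo", "O4'-exo", "C1'-endo", "C2'-exo",
]


def pucker_mode(phase_deg):
    """Return the puckering mode for a phase angle given in degrees."""
    phase = phase_deg % 360.0  # fold any angle into [0, 360)
    # The extra "% 10" guards against a float rounding up to exactly 360.0.
    return PUCKER_MODES[int(phase // 36.0) % 10]


print(pucker_mode(166.35))  # G4 (chain A, 1bna), discussed above
print(pucker_mode(18.0))    # a typical A-form value
```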



  • This post is based on my 2011-06-11 blog post with the same title.
  • While visiting Lyon in July 2014, I had the opportunity to hear Dr. Lavery’s opinion on adopting the Westhof & Sundaralingam (1983) sugar-pucker definitions in Curves+. I learned that the new formulas are more robust in rare, extreme cases of sugar conformation than the 1972 variants. After all, Dr. Sundaralingam is a co-author on both papers. It is possible that in future releases of DSSR, the 1983 formulas for sugar pucker will become the default.



The DSSR --prefix and --cleanup options

In the DSSR v1.2.7-2015jun09 release, I documented two additional command-line options (--prefix and --cleanup) that are related to the various auxiliary files. As a matter of fact, these two options (among quite a few others) have been there for a long time, but without being explicitly described. The point is not to hide but to simplify: one of the design goals of DSSR is simplicity. DSSR already possesses numerous key functionalities to be appreciated. Before DSSR is firmly established in the RNA bioinformatics field, I believe too many nonessential “features” could be distracting. While writing and refining the DSSR code, I do feel that some ‘auxiliary’ features could be handy for experienced users (including myself). So along the way, I’ve added many ‘hidden’ options that are either experimental or potentially useful.

On the one hand, I sense it is acceptable for scientific software to actually do more than it claims. On the other hand, I have always been quick in addressing users’ requests. As one example, check the --select option recently introduced into DSSR in response to a user request, and the ‘hidden’ --dbn-break option for specifying the symbol to separate multiple chains or chain breaks in DSSR-derived dot-bracket notation.

Back to --prefix and --cleanup, the purposes of these two closely related options can be best illustrated using the yeast phenylalanine tRNA structure (1ehz) as an example. By default, running x3dna-dssr -i=1ehz.pdb will produce a total of 11 auxiliary files, with names prefixed with dssr-, as shown below:

List of 11 additional files
   1 dssr-stems.pdb -- an ensemble of stems
   2 dssr-helices.pdb -- an ensemble of helices (coaxial stacking)
   3 dssr-pairs.pdb -- an ensemble of base pairs
   4 dssr-multiplets.pdb -- an ensemble of multiplets
   5 dssr-hairpins.pdb -- an ensemble of hairpin loops
   6 dssr-junctions.pdb -- an ensemble of junctions (multi-branch)
   7 dssr-2ndstrs.bpseq -- secondary structure in bpseq format
   8 dssr-2ndstrs.ct -- secondary structure in connect table format
   9 dssr-2ndstrs.dbn -- secondary structure in dot-bracket notation
  10 dssr-torsions.txt -- backbone torsion angles and suite names
  11 dssr-stacks.pdb -- an ensemble of stacks

With ‘fixed’ generic names by default, users can run DSSR in a directory repeatedly without creating too many files. This practice follows that used in the 3DNA suite of programs. However, my experience in supporting 3DNA over the years has shown that users (myself included) may want to explore some of the files further, e.g. ‘dssr-multiplets.pdb’ for displaying the base multiplets (four triplets here). One could easily use a command-line script to change a generic name to a more appropriate one: e.g., mv dssr-multiplets.pdb 1ehz-multiplets.pdb for 1ehz. A better solution, however, is to introduce a customized prefix to the additional files, and that’s exactly where the --prefix option comes in. The option is specified like this: --prefix=text where text can be any string as appropriate. So running x3dna-dssr -i=1ehz.pdb --prefix=1ehz, for example, will lead to the following output:

List of 11 additional files
   1 1ehz-stems.pdb -- an ensemble of stems
   2 1ehz-helices.pdb -- an ensemble of helices (coaxial stacking)
   3 1ehz-pairs.pdb -- an ensemble of base pairs
   4 1ehz-multiplets.pdb -- an ensemble of multiplets
   5 1ehz-hairpins.pdb -- an ensemble of hairpin loops
   6 1ehz-junctions.pdb -- an ensemble of junctions (multi-branch)
   7 1ehz-2ndstrs.bpseq -- secondary structure in bpseq format
   8 1ehz-2ndstrs.ct -- secondary structure in connect table format
   9 1ehz-2ndstrs.dbn -- secondary structure in dot-bracket notation
  10 1ehz-torsions.txt -- backbone torsion angles and suite names
  11 1ehz-stacks.pdb -- an ensemble of stacks

The --cleanup option, as its name implies, is to tidy up a directory by removing the auxiliary files generated by DSSR. The usage is very simple:

x3dna-dssr --cleanup
x3dna-dssr --cleanup --prefix=1ehz

The former gets rid of the default ‘fixed’ generic auxiliary files (dssr-pairs.pdb etc), whilst the latter deletes prefixed supporting files (1ehz-pairs.pdb etc).
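
The naming pattern in the two listings above can be sketched in a few lines of Python. This is only an illustration of how the prefix maps to file names: the suffixes are copied from the 1ehz listings, other structures may well produce a different set of files, and the names AUX_SUFFIXES and auxiliary_names are hypothetical, not part of DSSR.

```python
# Suffixes copied from the 1ehz listings above (11 auxiliary files).
AUX_SUFFIXES = [
    "stems.pdb", "helices.pdb", "pairs.pdb", "multiplets.pdb",
    "hairpins.pdb", "junctions.pdb", "2ndstrs.bpseq", "2ndstrs.ct",
    "2ndstrs.dbn", "torsions.txt", "stacks.pdb",
]


def auxiliary_names(prefix="dssr"):
    """Return the auxiliary file names expected for a given prefix."""
    return [f"{prefix}-{suffix}" for suffix in AUX_SUFFIXES]


for name in auxiliary_names("1ehz"):
    print(name)
```

Calling auxiliary_names() without an argument gives the default dssr-* names, i.e. the files targeted by a plain x3dna-dssr --cleanup in this example.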



Assignment of HETATM vs. ATOM records for modified nucleotides in PDB vs. PDBx/mmCIF format

Recently, I came across, and was surprised by, the different assignment of HETATM vs. ATOM records for modified nucleotides in PDB vs. PDBx/mmCIF format. As always, the issue is best illustrated with a concrete example. Here is what I observed in the PDB entry 1ehz, the crystal structure of yeast phenylalanine tRNA at 1.93 Å resolution.

DSSR identifies 14 modified nucleotides (of 11 types) in 1ehz as shown below:

List of 11 types of 14 modified nucleotides
      nt    count  list
   1 1MA-a    1    A.1MA58
   2 2MG-g    1    A.2MG10
   3 5MC-c    2    A.5MC40,A.5MC49
   4 5MU-t    1    A.5MU54
   5 7MG-g    1    A.7MG46
   6 H2U-u    2    A.H2U16,A.H2U17
   7 M2G-g    1    A.M2G26
   8 OMC-c    1    A.OMC32
   9 OMG-g    1    A.OMG34
  10 PSU-P    2    A.PSU39,A.PSU55
  11 YYG-g    1    A.YYG37

In file 1ehz.pdb downloaded from RCSB PDB, all the 14 modified nucleotides are assigned as HETATM whereas in 1ehz.cif the corresponding records are ATOM. Here is the excerpt for 1MA58 in PDB format:

HETATM 1252  P   1MA A  58      73.770  67.765  34.057  1.00 30.65           P  
HETATM 1253  OP1 1MA A  58      72.638  67.886  33.105  1.00 32.84           O  
HETATM 1254  OP2 1MA A  58      73.621  68.229  35.450  1.00 29.49           O  
HETATM 1255  O5' 1MA A  58      74.315  66.273  34.254  1.00 28.81           O  
HETATM 1256  C5' 1MA A  58      74.592  65.439  33.080  1.00 29.42           C  
HETATM 1257  C4' 1MA A  58      74.279  63.972  33.383  1.00 33.42           C  
HETATM 1258  O4' 1MA A  58      74.880  63.685  34.667  1.00 32.36           O  
HETATM 1259  C3' 1MA A  58      72.789  63.573  33.509  1.00 35.13           C  
HETATM 1260  O3' 1MA A  58      72.625  62.168  33.250  1.00 36.80           O  
HETATM 1261  C2' 1MA A  58      72.560  63.667  35.012  1.00 34.80           C  
HETATM 1262  O2' 1MA A  58      71.525  62.828  35.506  1.00 36.27           O  
HETATM 1263  C1' 1MA A  58      73.908  63.150  35.551  1.00 33.62           C  
HETATM 1264  N9  1MA A  58      74.284  63.494  36.930  1.00 30.36           N  
HETATM 1265  C8  1MA A  58      73.887  64.574  37.688  1.00 34.55           C  
HETATM 1266  N7  1MA A  58      74.415  64.610  38.899  1.00 33.32           N  
HETATM 1267  C5  1MA A  58      75.204  63.469  38.953  1.00 33.37           C  
HETATM 1268  C6  1MA A  58      76.031  62.941  39.948  1.00 33.58           C  
HETATM 1269  N6  1MA A  58      76.184  63.488  41.134  1.00 41.19           N  
HETATM 1270  N1  1MA A  58      76.708  61.803  39.669  1.00 34.48           N  
HETATM 1271  CM1 1MA A  58      77.649  61.222  40.626  1.00 31.43           C  
HETATM 1272  C2  1MA A  58      76.527  61.216  38.479  1.00 28.43           C  
HETATM 1273  N3  1MA A  58      75.793  61.624  37.453  1.00 31.67           N  
HETATM 1274  C4  1MA A  58      75.142  62.771  37.747  1.00 33.02           C  

The corresponding section in PDBx/mmCIF format is:

ATOM   1252 P  P     . 1MA A 1 58 ? 73.770 67.765 34.057  1.00 30.65  ? ? ? ? ? ? 58  1MA A P     1 
ATOM   1253 O  OP1   . 1MA A 1 58 ? 72.638 67.886 33.105  1.00 32.84  ? ? ? ? ? ? 58  1MA A OP1   1 
ATOM   1254 O  OP2   . 1MA A 1 58 ? 73.621 68.229 35.450  1.00 29.49  ? ? ? ? ? ? 58  1MA A OP2   1 
ATOM   1255 O  "O5'" . 1MA A 1 58 ? 74.315 66.273 34.254  1.00 28.81  ? ? ? ? ? ? 58  1MA A "O5'" 1 
ATOM   1256 C  "C5'" . 1MA A 1 58 ? 74.592 65.439 33.080  1.00 29.42  ? ? ? ? ? ? 58  1MA A "C5'" 1 
ATOM   1257 C  "C4'" . 1MA A 1 58 ? 74.279 63.972 33.383  1.00 33.42  ? ? ? ? ? ? 58  1MA A "C4'" 1 
ATOM   1258 O  "O4'" . 1MA A 1 58 ? 74.880 63.685 34.667  1.00 32.36  ? ? ? ? ? ? 58  1MA A "O4'" 1 
ATOM   1259 C  "C3'" . 1MA A 1 58 ? 72.789 63.573 33.509  1.00 35.13  ? ? ? ? ? ? 58  1MA A "C3'" 1 
ATOM   1260 O  "O3'" . 1MA A 1 58 ? 72.625 62.168 33.250  1.00 36.80  ? ? ? ? ? ? 58  1MA A "O3'" 1 
ATOM   1261 C  "C2'" . 1MA A 1 58 ? 72.560 63.667 35.012  1.00 34.80  ? ? ? ? ? ? 58  1MA A "C2'" 1 
ATOM   1262 O  "O2'" . 1MA A 1 58 ? 71.525 62.828 35.506  1.00 36.27  ? ? ? ? ? ? 58  1MA A "O2'" 1 
ATOM   1263 C  "C1'" . 1MA A 1 58 ? 73.908 63.150 35.551  1.00 33.62  ? ? ? ? ? ? 58  1MA A "C1'" 1 
ATOM   1264 N  N9    . 1MA A 1 58 ? 74.284 63.494 36.930  1.00 30.36  ? ? ? ? ? ? 58  1MA A N9    1 
ATOM   1265 C  C8    . 1MA A 1 58 ? 73.887 64.574 37.688  1.00 34.55  ? ? ? ? ? ? 58  1MA A C8    1 
ATOM   1266 N  N7    . 1MA A 1 58 ? 74.415 64.610 38.899  1.00 33.32  ? ? ? ? ? ? 58  1MA A N7    1 
ATOM   1267 C  C5    . 1MA A 1 58 ? 75.204 63.469 38.953  1.00 33.37  ? ? ? ? ? ? 58  1MA A C5    1 
ATOM   1268 C  C6    . 1MA A 1 58 ? 76.031 62.941 39.948  1.00 33.58  ? ? ? ? ? ? 58  1MA A C6    1 
ATOM   1269 N  N6    . 1MA A 1 58 ? 76.184 63.488 41.134  1.00 41.19  ? ? ? ? ? ? 58  1MA A N6    1 
ATOM   1270 N  N1    . 1MA A 1 58 ? 76.708 61.803 39.669  1.00 34.48  ? ? ? ? ? ? 58  1MA A N1    1 
ATOM   1271 C  CM1   . 1MA A 1 58 ? 77.649 61.222 40.626  1.00 31.43  ? ? ? ? ? ? 58  1MA A CM1   1 
ATOM   1272 C  C2    . 1MA A 1 58 ? 76.527 61.216 38.479  1.00 28.43  ? ? ? ? ? ? 58  1MA A C2    1 
ATOM   1273 N  N3    . 1MA A 1 58 ? 75.793 61.624 37.453  1.00 31.67  ? ? ? ? ? ? 58  1MA A N3    1 
ATOM   1274 C  C4    . 1MA A 1 58 ? 75.142 62.771 37.747  1.00 33.02  ? ? ? ? ? ? 58  1MA A C4    1 

While I have not tested exhaustively, it seems true that PDBx/mmCIF has adopted a different definition of what constitutes a HETATM residue. It is worth noting that results from 3DNA and DSSR/SNAP are not affected by the conflicting assignments.
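
For readers who want to check other entries, here is a rough Python sketch that counts ATOM vs. HETATM records for a given residue name. It is not how 3DNA or DSSR reads coordinates. The PDB branch uses the fixed columns of the format (record name in columns 1-6, residue name in columns 18-20). The mmCIF branch simply assumes the token order seen in the excerpt above (record type first, residue name sixth); a proper parser should read the loop_ header instead. Both function names are made up for illustration.

```python
from collections import Counter


def record_types_pdb(lines, resname):
    """Count ATOM/HETATM records for one residue name in PDB-format lines."""
    counts = Counter()
    for line in lines:
        record = line[:6].strip()
        if record in ("ATOM", "HETATM") and line[17:20].strip() == resname:
            counts[record] += 1
    return counts


def record_types_mmcif(lines, resname):
    """Same count for atom_site rows laid out as in the excerpt above.

    Assumption: whitespace-separated tokens, record type at position 0
    and residue name at position 5.
    """
    counts = Counter()
    for line in lines:
        tokens = line.split()
        if len(tokens) > 5 and tokens[0] in ("ATOM", "HETATM") and tokens[5] == resname:
            counts[tokens[0]] += 1
    return counts


# Two lines from each excerpt above.
pdb_lines = [
    "HETATM 1252  P   1MA A  58      73.770  67.765  34.057  1.00 30.65           P  ",
    "HETATM 1253  OP1 1MA A  58      72.638  67.886  33.105  1.00 32.84           O  ",
]
cif_lines = [
    "ATOM   1252 P  P     . 1MA A 1 58 ? 73.770 67.765 34.057  1.00 30.65  ? ? ? ? ? ? 58  1MA A P     1 ",
    "ATOM   1253 O  OP1   . 1MA A 1 58 ? 72.638 67.886 33.105  1.00 32.84  ? ? ? ? ? ? 58  1MA A OP1   1 ",
]

print(dict(record_types_pdb(pdb_lines, "1MA")))
print(dict(record_types_mmcif(cif_lines, "1MA")))
```

Running the two functions over the full 1ehz.pdb and 1ehz.cif files, once per modified residue name, would be one way to check whether the observation holds for every modified nucleotide in the entry.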



The value of tiny and self-contained software in the big-data era

Nowadays, “big data” and “big science” are hot topics. They all sound good and certainly come about for a reason. Yet, to transform data to information to knowledge to understanding to wisdom, sophisticated software tools are required. The programs can be big and complicated, or small and self-contained, fitting different purposes. As long as they can get the claimed job done in a robust fashion, size should not be a concern.

Over the years, however, I have seen a trend of bloated software with many (fragile) dependencies in bioinformatics. Some tools are so picky and hard to use/maintain that instead of serving, they become sort of a master. As a representative example, I recently tried to install an open-source software package associated with a paper published just a few years ago in a leading journal. The software has only a few dependencies, yet some of them have already become obsolete. I spent hours each time, on Mac OS X and two versions of Ubuntu Linux, but failed to get it running properly (it always aborted with error messages). The download page hosting the software has been inactive since around the publication of the paper. Presumably, the PhD student or postdoc who wrote the code had left the lab, and with a paper published, all is done!

As an active practitioner of bioinformatics for well over a decade, I can confidently claim to be well above average in familiarity with Linux/Mac OS X and associated shell programming and make etc tools, and various common scripting and compiled programming languages. Yet, once in a while, I get frustrated when I try to download and install a software tool attached to a paper I am interested in. As I see it, the vast majority of software programs from research labs are publication-oriented: as long as a paper is published, it is finished.

From my experience, I always see software as engineering. It needs careful design and great attention to meticulous details. A sophisticated piece of scientific software is a combination of science and engineering. Expertise in domain knowledge is a must, and refined skills in computer programming are indispensable. The DSSR program I created and have continuously refined over the past three years represents what I believe scientific software should be.

Among other unique features, DSSR is tiny (< 1 MB), self-contained (without run-time dependencies) and runs on Windows, Mac OS X, and Linux. Getting DSSR up and running should take only minutes for anyone with basic familiarity with common computer systems. I have no doubt that the beauty of being small as represented by DSSR will be gradually appreciated by the community.



Open invitation on writing a DSSR plugin for PyMOL

Over the past few weeks, I’ve had the pleasure to talk to Thomas Holder, the PyMOL Principal Developer at Schrödinger, on possible integration of DSSR into PyMOL. On Tuesday April 21, 2015, I wrote to Thomas:

Last year, I had the pleasure to collaborate with Dr. Robert Hanson to integrate DSSR into Jmol, see I am wondering if you have any interest in connecting DSSR to PyMOL. This will not only benefit both parties, but also bring elaborate analyses of RNA structures to the general audience. As you may be aware, RNA is becoming increasingly important, yet the field of RNA structural bioinformatics is lagging (far) behind that of proteins.

After a few meet-ups, we all agreed that the DSSR-PyMOL integration project would be meaningful/significant for RNA structural bioinformatics. Moreover, the community not only can benefit from the end result, but also should be able to make direct contributions through the process. On Friday May 08, 2015, Thomas sent out the following open invitation, titled Someone interested in writing a DSSR plugin for PyMOL?, to the PyMOL mailing list:

Is anyone interested in writing a DSSR plugin for PyMOL? DSSR is an integrated software tool for Dissecting the Spatial Structure of RNA ( Among other things, DSSR defines the secondary structure of RNA from 3D atomic coordinates in a way similar to DSSP does for proteins. Most of its output could be translated 1:1 into PyMOL selections, making it available for coloring and other selection based features. A PyMOL plugin could act as a wrapper which runs DSSR for an object or atom selection. Xiang-Jun Lu, the author of DSSR, is also working on base pair visualization (see, similar to (but more advanced) what’s already available from 3DNA (

Xiang-Jun would be happy to collaborate with someone who has experience with Python and the PyMOL API for writing an extension or plugin. Please contact me if this sounds appealing to you.

Get DSSR from
See it hooked up with JSmol:

If you are self-motivated, care about software quality, have expertise in writing PyMOL plugins, and feel the pain in RNA structural analysis/visualization with currently available tools, now is the time to make a difference. The DSSR/PyMOL project would ideally be composed of a team of dedicated practitioners with complementary skills. We will communicate mostly via email or online forum, in a presumably open and highly interactive way. By working on the project, you will be able to sharpen your skills and make new friends. The end product would not only make RNA structural bioinformatics easier for yourself but also benefit the community at large.



Ambiguous 'analyze' and 'rebuild' program names

From the very beginning, 3DNA has contained two key programs, analyze and rebuild, for the analysis and rebuilding of nucleic acid 3D structures. The two names are short and to the point, but with one caveat. They are common verbs that can be easily picked up by other software packages. When 3DNA and such packages are installed on the same machine, naming clashes happen. If the 3DNA bin/ directory comes later in the search path, the analyze or rebuild command may have nothing to do with nucleic acid structures at all. Naturally, this naming ambiguity can lead to confusion and frustration.

I’ve been aware of the rebuild program name conflict for a long time. Recently, I was surprised by another analyze program on my Mac OS X Yosemite. As shown in the following output, the analyze program seems to be installed via MacPorts, and it is about analyzing words in a dictionary file.

~ [540] which analyze
~ [541] analyze -h
correct syntax is:
analyze affix_file dictionary_file file_of_words_to_check
use two words per line for morphological generation

The ambiguous names are exactly the reason that I use x3dna-dssr and x3dna-snap for the two new programs I’ve been working on over the past few years. As for the analyze and rebuild programs in 3DNA v2.x, I’d rather leave them as is. 3DNA is now too widely used in other structural bioinformatics pipelines to allow for name changes without causing compatibility issues. On the positive side, once you know the problem, fixing it is straightforward. This post is to raise the awareness of the 3DNA user community about such naming conflicts.
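
To see whether such a clash exists on a given machine, one can list every executable of a given name in search-path order. The Python sketch below does that with the standard library only; the function name find_all_on_path is made up for this example. The first hit is the one the shell would run.

```python
import os


def find_all_on_path(name, path=None):
    """Return every executable called `name`, in PATH search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        # An empty PATH entry conventionally means the current directory.
        candidate = os.path.join(directory or ".", name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


for program in ("analyze", "rebuild"):
    hits = find_all_on_path(program)
    print(program, "->", hits if hits else "not found")
```

If the 3DNA copy is not the first entry in the list, either move the 3DNA bin/ directory earlier in PATH or call the program by its full path.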



Name of base atoms in PDB formats

Canonical bases (A, C, G, T and U) in nucleic acid structures have standard atom names, shown below using the Watson-Crick A–T and G–C pairs. Ring atoms of adenine, for example, are named (N1, C2, N3, C4, C5, C6, N7, C8, N9) respectively.

Watson-Crick base pairs

Four characters are reserved for atom names in the PDB format. The convention, as seen in files downloaded from the RCSB PDB, is to put the two-character base name in the middle, as in .N1.. Note that here each dot (.) is used for a space character to make it stand out.

A long time ago, I became aware of a PDB format variant where the base name is left-aligned, as in N1... This case has ever since been properly handled by 3DNA (including DSSR and SNAP). While checking submitted entries to web-DSSR, I recently noticed yet another PDB format variation in labeling base names with the format of ..N1 (i.e., right-aligned). Without taking this special variant of PDB format into consideration, 3DNA/DSSR reported that “no nucleotides found!” Once the issue is known, however, fixing it is straightforward. As of May 4, 2015, 3DNA v2.2, DSSR and SNAP can all handle this special PDB variant correctly.

Over the years, I have come across many PDB variants claimed to be compliant with the loosely defined format. If you find 3DNA or DSSR is not working as expected, it is likely the coordinate file in the self-claimed ‘PDB format’ is at fault. Wherever practical, I’ve tried to incorporate as many non-standard variants as possible.
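
The three alignments discussed above can be folded into the RCSB layout with a few lines of Python. This is only a sketch of the idea, not the code used in 3DNA or DSSR, and the function name is hypothetical. It assumes a one-letter element symbol (C, N, O, P), which holds for base atoms; names of two-letter elements such as metal ions cannot be resolved this way without consulting the element column.

```python
def normalize_base_atom_name(field):
    """Map ' N1 ', 'N1  ' or '  N1' to the RCSB layout ' N1 '.

    Assumes a one-letter element symbol, so that the name starts in the
    second of the four characters reserved for atom names.
    """
    name = field.strip()
    if len(name) >= 4:
        return name  # a four-character name already fills the field
    return (" " + name).ljust(4)


for variant in (" N1 ", "N1  ", "  N1"):
    print(repr(variant), "->", repr(normalize_base_atom_name(variant)))
```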



Nucleic acid structures in the RCSB PDB

The NDB (Nucleic Acid Database) is a valuable resource dedicated to “information about experimentally-determined nucleic acids and complex assemblies.” Over the years, however, I’ve gradually switched from NDB to PDB (Protein Data Bank) for my research on nucleic acid structures. NDB is derived from PDB and presumably should contain all nucleic acid structures available in the PDB. However, at the time of this writing (on April 9, 2015), the NDB says: “As of 8-Apr-2015 number of released structures: 7430” and the PDB states “7611 Nucleic Acid Containing Structures”. So PDB has 7611-7430=181 more entries of nucleic acid structures than the NDB, possibly due to a lag in NDB’s processing of newly released PDB structures. Another issue is the inconsistency of the NDB identifier: early entries have e.g. bdl084 for B-DNA (355d in PDB), but now NDB seems to use the same id as the PDB (e.g., 4p5j).

The RCSB PDB maintains a weekly-updated, summary file named pdb_entry_type.txt in pure text format (check here for a list of useful summary files), containing “List of all PDB entries, identification of each as a protein, nucleic acid, or protein-nucleic acid complex and whether the structure was determined by diffraction or NMR.” An excerpt of the file is shown below:

108m    prot    diffraction
109d    nuc     diffraction
109l    prot    diffraction
109m    prot    diffraction
10gs    prot    diffraction
10mh    prot-nuc        diffraction
110d    nuc     diffraction
110l    prot    diffraction
102m    prot    diffraction
103d    nuc     NMR

Specifically, a nucleic acid structure contains the (sub)string nuc in the second field, where prot-nuc means a protein-RNA/DNA complex. This text file is trivial to parse, and the atomic coordinates files (in PDB or PDBx/mmCIF format) for all nucleic acid structures can be automatically downloaded from the RCSB PDB using a script.
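
As a small illustration of how simple the parsing is, here is a Python sketch that keeps only the entries whose second field contains nuc. The function name is made up for this example, and the download step is left out on purpose.

```python
def nucleic_acid_entries(lines):
    """Yield (pdb_id, entry_type, method) for entries whose type contains 'nuc'."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and "nuc" in fields[1]:
            yield fields[0], fields[1], fields[2]


# A few lines taken from the excerpt above.
excerpt = """\
108m    prot    diffraction
109d    nuc     diffraction
10mh    prot-nuc        diffraction
103d    nuc     NMR
"""

for pdb_id, entry_type, method in nucleic_acid_entries(excerpt.splitlines()):
    print(pdb_id, entry_type, method)
```

With the full pdb_entry_type.txt, the same function can be fed an open file object, since it only needs an iterable of lines.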

It is worth noting that DSSR is checked against all nucleic acid structures in the PDB at the time of each release to ensure that it does not crash. I update my local copy of nucleic acid structures each week, and run DSSR on the new entries. This process not only provides me an opportunity to keep pace with new developments in the field but also allows me to keep refining DSSR as needs arise.



The do_x3dna paper by Kumar and Grubmuller in Bioinformatics

Today, I noticed the paper do_x3dna: A tool to analyze structural fluctuations of dsDNA or dsRNA from molecular dynamics simulations by Kumar and Grubmuller in Bioinformatics (advance access published April 2, 2015). The summary reads:

The do_x3dna package has been developed to analyze the structural fluctuations of DNA or RNA during molecular dynamics simulations. It extends the capability of the 3DNA package to GROMACS MD trajectories and includes new methods to calculate the global-helical axis of DNA and bending fluctuations during simulations. The package also includes a Python module dnaMD to perform and visualize statistical analyses of complex data obtained from the trajectories.

I became aware of the do_x3dna package through the 3DNA Forum, and wrote a post, DNA/RNA molecular dynamics trajectory analysis with do_x3dna, on September 3, 2014. With this formal publication, the do_x3dna package will be more widely used, and 3DNA is likely to gain more recognition in the increasingly relevant MD field.



Modified pseudouridines

Pseudouridine (5-ribosyluracil, PSU) is the most abundant modified nucleotide in RNA. It is unique in that it has a C-glycosidic bond (C-C1′) instead of the N-glycosidic bond (N-C1′) common to all other nucleotides, canonical or modified. In 3DNA, the one-letter code for PSU is assigned to the upper case ‘P’, reserving the lower case ‘p’ for its modified variants. Distinguishing PSU from standard U (or T) is important for deriving sensible base-pair parameters and the χ torsion angle.

Pseudouridine (PSU) 3TD -- N3-methylated PSU

Recently, I came across 3TD (see figure above) in PDB entry 5afi. 3TD is a modified variant of PSU, with a methyl group attached to N3. In 3DNA v2.1-2015mar11, 3TD is abbreviated to ‘p’ to signify its connection to PSU.

In the list of recognized nucleotides (‘baselist.dat’) distributed with 3DNA, there are two other residues mapped to ‘p’: FHU and P2U (see figure below). As is often the case, it is the chemical structure, not the 3-letter PDB ligand identifier (or even full chemical name), that shows clearly to what 3DNA 1-letter abbreviation a residue matches.

FHU -- 5-fluoro-6-hydroxy-pseudouridine P2U -- 2'-deoxy-pseudouridine



Exterior loop in RNA secondary structure

A single-stranded RNA molecule can fold back onto itself to form various loops delineated by double helical stems, as shown in the figure below [taken from the Nearest Neighbor Database website from the Turner group].

Various loops in RNA secondary structure

Of special note is the exterior loop (at the bottom) which includes the 5′ and 3′ ends of the sequence. The Mfold User Manual defines the exterior loop as such:

The collection of bases and base pairs not accessible from any base pair is called the exterior (or external) loop … . It is worth noting that if we imagine adding a 0th and an (n + 1)st base to the RNA, and a base pair 0.(n+1), then the exterior loop becomes the loop closed by this imaginary base pair. … The exterior loop exists only in linear RNA.

While each of the other loops (hairpin, bulge, internal or junction) forms a closed ‘circle’ with two neighboring bases connected by either a canonical pair or backbone covalent bond, the ‘exterior loop’ has only an imaginary pair to close the 5′ and 3′ ends of the sequence. Moreover, the two ends of an RNA molecule are not necessarily close in three-dimensional space, as may be implied in the above secondary structure diagram. For example, in the H-type pseudoknotted structure 1ymo from human telomerase RNA, the 5′ and 3′ ends are on opposite sides.

DSSR does not have the concept of an ‘exterior loop’ due to its lack of a closing pair to form a ‘circle’. Instead, each of the 5′ and 3′ dangling ends is taken as a ‘non-loop single-stranded segment’, if applicable. For the crystal structure of yeast phenylalanine tRNA (1ehz, see the figure at the bottom), the relevant portion of DSSR output is as below. Note that since the 5′ end is paired, only the single-stranded region at the 3′ end is listed. Presumably, the ‘exterior loop’ in this case would also include the G1–C72 pair, with the imaginary closing pair connecting G1 and A76.

List of 1 non-loop single-stranded segment
   1 nts=4 ACCA A.A73,A.C74,A.C75,A.A76

yeast phenylalanine tRNA



DSSR-derived DBN for an input entry with multiple RNA molecules

Dot bracket notation (dbn) is a popular format to represent RNA secondary structures. Initially introduced by the ViennaRNA package, dbn uses dots (.) for unpaired bases, and matched parentheses () for the canonical Watson-Crick A-T and G-C or the wobble G-U pairs. This compact representation was designed for fully nested (i.e., pseudoknot free) RNA secondary structures in a single RNA molecule. Over the years, it has been extended to cover pseudoknots (of possibly higher orders) using matched pairs of [], {}, and <> etc.

To derive dbn from three-dimensional atomic coordinates with DSSR, I was faced with the issue of how to represent multiple RNA chains (molecules). A closely related yet practical problem is chain breaks, as in x-ray crystal structures where disordered regions may not have fitted coordinates. I searched but failed to find any ‘standard’ way to account for chain breaks or multiple molecules in dbn. The commonly used programs for visualizing RNA secondary structure diagrams that I tested at that time did not take such cases into consideration; they simply showed all bases as if they were from a single continuous RNA chain.

I discussed the issue with Dr. Yann Ponty, the maintainer of the popular VARNA program. After a few rounds of email exchanges, we introduced an extra symbol (&) in both sequence and dbn to designate multiple chains or breaks within a chain to communicate between DSSR and VARNA.
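
To make the role of the extra symbol concrete, here is a minimal Python sketch of a dbn reader that treats ‘&’ as a separator rather than a nucleotide. It is an illustration only, not the code in DSSR or VARNA, and parse_dbn is a made-up name. It handles the four bracket types mentioned above and reports pairs as 1-based nucleotide indices.

```python
OPENERS = {"(": ")", "[": "]", "{": "}", "<": ">"}
CLOSERS = {close: open_ for open_, close in OPENERS.items()}


def parse_dbn(dbn, separator="&"):
    """Return (pairs, breaks) for a dot-bracket string.

    pairs  -- sorted list of 1-based (i, j) nucleotide indices
    breaks -- indices of the nucleotides followed by a separator
    Separators mark chain boundaries or chain breaks and are not
    counted as nucleotides.
    """
    stacks = {open_: [] for open_ in OPENERS}
    pairs, breaks = [], []
    index = 0
    for symbol in dbn:
        if symbol == separator:
            breaks.append(index)
            continue
        index += 1
        if symbol in OPENERS:
            stacks[symbol].append(index)
        elif symbol in CLOSERS:
            stack = stacks[CLOSERS[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {index}")
            pairs.append((stack.pop(), index))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    leftovers = sorted(i for stack in stacks.values() for i in stack)
    if leftovers:
        raise ValueError(f"unmatched opening bracket(s) at {leftovers}")
    return sorted(pairs), breaks


# Hypothetical duplex of two four-nucleotide strands.
print(parse_dbn("((((&))))"))
```

Each bracket type gets its own stack, so a pseudoknot written with [] is paired independently of the () pairs it crosses.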

As an example, the DSSR-derived dbn for the double-stranded DNA structure 355d (the famous Dickerson dodecamer) is as below:

Secondary structures in dot-bracket notation (dbn) as a whole and per chain
>355d nts=24 [whole]
>355d-A #1 nts=12 [chain] DNA
>355d-B #2 nts=12 [chain] DNA

As another example, the PDB entry 2fk6 contains a tRNA with chain breaks — nucleotides 26 to 45 are missing from the structure (see figure below). The DSSR-derived dbn is as follows — note the * at the end of the header line.

>2fk6-R #1 nts=53 [chain] RNA*

2FK6: RNAse Z/tRNA(Thr) complex with chain break

It is worth mentioning a subtle point in DSSR-derived dbn with multiple chains, i.e., the order of the chains may make a difference! The point is best illustrated with a concrete example — here, 4un3, the crystal structure of Cas9 bound to PAM-containing DNA target. Based on the data file downloaded directly from the PDB (4un3.pdb), the relevant portions of DSSR output are:

Special notes:
   o Cross-paired segments in separate chains, be *careful* with .dbn

This structure contains *1-order pseudoknot
   o You may want to run DSSR again with the '--nested' option which removes
     pseudoknots to get a fully nested secondary structure representation.
   o The DSSR-derived dbn may be problematic (see notes above).

Secondary structures in dot-bracket notation (dbn) as a whole and per chain
>4un3 nts=120 [whole]
>4un3-A #1 nts=81 [chain] RNA
>4un3-C #2 nts=28 [chain] DNA
>4un3-D #3 nts=11 [chain] DNA

The notes in the DSSR output are worth paying attention to. Specifically, DSSR reports a “*1-order pseudoknot” (note also the *!). Here the target DNA chain C comes before DNA chain D in the PDB file. The 5′-end bases in chain C pair with bases in D, and the 3′-end bases in C pair with RNA bases in chain A. There exist pairs crossing along the ‘linear’ sequence position-wise, hence the reported “pseudoknot”. However, by simply swapping DNA chains C and D, i.e., moving chain D before C (in file 4un3-ADC.pdb), the “pseudoknot” disappears, as shown below:

Secondary structures in dot-bracket notation (dbn) as a whole and per chain
>4un3-ADC nts=120 [whole]
>4un3-ADC-A #1 nts=81 [chain] RNA
>4un3-ADC-D #2 nts=11 [chain] DNA
>4un3-ADC-C #3 nts=28 [chain] DNA
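
The effect of chain order can be reproduced with a toy model. The Python sketch below uses made-up chain lengths and pairs (not the real 4un3 data) that mimic the situation described above: the 5′ end of chain C pairs with D and its 3′ end pairs with A. It converts per-chain positions to ‘linear’ indices for a given chain order and tests whether any two pairs cross. All names are hypothetical.

```python
from itertools import combinations


def global_pairs(chain_lengths, order, pairs):
    """Translate ((chain, pos), (chain, pos)) pairs into global (i, j) indices."""
    offset, start = 0, {}
    for chain in order:
        start[chain] = offset
        offset += chain_lengths[chain]
    result = []
    for (c1, p1), (c2, p2) in pairs:
        i, j = start[c1] + p1, start[c2] + p2
        result.append((min(i, j), max(i, j)))
    return sorted(result)


def has_crossing(pairs):
    """True if any two pairs (i, j) and (k, l) interleave: i < k < j < l."""
    return any(i < k < j < l or k < i < l < j
               for (i, j), (k, l) in combinations(pairs, 2))


# Toy model with hypothetical lengths; positions are 1-based within a chain.
lengths = {"A": 4, "C": 4, "D": 2}
pairs = [
    (("C", 1), ("D", 2)), (("C", 2), ("D", 1)),  # 5' end of C pairs with D
    (("C", 3), ("A", 2)), (("C", 4), ("A", 1)),  # 3' end of C pairs with A
]

print("A, C, D:", has_crossing(global_pairs(lengths, "ACD", pairs)))
print("A, D, C:", has_crossing(global_pairs(lengths, "ADC", pairs)))
```

With the order A, C, D the C–D and C–A pairs interleave, whereas with A, D, C all pairs are nested, which mirrors the disappearance of the “pseudoknot” in the reordered file.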

Notes added on March 19, 2015

  • It has been drawn to my attention that NUPACK uses ‘+’ instead of ‘&’ as the symbol to separate multiple chains (or chain breaks). In fact, DSSR has an undocumented option --dbn_break which can be set to any of the characters in the string &.:,|+. The ‘&’ symbol was chosen for communication with VARNA which requires ‘&’, at least up to now. This is an excellent example showing the efforts that I have put into the little details while developing DSSR.
  • The issue on proper ordering of multiple chains to avoid crossing lines (false pseudoknots) has been formally addressed by Dirks et al. in their 2007 article titled Thermodynamic analysis of interacting nucleic acid strands (SIAM Rev, 49, 65-88), specifically in Section 2.1 (Fig. 2.1). Applying that algorithm to nucleic acid structures, however, is beyond the scope of DSSR. The program strictly respects the ordering of chains and nucleotides within a given PDB or PDBx/mmCIF file, but outputs warning messages where necessary to draw users’ attention. As another example, I’ve recently noticed that DNA duplexes produced by Maestro (a product of Schrödinger) list nucleotides of the complementary strand in 3′ to 5′ order to match the 5′ to 3′ directionality of the leading strand for each Watson-Crick pair (See below).
Special notes:
   o nucleotides out of order

Secondary structures in dot-bracket notation (dbn) as a whole and per chain
>ga62_ca62_1m_in nts=24 [whole]
>ga62_ca62_1m_in-1-A #1 nts=12 [chain] DNA
>ga62_ca62_1m_in-1-B #2 nts=12 [chain] DNA



The Biophysical Society (BPS) 59th annual meeting at Baltimore

I’m going to attend the Biophysical Society (BPS) 59th Annual Meeting to be held during February 7-11 in Baltimore, Maryland. At last year’s BPS annual meeting (San Francisco, California), I was delighted to come across a few 3DNA users at poster sessions. I thought this post might help to connect me with some DSSR/3DNA users at the coming meeting.

Want to have a meetup at Baltimore? Please drop me a message!



Weird atom names of ligand thiamine pyrophosphate (TPP)

Recently I came across the ligand thiamine pyrophosphate (TPP) in some RNA riboswitch structures. I was a bit surprised by the atom names adopted for the ligand by the PDB. See figures below for the chemical structure of TPP from the RCSB PDB website (first), and the three-dimensional structure of the ligand from the riboswitch 2gdi (second).

Chemical structure of ligand thiamine pyrophosphate

Ligand thiamine pyrophosphate in PDB entry 2gdi

Specifically, the planar base-like moiety at the right has atom names ending with prime. To my knowledge, only sugar atom names of DNA and RNA nucleotides have the prime suffix, such as the 2′-hydroxyl group in RNA.

The RCSB webpage for TPP shows that currently there are 107 entries in the PDB, among which 100 are from proteins, 6 from RNA, and one in an RNA-protein complex. It is not clear to me whether the prime-bearing names in TPP follow any documented ‘standard’ or convention. DSSR is nevertheless taking note of such ‘weird’ cases.



The 3DNA Forum registered users have reached 2000

As of today, the number of registered users on the 3DNA Forum has reached 2000. Over the past three years, the annual average of registrations is 650, corresponding to approximately 1.8 per day. While many registrations use free email services (gmail, hotmail, yahoo, etc.), a significant portion (especially the more recent ones) use work email addresses (e.g., .edu). This is a clear sign of the increasing trust the community puts in the Forum.

To keep the 3DNA Forum spam-free, I’ve adhered to a zero-tolerance policy on any trolling or suspicious activities. The anti-spam software has played a big role in making this clean status feasible, as is evident from the note: “120,933 Spammers blocked up until today”.

From a scientific perspective, all posted questions have been addressed promptly, normally within hours. Instead of feeling like a burden, maintaining the Forum and answering user questions have been a pleasure. I’d love to see more questions or posts on the Forum.



Characterization of H-type pseudoknots with DSSR

The v1.2.1 (2015feb01) release of DSSR contains a new functionality to characterize the so-called H-type pseudoknots. In this classical and most common type of pseudoknot, nucleotides from a hairpin loop form Watson-Crick base pairs with a single-stranded region outside of the hairpin to create another (adjacent) stem, as shown in the following illustration (taken from the Huang et al. paper A heuristic approach for detecting RNA H-type pseudoknots).

Schematic diagram of the H-type pseudoknot

Normally, L2 is absent (i.e., with zero nucleotides) due to direct coaxial stacking of the two stems. An example output of DSSR on 1ymo (a human telomerase RNA pseudoknot) is shown below:

3D and secondary structures of an H-type pseudoknot (1ymo)

The corresponding sections from DSSR output are:

List of 3 H-type pseudoknot loop segments
   1 stem#1(hairpin#1) vs stem#2(hairpin#2) L1 groove=MAJOR nts=8 UUUUUCUC U7,U8,U9,U10,U11,C12,U13,C14
   2 stem#1(hairpin#1) vs stem#2(hairpin#2) L2 groove=----- nts=0
   3 stem#1(hairpin#1) vs stem#2(hairpin#2) L3 groove=minor nts=8 CAAACAAA C30,A31,A32,A33,C34,A35,A36,A37

Secondary structures in dot-bracket notation (dbn) as a whole and per chain
>1ymo-1-A #1 nts=47 [chain] RNA

Checking against the three-dimensional image and the secondary structure in linear form shown above, the meaning of the new section should be obvious. If you want to see more details, click the link to the DSSR-output file on 1ymo.
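For readers who want to post-process this section, here is a minimal Python sketch that parses the three loop-segment lines shown above. The function name parse_loop_segment and the dictionary keys are hypothetical (they are not part of DSSR), and the sketch assumes the whitespace-separated layout seen in the 1ymo listing.

```python
# Hypothetical helper (not part of DSSR): parse the loop-segment lines of the
# "H-type pseudoknot loop segments" section, assuming the layout shown for 1ymo.
SAMPLE = """\
   1 stem#1(hairpin#1) vs stem#2(hairpin#2) L1 groove=MAJOR nts=8 UUUUUCUC U7,U8,U9,U10,U11,C12,U13,C14
   2 stem#1(hairpin#1) vs stem#2(hairpin#2) L2 groove=----- nts=0
   3 stem#1(hairpin#1) vs stem#2(hairpin#2) L3 groove=minor nts=8 CAAACAAA C30,A31,A32,A33,C34,A35,A36,A37
"""


def parse_loop_segment(line):
    """Split one loop-segment line into a small dictionary."""
    fields = line.split()
    # fields: index, stem, 'vs', stem, loop label, groove=..., nts=..., [sequence, nt list]
    groove = fields[5].split("=", 1)[1]
    count = int(fields[6].split("=", 1)[1])
    return {
        "loop": fields[4],
        # A run of dashes means no groove is assigned (empty loop)
        "groove": None if set(groove) == {"-"} else groove.lower(),
        "count": count,
        "sequence": fields[7] if count else "",
        "nts": fields[8].split(",") if count else [],
    }


segments = [parse_loop_segment(line) for line in SAMPLE.splitlines() if line.strip()]
for seg in segments:
    print(seg["loop"], seg["groove"], seg["count"], seg["sequence"])
```

An empty L2 shows up here as a segment with count 0 and no groove, which matches the coaxial-stacking case described above.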



Two more citations to DSSR

Recently I came across the following two citations to DSSR:

Base pair types were annotated with RNAview (45,46). Hydrogen bonds were annotated manually and with the help of DSSR of the 3DNA package (47,48). Helix parameters were obtained using the Curves+ web server (49). Structural figures were prepared using PyMol (50).

It is interesting to note that DSSR is cited here for its identification of hydrogen bonds, not its annotation of base pairs, among many other features. The simple geometry-based H-bonding identification algorithm, originally implemented in find_pair/analyze of 3DNA (and adopted by RNAView) and highly refined in DSSR, works well for nucleic acid structures. With the --get-hbonds option, users can now use DSSR as a tool just for its list of H-bonds outside of the program.

All figures were generated using PyMOL (60) or Chimera (48). The secondary structure diagram of the human mitoribosomal RNA was prepared by extracting base pairs from the model using DSSR (61). The secondary structure diagram was drawn in VARNA (62) and finalized in Inkscape.

I am very pleased to see that DSSR was cited for its ‘intended’ use in this important piece of work from a leading laboratory in structural biology. In the middle of last November (2013), I was approached by the lead author for proper citation of DSSR, and I suggested the two 3DNA papers. As far as I can remember, this was the first time I received such a question on DSSR citation. It prompted me to write a FAQ entry in the DSSR User Manual, titled “How to cite DSSR?”. Hopefully, this citation issue will be gone in the near future.

Over the past two years, I’ve devoted significant efforts to make DSSR a handy tool for RNA structural bioinformatics; it certainly represents my view as to what a scientific software program should be like. As time passes by, DSSR is becoming increasingly sophisticated and citations to DSSR can only increase.



Processing large structures in mmCIF format

Recently, the PDB began releasing atomic coordinates of large (ribosomal) structures in mmCIF format. For nucleic-acid-containing structures, the largest one so far is 4v4g, the crystal structure of five 70S ribosomes from Escherichia coli in complex with protein Y. It is assembled from ten PDB entries (1voq, 1vor, 1vos, 1vou, 1vov, 1vow, 1vox, 1voy, 1voz, 1vp0), consisting of 22,345 nucleotides, and a total of 717,805 atoms.

This humongous structure poses no problems to DSSR at all, as shown below.

Command: x3dna-dssr -i=4v4g.cif -o=4v4g.out
Processing file '4v4g.cif' [4v4g]

total number of base pairs: 9277
total number of multiplets: 918
total number of helices: 1099
total number of stems: 1221
total number of isolated WC/wobble pairs: 603
total number of atom-base stacking interactions: 1736
total number of hairpin loops: 504
total number of bulges: 170
total number of internal loops: 775
total number of junctions: 214
total number of non-loop single-stranded segments: 429
total number of kissing loops: 5
total number of A-minor (type I and II) motifs: 100
total number of ribose zippers: 58 (1159)
total number of kink turns: 39

Time used: 00:00:10:45

It took less than 11 minutes to run on an iMac (and nearly 14 minutes on an Ubuntu Linux machine). Given the



DNA/RNA molecular dynamics trajectory analysis with do_x3dna

With great pleasure, I read the following announcement from Rajendra Kumar on the 3DNA Forum:

Re: do_x3dna: a tool to analyze DNA/RNA in molecular dynamics trajectories 
« Reply #1 on: Today at 10:53:31 AM »


I have now made a new website for do_x3dna.
This website contains detailed
documentation for do_x3dna program and Python APIs.

Documentation for Python API is now available

Few tutorials about the Python APIs are also now available


With best regards,

Browsing through the do_x3dna website, I am impressed by the extensive documentation and tutorials. Clearly, do_x3dna has pushed the boundaries (in applicability and documentation) of the x3dna_ensemble Ruby script distributed with 3DNA v2.1.

As noted on its GitHub page, do_x3dna has been developed to analyze fluctuations in DNA or RNA structures in molecular dynamics (MD) trajectories. It can be used for GROMACS MD trajectories, as well as those from NAMD and AMBER. There is no doubt that do_x3dna will boost 3DNA’s applications in the increasingly active field of DNA/RNA MD simulations.



List of modified nucleotides in DSSR output

From early on, 3DNA and DSSR have had native support for modified nucleotides. The baselist.dat file currently distributed with 3DNA contains over 700 entries. As of v1.1.4-2014aug09, a new section has been added to DSSR to list explicitly the modified nucleotides in an analyzed structure.

Using the 76-nucleotide long yeast phenylalanine tRNA (1ehz) as an example, the pertinent section in DSSR output is as below.

List of 11 types of 14 modified nucleotides
      nt    count  list
   1 1MA-a    1    A.1MA58
   2 2MG-g    1    A.2MG10
   3 5MC-c    2    A.5MC40,A.5MC49
   4 5MU-t    1    A.5MU54
   5 7MG-g    1    A.7MG46
   6 H2U-u    2    A.H2U16,A.H2U17
   7 M2G-g    1    A.M2G26
   8 OMC-c    1    A.OMC32
   9 OMG-g    1    A.OMG34
  10 PSU-P    2    A.PSU39,A.PSU55
  11 YYG-g    1    A.YYG37

So 1ehz has 14 modified nucleotides of 11 different types, as listed in the rows following the header line. The meaning of each column should be obvious. For example, the third row means that 5MC (5-methylcytidine, abbreviated as 'c' in 1-letter code) occurs twice, identified as A.5MC40 and A.5MC49, respectively.
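The table above is easy to read into a script. The following is a minimal Python sketch under the assumption that each data row has exactly four whitespace-separated fields, as in the 1ehz listing; the function name parse_modified_nts and the returned structure are hypothetical, not something DSSR provides.

```python
# Hypothetical parser (not part of DSSR) for the modified-nucleotide section,
# assuming rows of the form: index, "XXX-x", count, comma-separated ids.
SAMPLE = """\
List of 11 types of 14 modified nucleotides
      nt    count  list
   1 1MA-a    1    A.1MA58
   2 2MG-g    1    A.2MG10
   3 5MC-c    2    A.5MC40,A.5MC49
   4 5MU-t    1    A.5MU54
   5 7MG-g    1    A.7MG46
   6 H2U-u    2    A.H2U16,A.H2U17
   7 M2G-g    1    A.M2G26
   8 OMC-c    1    A.OMC32
   9 OMG-g    1    A.OMG34
  10 PSU-P    2    A.PSU39,A.PSU55
  11 YYG-g    1    A.YYG37
"""


def parse_modified_nts(text):
    """Map each 3-letter id to its 1-letter code and list of nucleotide ids."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or not fields[0].isdigit():
            continue  # skip the title, the column header, and blank lines
        three_letter, one_letter = fields[1].rsplit("-", 1)
        ids = fields[3].split(",")
        if int(fields[2]) != len(ids):
            raise ValueError(f"count mismatch for {three_letter}")
        result[three_letter] = {"one_letter": one_letter, "ids": ids}
    return result


modified = parse_modified_nts(SAMPLE)
total = sum(len(entry["ids"]) for entry in modified.values())
print(len(modified), "types,", total, "modified nucleotides")
```

The count check is a small safeguard: it raises an error if the count column and the id list ever disagree.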

With the 3-letter id, one can search the RCSB ligand database for more information about a specified modified nucleotide. The URL would be like this, using pseudouridine (PSU) as an example,

It is hoped that the newly added section, put at the very top of DSSR output, will draw more attention to modified nucleotides.



DSSR-derived secondary structure in BPSEQ format

From v1.1.3-2014jun18, DSSR has an additional output of RNA secondary structures in BPSEQ format. A sample file for PDB entry 1msy is shown below.

1msy [GUAA tetra loop] in 3d and 2d representations

Filename: dssr-2ndstrs.bpseq
Organism: DSSR-derived secondary structure [1msy]
Accession Number: DSSR v1.1.4-2014aug09 (
Citation: Please cite 3DNA/DSSR (see
    1 U     0 # name=A.U2647
    2 G    26 # name=A.G2648, pairedNt=A.U2672
    3 C    25 # name=A.C2649, pairedNt=A.G2671
    4 U    24 # name=A.U2650, pairedNt=A.A2670
    5 C    23 # name=A.C2651, pairedNt=A.G2669
    6 C    22 # name=A.C2652, pairedNt=A.G2668
    7 U     0 # name=A.U2653
    8 A     0 # name=A.A2654
    9 G     0 # name=A.G2655
   10 U     0 # name=A.U2656
   11 A     0 # name=A.A2657
   12 C    17 # name=A.C2658, pairedNt=A.G2663
   13 G     0 # name=A.G2659
   14 U     0 # name=A.U2660
   15 A     0 # name=A.A2661
   16 A     0 # name=A.A2662
   17 G    12 # name=A.G2663, pairedNt=A.C2658
   18 G     0 # name=A.G2664
   19 A     0 # name=A.A2665
   20 C     0 # name=A.C2666
   21 C     0 # name=A.C2667
   22 G     6 # name=A.G2668, pairedNt=A.C2652
   23 G     5 # name=A.G2669, pairedNt=A.C2651
   24 A     4 # name=A.A2670, pairedNt=A.U2650
   25 G     3 # name=A.G2671, pairedNt=A.C2649
   26 U     2 # name=A.U2672, pairedNt=A.G2648
   27 G     0 # name=A.G2673

Based on online sources, BPSEQ originated from the Comparative RNA Web (CRW) site developed by the Gutell lab. CRW files contain four header lines, describing the file name, organism, accession number, and a general remark. Thereafter, there is one line per base in the molecule, listing the position of the base (starting from 1), the one-letter base name (A, C, G, U, etc.), and the position number of the base to which it is paired. If the base is unpaired, zero (0) is put in the third column. In the above sample BPSEQ file derived from DSSR, detailed information about the base and its paired base (if any) comes after the # symbol.
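The three-column layout described above can be read with a few lines of Python. This is a minimal sketch, not an official BPSEQ reader: the names parse_bpseq and base_pairs are hypothetical, header lines are skipped simply because their first token is not an integer, and only two of the four header lines are reproduced in the sample.

```python
# Hypothetical BPSEQ reader; the sample is the DSSR-derived 1msy file shown
# above (with only the first two header lines kept).
SAMPLE = """\
Filename: dssr-2ndstrs.bpseq
Organism: DSSR-derived secondary structure [1msy]
    1 U     0 # name=A.U2647
    2 G    26 # name=A.G2648, pairedNt=A.U2672
    3 C    25 # name=A.C2649, pairedNt=A.G2671
    4 U    24 # name=A.U2650, pairedNt=A.A2670
    5 C    23 # name=A.C2651, pairedNt=A.G2669
    6 C    22 # name=A.C2652, pairedNt=A.G2668
    7 U     0 # name=A.U2653
    8 A     0 # name=A.A2654
    9 G     0 # name=A.G2655
   10 U     0 # name=A.U2656
   11 A     0 # name=A.A2657
   12 C    17 # name=A.C2658, pairedNt=A.G2663
   13 G     0 # name=A.G2659
   14 U     0 # name=A.U2660
   15 A     0 # name=A.A2661
   16 A     0 # name=A.A2662
   17 G    12 # name=A.G2663, pairedNt=A.C2658
   18 G     0 # name=A.G2664
   19 A     0 # name=A.A2665
   20 C     0 # name=A.C2666
   21 C     0 # name=A.C2667
   22 G     6 # name=A.G2668, pairedNt=A.C2652
   23 G     5 # name=A.G2669, pairedNt=A.C2651
   24 A     4 # name=A.A2670, pairedNt=A.U2650
   25 G     3 # name=A.G2671, pairedNt=A.C2649
   26 U     2 # name=A.U2672, pairedNt=A.G2648
   27 G     0 # name=A.G2673
"""


def parse_bpseq(text):
    """Return (sequence, partner), where partner maps position -> mate (0 = unpaired)."""
    sequence, partner = [], {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop the trailing comment
        if len(fields) != 3 or not fields[0].isdigit():
            continue  # header or blank line
        sequence.append(fields[1])
        partner[int(fields[0])] = int(fields[2])
    return "".join(sequence), partner


def base_pairs(partner):
    """List each pair once as (i, j) with i < j, checking that pairing is symmetric."""
    pairs = []
    for i, j in sorted(partner.items()):
        if j == 0:
            continue
        if partner.get(j) != i:
            raise ValueError(f"asymmetric pairing at position {i}")
        if i < j:
            pairs.append((i, j))
    return pairs


sequence, partner = parse_bpseq(SAMPLE)
pairs = base_pairs(partner)
print(sequence)
print(pairs)
```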

Compared to dot-bracket notation (dbn) and connect-table (.ct) format, BPSEQ is simpler but less expressive. Nevertheless, the format is well-supported in bioinformatic tools on RNA secondary structures. It only seems fitting that DSSR now produces secondary structures in .bpseq (with default file name dssr-2ndstrs.bpseq), in addition to .dbn and .ct. Technically, adding the BPSEQ output to DSSR is trivial given the infrastructure already in place.



RNA pseudoknot detection and removal with DSSR

From early on, DSSR-derived RNA secondary structures in dot-bracket notation (dbn) have taken pseudoknots into consideration. Nevertheless, in DSSR releases prior to v1.1.3-2014jun18, the dbn output had been simplified to the first level only, with matched []s, even for RNA structures with high-order pseudoknots. RNA pseudoknots are a (relatively) complicated issue, and I’d planned to put off the topic until DSSR was well established.

In early May, I noticed the Antczak et al. article RNApdbee—a webserver to derive secondary structures from pdb files of knotted and unknotted RNAs. I was delighted to read the following citation:

In order to facilitate a more comprehensive study, the webserver integrates the functionality of RNAView, MC-Annotate and 3DNA/DSSR, being the most common tools used for automated identification and classification of RNA base pairs.

Even before any paper on DSSR has been published, the software has already been ranked in the top three for the identification and classification of RNA base pairs! Well familiar with RNAView and MC-Annotate, I am glad to see DSSR is now listed on a par with them. Note that DSSR has far more functionality than just identifying and classifying RNA base pairs.

Further down the RNApdbee paper, especially in Figure 2, I found the following remarks regarding DSSR’s capability on RNA structures with high-order pseudoknots.

An arc diagram to represent the secondary structure of 1DDY (chain A)

An arc diagram to represent the secondary structure of 1DDY (chain A) generated by R-CHIE upon the dot-bracket notation. Arcs of the same colour define a paired region. Crossing arcs reflect a conflict observed between the corresponding regions. (a) RNApdbee recognizes pseudoknots of the first (dark green) and second (navy blue) order. (b) 3DNA/DSSR improperly classifies base pairs (within residues in red) and the structure is recognized as the first-order pseudoknot.

The above citation and the question Higher-order pseudoknots in DP output (from Jan Hajic, Charles University in Prague) on the 3DNA Forum prompted me to further refine DSSR’s algorithm for deriving secondary structures of RNA with high-order pseudoknots. The DSSR v1.1.3-2014jun18 release made this revised functionality explicit. For the above cited PDB entry 1ddy, the relevant output of running DSSR on it would be:

Running command: "x3dna-dssr -i=1ddy.pdb"

This structure contains 2-order pseudoknot(s)

Secondary structures in dot-bracket notation (dbn) as a whole and per chain
>1ddy nts=140 [whole]
>1ddy-A #1 nts=35 [chain] RNA
>1ddy-C #2 nts=35 [chain] RNA
>1ddy-E #3 nts=35 [chain] RNA
>1ddy-G #4 nts=35 [chain] RNA

Note that the whole 1ddy entry contains four RNA chains (A, C, E, and G), and DSSR can handle each properly. So at least from DSSR v1.1.3-2014jun18, the following statement is no longer valid:

3DNA/DSSR improperly classifies base pairs (within residues in red) and the structure is recognized as the first-order pseudoknot.

A closely related issue is knot removal, a topic nicely summarized by Smit et al. in their publication From knotted to nested RNA structures: A variety of computational methods for pseudoknot removal. While not explicitly documented, the --nested (abbreviated to --nest) option has been available since DSSR v1.1.3-2014jun18. This option was first mentioned in the release note of DSSR v1.1.4-2014aug09. Again, using PDB entry 1ddy as an example, the relevant output of running DSSR with option --nested is as follows:

Running command: "x3dna-dssr -i=1ddy.pdb --nested"

This structure contains 2-order pseudoknot(s)
   o You've chosen to remove pseudo-knots, leaving only nested pairs

Secondary structures in dot-bracket notation (dbn) as a whole and per chain
>1ddy nts=140 [whole]
>1ddy-A #1 nts=35 [chain] RNA
>1ddy-C #2 nts=35 [chain] RNA
>1ddy-E #3 nts=35 [chain] RNA
>1ddy-G #4 nts=35 [chain] RNA
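To make the notions of pseudoknot order and knot removal concrete, here is a small Python sketch. It is explicitly not DSSR's algorithm: it uses a simple first-fit heuristic that assigns each base pair to the lowest bracket level on which it crosses no other pair, and it treats "nested only" as keeping the first level. The pair list is a made-up toy H-type pseudoknot, and all function names are hypothetical.

```python
# Illustrative first-fit heuristic (NOT the algorithm used by DSSR).
BRACKETS = ["()", "[]", "{}", "<>"]  # one bracket type per level; 4 levels at most here


def crosses(p, q):
    """True if pairs p=(i, j) and q=(k, l) interleave (i < k < j < l or k < i < l < j)."""
    (i, j), (k, l) = p, q
    return i < k < j < l or k < i < l < j


def assign_levels(pairs):
    """Greedily place each pair on the lowest level where it crosses nothing."""
    levels = []
    for pair in sorted(pairs):
        for level in levels:
            if not any(crosses(pair, other) for other in level):
                level.append(pair)
                break
        else:
            levels.append([pair])  # open a new (higher) level
    return levels


def to_dbn(length, levels):
    """Render the given levels as a dot-bracket string of the given length."""
    chars = ["."] * length
    for n, level in enumerate(levels):
        open_ch, close_ch = BRACKETS[n]
        for i, j in level:
            chars[i - 1], chars[j - 1] = open_ch, close_ch
    return "".join(chars)


# Toy H-type pseudoknot (hypothetical data): stem 1 = 1-3/8-10, stem 2 = 5-6/15-16
toy_pairs = [(1, 10), (2, 9), (3, 8), (5, 16), (6, 15)]
levels = assign_levels(toy_pairs)
print(to_dbn(16, levels))      # all levels, pseudoknot shown with []
print(to_dbn(16, levels[:1]))  # first level only, i.e. a nested structure
```

With this heuristic, which stem ends up as the "pseudoknotted" one depends on the processing order; the methods surveyed by Smit et al. make that choice in more principled ways.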



Get hydrogen bonds with DSSR

H-bonding interactions are crucial for defining RNA secondary and tertiary structures. DSSR/3DNA contains a geometrically based algorithm for identifying H-bonds in nucleic-acid or protein structures given in .pdb or .cif format. Over the years, the method has been continuously refined, and it has served its purpose quite well. As of v1.1.1-2014apr11, this functionality is directly available from DSSR through the --get-hbonds option.

The output for 1msy, which contains a GUAA tetraloop mutant of the Sarcin/Ricin domain from E. coli 23S rRNA, is listed below. The first line gives the header (# H-bonds in '1msy.pdb' identified by DSSR ...). The second line provides the total number of H-bonds (40) identified in the structure. Afterwards, each line consists of 8 space-delimited columns used to characterize a specific H-bond. Using the first one (#1) as an example, the meaning of each of the 8 columns is:

  1. The serial number (15), as denoted in the .pdb or .cif file, of the first atom of the H-bond.
  2. The serial number (578) of the second H-bond atom.
  3. The H-bond index (#1), from 1 to the total number of H-bonds.
  4. A one-letter symbol showing the atom-pair type (p) of the H-bond. It is ‘p’ for a donor-acceptor atom pair; ‘o’ for a donor/acceptor (such as the 2′-hydroxyl oxygen) with any other atom; ‘x’ for a donor-donor or acceptor-acceptor pair (as in #17); ‘?’ if the donor/acceptor status is unknown for any H-bond atom.
  5. Distance in Å between donor/acceptor atoms (2.768).
  6. Elemental symbols of the two atoms involved in the H-bond (O:N).
  7. Identifier of the first H-bond atom (O4@A.U2647).
  8. Identifier of the second H-bond atom (N1@A.G2673).
Command: x3dna-dssr -i=1msy.pdb --get-hbonds -o=1msy-hbonds.txt

# H-bonds in '1msy.pdb' identified by 3DNA version 3 (
   15   578  #1     p    2.768 O:N O4@A.U2647 N1@A.G2673
   35   555  #2     p    2.776 O:N O6@A.G2648 N3@A.U2672
   36   554  #3     p    2.826 N:O N1@A.G2648 O2@A.U2672
   55   537  #4     p    2.965 O:N O2@A.C2649 N2@A.G2671
   56   535  #5     p    2.836 N:N N3@A.C2649 N1@A.G2671
   58   534  #6     p    2.769 N:O N4@A.C2649 O6@A.G2671
   76   513  #7     p    2.806 N:N N3@A.U2650 N1@A.A2670
   78   512  #8     p    3.129 O:N O4@A.U2650 N6@A.A2670
   95   492  #9     p    2.703 O:N O2@A.C2651 N2@A.G2669
   96   490  #10    p    2.853 N:N N3@A.C2651 N1@A.G2669
   98   489  #11    p    2.987 N:O N4@A.C2651 O6@A.G2669
  115   466  #12    p    2.817 O:N O2@A.C2652 N2@A.G2668
  116   464  #13    p    2.907 N:N N3@A.C2652 N1@A.G2668
  118   463  #14    p    2.897 N:O N4@A.C2652 O6@A.G2668
  123   151  #15    o    2.622 O:O OP2@A.U2653 O2'@A.A2654
  135   443  #16    p    2.898 O:N O2@A.U2653 N4@A.C2667
  147   192  #17    x    3.054 O:O O4'@A.A2654 O4'@A.U2656
  158   408  #18    p    2.960 N:O N6@A.A2654 OP2@A.C2666
  173   188  #19    o    2.923 O:O O2'@A.G2655 OP2@A.U2656
  173   378  #20    o    3.093 O:O O2'@A.G2655 O6@A.G2664
  173   379  #21    o    3.343 O:N O2'@A.G2655 N1@A.G2664
  181   386  #22    p    2.768 N:O N1@A.G2655 OP2@A.A2665
  183   203  #23    p    2.754 N:O N2@A.G2655 O4@A.U2656
  183   387  #24    p    2.887 N:O N2@A.G2655 O5'@A.A2665
  188   379  #25    p    3.044 O:N OP2@A.U2656 N1@A.G2664
  188   381  #26    p    2.944 O:N OP2@A.U2656 N2@A.G2664
  200   401  #27    p    3.122 O:N O2@A.U2656 N6@A.A2665
  201   398  #28    p    2.759 N:N N3@A.U2656 N7@A.A2665
  220   381  #29    p    3.035 N:N N7@A.A2657 N2@A.G2664
  223   371  #30    o    2.963 N:O N6@A.A2657 O2'@A.G2664
  223   382  #31    p    3.039 N:N N6@A.A2657 N3@A.G2664
  242   358  #32    p    2.821 O:N O2@A.C2658 N2@A.G2663
  243   356  #33    p    2.890 N:N N3@A.C2658 N1@A.G2663
  245   355  #34    p    2.887 N:O N4@A.C2658 O6@A.G2663
  258   305  #35    o    2.604 O:N O2'@A.G2659 N7@A.A2661
  258   308  #36    o    3.264 O:N O2'@A.G2659 N6@A.A2661
  268   315  #37    p    2.973 N:O N2@A.G2659 OP2@A.A2662
  268   327  #38    p    2.864 N:N N2@A.G2659 N7@A.A2662
  371   390  #39    o    2.751 O:O O2'@A.G2664 O4'@A.A2665
  550   566  #40    o    3.372 O:O O2'@A.U2672 O4'@A.G2673

In its default settings, DSSR detects 117 H-bonds for 1ehz (yeast phenylalanine tRNA), and 5,809 for 1jj2 (the H. marismortui large ribosomal subunit). Note that the program can identify H-bonds not only in RNA and DNA, but also in proteins, or their complexes. By default, however, DSSR only reports H-bonds within nucleic acids. As shown above, it is trivial to run DSSR with the --get-hbonds option to get all H-bonds in a given structure, and the plain text output is straightforward to work on.
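As an illustration of how easy the plain text output is to work with, here is a minimal Python sketch that parses a few of the 1msy lines listed above into dictionaries. The function name parse_hbond_line and the field names are hypothetical; the sketch assumes the 8-column layout described earlier and skips any line that does not fit it (such as the header).

```python
from collections import Counter

# Four lines copied from the 1msy listing above (#1, #15, #17, #21).
SAMPLE = """\
# H-bonds in '1msy.pdb' identified by DSSR ...
   15   578  #1     p    2.768 O:N O4@A.U2647 N1@A.G2673
  123   151  #15    o    2.622 O:O OP2@A.U2653 O2'@A.A2654
  147   192  #17    x    3.054 O:O O4'@A.A2654 O4'@A.U2656
  173   379  #21    o    3.343 O:N O2'@A.G2655 N1@A.G2664
"""


def parse_hbond_line(line):
    """Return a dict for one H-bond line, or None for non-data lines."""
    fields = line.split()
    if len(fields) != 8 or not fields[0].isdigit():
        return None  # header, total count, or blank line
    atom1, nt1 = fields[6].split("@")
    atom2, nt2 = fields[7].split("@")
    return {
        "serial1": int(fields[0]),
        "serial2": int(fields[1]),
        "index": int(fields[2].lstrip("#")),
        "type": fields[3],                      # p, o, x or ?
        "distance": float(fields[4]),           # in angstroms
        "elements": tuple(fields[5].split(":")),
        "atom1": atom1, "nt1": nt1,
        "atom2": atom2, "nt2": nt2,
    }


hbonds = [hb for hb in map(parse_hbond_line, SAMPLE.splitlines()) if hb]
by_type = Counter(hb["type"] for hb in hbonds)
short = [hb["index"] for hb in hbonds if hb["distance"] <= 3.0]
print(by_type)
print("H-bonds no longer than 3.0 angstroms:", short)
```

The 3.0 Å cutoff in the last step is an arbitrary value chosen for the example, not a DSSR default.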

While there exist dedicated tools for finding H-bonds, such as HBPLUS or HBexplore, DSSR may well be sufficient to fulfill most practical needs. If you notice any weird behaviors with this H-bond finding functionality, please let me know. I strive to address reported issues promptly, to the extent practical. At the very least, I should be able to explain why the program is working the way it does.



DSSR for Windows, one executable fits all

From the very first release up until recently, the DSSR distribution had included two executables for Windows: one version was compiled on MinGW/MSYS, and the other on Cygwin. The executables are supposed to be run under the corresponding shells of the two environments respectively.

Since DSSR is a simple self-contained command-line tool, the MinGW/MSYS version also works directly under the Command Prompt of native Windows. So Windows users had the following three options to use DSSR:

  • Download the MinGW/MSYS version to run it under the Command Prompt of native Windows. No need to install MinGW/MSYS.
  • Download the MinGW/MSYS version to run it under the MinGW/MSYS environment, which must be installed separately.
  • Download the Cygwin version to run it under the Cygwin environment, which must be installed separately.

Over time, I have observed some confusion among DSSR users as to which of the two executables to use on Windows. Luckily, I noticed by chance recently that the DSSR executable compiled under MinGW/MSYS runs just fine in the Cygwin shell. So as of v1.1.0-2014apr09, the DSSR distribution contains only one executable for Windows: compiled under MinGW/MSYS on 32-bit Windows XP, the same DSSR executable runs under the Command Prompt of native Windows, MinGW/MSYS, and Cygwin, on either a 32-bit or 64-bit Windows (XP, Vista, 7 or 8) machine.

One size fits all: I no longer need to provide two compiled versions of DSSR for Windows, and users have just one executable to download (no more room for confusion).



Draw DSSR-derived RNA secondary structures in ct format

In addition to VARNA, the draw program in the RNAstructure package from the Mathews Laboratory can also be used to depict DSSR-derived RNA secondary structures in connect table (.ct) format. The draw program produces images in PostScript (or svg) format, in different styles from those generated by VARNA. Given below are a couple of examples on how to connect DSSR with draw.

The secondary structure of the PDB entry 1msy in DSSR-derived .ct file is as below:

   27 DSSR-derived secondary structure in '1msy'
    1 U     0     2     0  2647
    2 G     1     3    26  2648
    3 C     2     4    25  2649
    4 U     3     5    24  2650
    5 C     4     6    23  2651
    6 C     5     7    22  2652
    7 U     6     8     0  2653
    8 A     7     9     0  2654
    9 G     8    10     0  2655
   10 U     9    11     0  2656
   11 A    10    12     0  2657
   12 C    11    13    17  2658
   13 G    12    14     0  2659
   14 U    13    15     0  2660
   15 A    14    16     0  2661
   16 A    15    17     0  2662
   17 G    16    18    12  2663
   18 G    17    19     0  2664
   19 A    18    20     0  2665
   20 C    19    21     0  2666
   21 C    20    22     0  2667
   22 G    21    23     6  2668
   23 G    22    24     5  2669
   24 A    23    25     4  2670
   25 G    24    26     3  2671
   26 U    25    27     2  2672
   27 G    26     0     0  2673

If the DSSR-derived .ct file for 1msy is named 1msy.ct, the following two draw commands will produce the secondary structure in PostScript (1msy.eps) and svg (1msy.svg), respectively.

draw 1msy.ct 1msy.eps
draw 1msy.ct 1msy.svg --svg -n 1

1msy [GUAA tetra loop] 2nd structure produced with the RNAstructure 'draw' program

The PDB entry 1ehz (yeast phenylalanine tRNA) has a pseudoknot, so the draw program will create a ‘circularized’ structure as shown below:

1ehz [yeast phenylalanine tRNA] 2nd structure produced with the RNAstructure 'draw' program

Note the following two caveats:



DSSR now has a user manual!

As of v1.0.3-2014mar09, DSSR has a decent user manual in PDF! Currently 45 pages long, the DSSR manual contains everything a typical user needs to know to get started using the program effectively. The contents of the manual are listed below.

Table of Contents

List of Figures


Download and installation

  Command-line help
  Default run on PDB entry 1msy – detailed explanations
    Summary section
    List of base pairs
    List of multiplets
    List of helices
    List of stems
    List of lone canonical pairs
    List of various loops
    List of single-stranded fragments
    Secondary structure in dot-bracket notation
    List of backbone torsion angles and suite names
  Default run on PDB entry 1ehz (tRNAPhe) – summary notes
    Brief summary
    Specific features
  Default run on PDB entry 1jj2 – four auto-checked motifs
    Kissing loops
    A-minor (types I and II) motifs
    Ribose zippers
    Kink turns
  The --more option
    Extra parameters for base pairs
    Extra parameters for helices/stems
  The --non-pair option
  The --u-turn option
  The --po4 option
  The --long-idstr option

Frequently asked questions
  How to cite DSSR?
  Does DSSR work for DNA?
  Does DSSR detect RNA tertiary interactions?

Revision history



With the User Manual available, I feel confident in claiming that DSSR is now mature, stable, and ready for real-world applications. While only time will tell, I have no doubt that DSSR will become an essential tool in RNA structural bioinformatics.



DSSR-derived secondary structure in .ct format

From early on, DSSR-derived nucleic acid secondary structures have been written in the compact dot-bracket notation (.dbn) with pseudo-knot information. To better connect DSSR to the 2D world, I recently looked into the connect (.ct) format, which was first introduced by Zuker’s mfold program. Over time, the .ct format has become one of the most commonly used RNA secondary structure formats, and it is more expressive than the .dbn format (see below).

As of v1.0, for each analyzed structure, DSSR produces two secondary structure files with default names dssr-2ndstrs.dbn and dssr-2ndstrs.ct, in .dbn and .ct formats, respectively. Using the 27-nucleotide (nt) RNA fragment 1msy as an example, the DSSR-derived secondary structures in .dbn and .ct formats are shown below:

1msy [GUAA tetra loop] in 3d and 2d representations

In dot-bracket notation (.dbn) [dssr-2ndstrs.dbn]
>1msy nts=27 DSSR-derived secondary structure

In connect format (.ct) [dssr-2ndstrs.ct]
   27 DSSR-derived secondary structure in '1msy'
    1 U     0     2     0  2647 # name=A.U2647
    2 G     1     3    26  2648 # name=A.G2648, pairedNt=A.U2672
    3 C     2     4    25  2649 # name=A.C2649, pairedNt=A.G2671
    4 U     3     5    24  2650 # name=A.U2650, pairedNt=A.A2670
    5 C     4     6    23  2651 # name=A.C2651, pairedNt=A.G2669
    6 C     5     7    22  2652 # name=A.C2652, pairedNt=A.G2668
    7 U     6     8     0  2653 # name=A.U2653
    8 A     7     9     0  2654 # name=A.A2654
    9 G     8    10     0  2655 # name=A.G2655
   10 U     9    11     0  2656 # name=A.U2656
   11 A    10    12     0  2657 # name=A.A2657
   12 C    11    13    17  2658 # name=A.C2658, pairedNt=A.G2663
   13 G    12    14     0  2659 # name=A.G2659
   14 U    13    15     0  2660 # name=A.U2660
   15 A    14    16     0  2661 # name=A.A2661
   16 A    15    17     0  2662 # name=A.A2662
   17 G    16    18    12  2663 # name=A.G2663, pairedNt=A.C2658
   18 G    17    19     0  2664 # name=A.G2664
   19 A    18    20     0  2665 # name=A.A2665
   20 C    19    21     0  2666 # name=A.C2666
   21 C    20    22     0  2667 # name=A.C2667
   22 G    21    23     6  2668 # name=A.G2668, pairedNt=A.C2652
   23 G    22    24     5  2669 # name=A.G2669, pairedNt=A.C2651
   24 A    23    25     4  2670 # name=A.A2670, pairedNt=A.U2650
   25 G    24    26     3  2671 # name=A.G2671, pairedNt=A.C2649
   26 U    25    27     2  2672 # name=A.U2672, pairedNt=A.G2648
   27 G    26     0     0  2673 # name=A.G2673

On the surface, the .ct format is very simple, and examining a sample file as shown above would give one a pretty good sense of what each column is about. While there exist many oversimplified descriptions of the .ct format on the web, the most detailed and accurate explanation is from the mfold manual:

The “ct” file (connect table) contains the sequence and base pair information, and is meant to be an input file for a structure drawing program. In addition to containing base pair information, it also lists the 5′ and 3′ neighbor of each base, allowing for the representation of circular RNA or multiple molecules. The ct file also lists the historical base numbering in the original sequence, as bases and base pairs are numbered according from 1 to the size of the folded segment. A portion of a ct file is displayed in Figure 12.

Figure 12: The ct file for the second and final folding of S. cerevisiae Phe-tRNA at 37°, with default parameters. The first record displays the fragment size (76), ΔG and sequence name. The ith subsequent record contains, in order, i, ri, the index of the 5′-connecting base, the index of the 3′-connecting base, the index of the paired base and the historical numbering of the ith base in the original sequence. The 5′, 3′ and base pair indices are 0 when there is no connection or base pair.

Specifically, the 3rd, 4th, and 6th columns in the .ct format convey specific information; by design, they are not redundant to information contained in the 1st column. Note that in the above ‘1msy’ example, the 6th column gives the nt sequence numbers (as in the PDB datafile) instead of the serial numbers (as in the 1st column). The DSSR produced .ct files also contain extra information after ‘#’, in the comma separated key=value format.

As an example of the usefulness of the 3rd and 4th columns, have a look at the DSSR-derived .ct file for the Dickerson DNA dodecamer duplex with sequence CGCGAATTCGCG:

   24 DSSR-derived secondary structure in '355d'
    1 C     0     2    24     1 # name=A.DC1, pairedNt=B.DG24
    2 G     1     3    23     2 # name=A.DG2, pairedNt=B.DC23
    3 C     2     4    22     3 # name=A.DC3, pairedNt=B.DG22
    4 G     3     5    21     4 # name=A.DG4, pairedNt=B.DC21
    5 A     4     6    20     5 # name=A.DA5, pairedNt=B.DT20
    6 A     5     7    19     6 # name=A.DA6, pairedNt=B.DT19
    7 T     6     8    18     7 # name=A.DT7, pairedNt=B.DA18
    8 T     7     9    17     8 # name=A.DT8, pairedNt=B.DA17
    9 C     8    10    16     9 # name=A.DC9, pairedNt=B.DG16
   10 G     9    11    15    10 # name=A.DG10, pairedNt=B.DC15
   11 C    10    12    14    11 # name=A.DC11, pairedNt=B.DG14
   12 G    11     0    13    12 # name=A.DG12, pairedNt=B.DC13
   13 C     0    14    12    13 # name=B.DC13, pairedNt=A.DG12
   14 G    13    15    11    14 # name=B.DG14, pairedNt=A.DC11
   15 C    14    16    10    15 # name=B.DC15, pairedNt=A.DG10
   16 G    15    17     9    16 # name=B.DG16, pairedNt=A.DC9
   17 A    16    18     8    17 # name=B.DA17, pairedNt=A.DT8
   18 A    17    19     7    18 # name=B.DA18, pairedNt=A.DT7
   19 T    18    20     6    19 # name=B.DT19, pairedNt=A.DA6
   20 T    19    21     5    20 # name=B.DT20, pairedNt=A.DA5
   21 C    20    22     4    21 # name=B.DC21, pairedNt=A.DG4
   22 G    21    23     3    22 # name=B.DG22, pairedNt=A.DC3
   23 C    22    24     2    23 # name=B.DC23, pairedNt=A.DG2
   24 G    23     0     1    24 # name=B.DG24, pairedNt=A.DC1

Note the 0 in the 4th column for A.DG12, which is at the 3′ end of chain A, and the 0 in the 3rd column for B.DC13, which is at the 5′ end of chain B.
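The role of the 3rd and 4th columns can be shown with a short Python sketch that reads the 355d .ct file listed above and splits it into strands wherever the 3′ neighbor is 0. The functions parse_ct and split_strands are hypothetical names, and the sketch assumes the six-column layout (plus optional # comment) that DSSR writes.

```python
# Hypothetical .ct reader; the sample is the DSSR-derived 355d file shown above.
SAMPLE = """\
   24 DSSR-derived secondary structure in '355d'
    1 C     0     2    24     1 # name=A.DC1, pairedNt=B.DG24
    2 G     1     3    23     2 # name=A.DG2, pairedNt=B.DC23
    3 C     2     4    22     3 # name=A.DC3, pairedNt=B.DG22
    4 G     3     5    21     4 # name=A.DG4, pairedNt=B.DC21
    5 A     4     6    20     5 # name=A.DA5, pairedNt=B.DT20
    6 A     5     7    19     6 # name=A.DA6, pairedNt=B.DT19
    7 T     6     8    18     7 # name=A.DT7, pairedNt=B.DA18
    8 T     7     9    17     8 # name=A.DT8, pairedNt=B.DA17
    9 C     8    10    16     9 # name=A.DC9, pairedNt=B.DG16
   10 G     9    11    15    10 # name=A.DG10, pairedNt=B.DC15
   11 C    10    12    14    11 # name=A.DC11, pairedNt=B.DG14
   12 G    11     0    13    12 # name=A.DG12, pairedNt=B.DC13
   13 C     0    14    12    13 # name=B.DC13, pairedNt=A.DG12
   14 G    13    15    11    14 # name=B.DG14, pairedNt=A.DC11
   15 C    14    16    10    15 # name=B.DC15, pairedNt=A.DG10
   16 G    15    17     9    16 # name=B.DG16, pairedNt=A.DC9
   17 A    16    18     8    17 # name=B.DA17, pairedNt=A.DT8
   18 A    17    19     7    18 # name=B.DA18, pairedNt=A.DT7
   19 T    18    20     6    19 # name=B.DT19, pairedNt=A.DA6
   20 T    19    21     5    20 # name=B.DT20, pairedNt=A.DA5
   21 C    20    22     4    21 # name=B.DC21, pairedNt=A.DG4
   22 G    21    23     3    22 # name=B.DG22, pairedNt=A.DC3
   23 C    22    24     2    23 # name=B.DC23, pairedNt=A.DG2
   24 G    23     0     1    24 # name=B.DG24, pairedNt=A.DC1
"""


def parse_ct(text):
    """Read the header count and that many records from a .ct file."""
    lines = [line for line in text.splitlines() if line.strip()]
    count = int(lines[0].split()[0])
    records = []
    for line in lines[1:count + 1]:
        f = line.split("#", 1)[0].split()  # drop the trailing comment
        records.append({"index": int(f[0]), "base": f[1], "prev": int(f[2]),
                        "next": int(f[3]), "pair": int(f[4]), "number": int(f[5])})
    return records


def split_strands(records):
    """Cut the record list after every base whose 3' neighbor (4th column) is 0."""
    strands, current = [], []
    for rec in records:
        current.append(rec)
        if rec["next"] == 0:
            strands.append(current)
            current = []
    if current:
        strands.append(current)
    return strands


strands = split_strands(parse_ct(SAMPLE))
sequences = ["".join(rec["base"] for rec in strand) for strand in strands]
print(sequences)
```

Without the neighbor columns, the 24 records would look like a single continuous strand; the zeros are what mark the break between chains A and B.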



Single- and double-stranded Zp

From early on, 3DNA calculates the Zp parameter to separate A- and B-DNA double helical steps. First introduced in the paper A-form conformational motifs in ligand-bound DNA structures (see figure below), Zp is the mean projection of the two phosphorus atoms onto the z-axis of the dimer ‘middle frame’. Zp is greater than 1.5 Å for A-DNA, and it is less than 0.5 Å for B-DNA. As noted in the 3DNA NAR paper, other parameters such as slide should also be examined to confirm conformational assignments based on Zp.

definition of the Zp parameter for duplex DNA
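As a rough illustration of the two cutoffs quoted above, here is a minimal Python sketch. It assumes the z-values of the two phosphorus atoms have already been expressed in the middle frame in a form that can be averaged directly. The label "intermediate" for values between the cutoffs and the function names are my own placeholders, not 3DNA terminology.

```python
def mean_zp(z_p1, z_p2):
    """Average the z-values of the two phosphorus atoms (middle frame assumed)."""
    return (z_p1 + z_p2) / 2.0


def classify_step_by_zp(zp, a_min=1.5, b_max=0.5):
    """Apply the cutoffs from the text: Zp > 1.5 A is A-like, Zp < 0.5 A is B-like."""
    if zp > a_min:
        return "A"
    if zp < b_max:
        return "B"
    return "intermediate"  # placeholder label; the text gives no name for this range


print(classify_step_by_zp(mean_zp(2.0, 2.4)))
print(classify_step_by_zp(-0.4))
```

A Zp-based label is only a first pass. As noted above, other parameters such as slide should be checked before assigning a conformation.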

As of v2.1, 3DNA has introduced the single-stranded variant for the Zp parameter (ssZp) as a more robust substitute for the Richardson phosphorus-glycosidic bond distance parameter (Dp) to characterize sugar puckers. See post Sugar pucker correlates with phosphorus-base distance for more details. In 3DNA/DSSR, ssZp is defined as the z-coordinate of the 3′ phosphorus atom expressed in the standard reference frame of the preceding base; it is positive when phosphorus lies on the +z-axis side (base in anti conformation) and negative if phosphorus is on the –z-axis side (base in syn conformation). Note that by definition, Dp should always be positive.

As in the previous post, here I am using G175 and U176 of PDB entry 1jj2 (the large ribosomal subunit of Haloarcula marismortui) as examples to illustrate how the ssZp parameters are calculated. The GpU forms a dinucleotide platform, where the sugar of G175 adopts a C2′-endo conformation, and that of U176 C3′-endo. For verification, here is the PDB data file for fragment 1jj2-G175-U176-A177.pdb (note A177 is included for its phosphorus atom). Run the following 3DNA commands:

find_pair -s 1jj2-G175-U176-A177.pdb stdout
frame_mol -1 ref_frames.dat 1jj2-G175-U176-A177.pdb ref-G175.pdb
frame_mol -2 ref_frames.dat 1jj2-G175-U176-A177.pdb ref-U176.pdb

File ref-G175.pdb contains the following line:

ATOM     24  P     U 0 176      -5.624   6.937   1.918  1.00 24.19           P 

The z-coordinate of the phosphorus atom of U176 (which is 3′ to G175) is 1.918 Å, which is the ssZp for G175. It is less than 2.9 Å, corresponding to the C2′-endo sugar conformation of G175.

Similarly, file ref-U176.pdb contains the following line:

ATOM     44  P     A 0 177      -3.841   6.592   4.377  1.00 25.91           P

So the ssZp for U176 is 4.377, which is greater than 2.9 Å, corresponding to the C3′-endo sugar conformation of U176.
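The same calculation can be sketched in a few lines of Python. The sketch assumes the base reference frame is given as an origin plus a unit z-axis vector. For the two files produced by frame_mol above, the coordinates are already expressed in the base frame, so the origin is (0, 0, 0) and the z-axis is (0, 0, 1). The function names are hypothetical, and a value exactly at the 2.9 Å cutoff is arbitrarily assigned to C2′-endo.

```python
def sszp(p_xyz, origin, z_axis):
    """Project the vector from the frame origin to P onto the unit z-axis."""
    return sum((p - o) * z for p, o, z in zip(p_xyz, origin, z_axis))


def pucker_from_sszp(value, cutoff=2.9):
    """Map ssZp to a sugar pucker using the 2.9 A cutoff quoted in the text."""
    return "C3'-endo" if value > cutoff else "C2'-endo"


frame_origin = (0.0, 0.0, 0.0)
frame_z = (0.0, 0.0, 1.0)

p_u176 = (-5.624, 6.937, 1.918)  # P of U176 in the frame of G175
p_a177 = (-3.841, 6.592, 4.377)  # P of A177 in the frame of U176

for label, p in (("G175", p_u176), ("U176", p_a177)):
    value = sszp(p, frame_origin, frame_z)
    print(label, value, pucker_from_sszp(value))
```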

To sum up, the double-stranded Zp as originally available from 3DNA can be used for discriminating A- and B-DNA double-helical steps: Zp > 1.5 Å for A-DNA, and Zp < 0.5 Å for B-DNA. The newly introduced single-stranded Zp (ssZp) is intended for characterizing sugar puckers: ssZp > 2.9 Å for C3′-endo, and ssZp < 2.9 Å for C2′-endo. Since A-DNA has predominantly C3′-endo sugar conformation and B-DNA has C2′-endo sugar, the ssZp parameter would be helpful in classifying a dinucleotide into A- or B-like conformation. A survey of ssZp in well-defined A- and B-DNA structures (as performed for double-stranded Zp) should prove useful.

Realizing the potential confusion between double-stranded Zp and single-stranded Zp, I am considering renaming single-stranded Zp as ssZp in future releases of 3DNA and DSSR. Do you have any comments or suggestions? Please let me know by leaving a comment!



Weird cases of nucleotides with missing atoms

Recently I was surprised by some cases of nucleotides with missing atoms in PDB entry 1pns. The story started like this: 3DNA/DSSR maps various nucleotide names to one-letter codes, based on the data file baselist.dat (see post Modified nucleotides in the PDB). In the meantime, 3DNA/DSSR internally assigns a nucleotide as either purine or pyrimidine, by virtue of coordinates of base atoms. By definition, purines should only include A/a/G/g/I/i, and pyrimidines C/c/T/t/U/u/P/p. However, no consistency check had been implemented in DSSR until just now.

I first noticed the inconsistency between residue name and atom coordinates for nucleotide A6 on chain U (hereafter referred to as U.A6) in 1pns. The nucleotide has standard name ‘  A’, obviously a purine. However, somehow DSSR classified it as a pyrimidine based on atomic coordinates. Upon further check of the PDB data file, I found the following remarks:

REMARK 470 MISSING ATOM                                                         
REMARK 470 I=INSERTION CODE):                                                   
REMARK 470   M RES CSSEQI  ATOMS                                                
REMARK 470       A U   6    N9   C8   N7                                        
REMARK 470       G U   8    N9   C8   N7                                        
REMARK 470       A U  12    N9   C8   N7                                        
REMARK 470       A U  13    N9   C8   N7                                        
REMARK 470       A U  14    N9   C8   N7                                        

The atomic coordinates for U.A6 are as below:

ATOM  34447  P     A U   6      81.861  37.210  78.651  1.00378.87           P  
ATOM  34448  OP1   A U   6      80.631  37.121  77.831  1.00378.87           O  
ATOM  34449  OP2   A U   6      81.665  37.221  80.119  1.00378.87           O  
ATOM  34450  O5'   A U   6      82.707  38.495  78.212  1.00378.87           O  
ATOM  34451  C5'   A U   6      83.948  38.777  78.887  1.00378.87           C  
ATOM  34452  C4'   A U   6      84.600  40.000  78.276  1.00378.87           C  
ATOM  34453  O4'   A U   6      84.975  39.698  76.901  1.00378.87           O  
ATOM  34454  C3'   A U   6      83.714  41.239  78.153  1.00378.87           C  
ATOM  34455  O3'   A U   6      83.654  41.968  79.369  1.00378.87           O  
ATOM  34456  C2'   A U   6      84.403  42.015  77.020  1.00378.87           C  
ATOM  34457  O2'   A U   6      85.564  42.655  77.474  1.00378.87           O  
ATOM  34458  C1'   A U   6      84.834  40.864  76.105  1.00378.87           C  
ATOM  34459  C5    A U   6      82.033  39.296  74.209  1.00378.87           C  
ATOM  34460  C6    A U   6      82.941  39.553  75.166  1.00378.87           C  
ATOM  34461  N6    A U   6      81.170  39.949  72.090  1.00378.87           N  
ATOM  34462  N1    A U   6      83.830  40.588  75.041  1.00378.87           N  
ATOM  34463  C2    A U   6      83.843  41.410  73.939  1.00378.87           C  
ATOM  34464  N3    A U   6      82.899  41.124  72.974  1.00378.87           N  
ATOM  34465  C4    A U   6      81.968  40.108  73.016  1.00378.87           C  

No atom records for N7, C8 and N9. So far, so good. However, surprise came when I visualized U.A6 in Jmol, as shown in the following image. Note here atom N1 is connected to C1’ as in pyrimidines, and N6 is bonded to C4!

Weird U.A6 with missing atoms (1pns)

The same issue also exists for U.G8 (see figure below), U.A12, U.A13, and U.A14.

Weird U.G8 with missing atoms (1pns)

It is beyond my imagination to understand why such weird cases exist in the PDB, even given the lousy resolution (8.7 Å) of 1pns.
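For readers who want to screen their own files for similar cases, here is a minimal Python sketch of a name-versus-atoms consistency check. It is a simplified stand-in: DSSR classifies a base from atomic coordinates, whereas this sketch only tests for the presence of the atom names N7, C8 and N9. The one-letter code sets are taken from the text above; everything else (function names, the atom-name test) is an assumption made for illustration.

```python
PURINE_CODES = set("AaGgIi")
PYRIMIDINE_CODES = set("CcTtUuPp")
PURINE_ONLY_ATOMS = {"N7", "C8", "N9"}  # atoms of the five-membered ring


def expected_class(code):
    """Class implied by the one-letter code."""
    if code in PURINE_CODES:
        return "purine"
    if code in PYRIMIDINE_CODES:
        return "pyrimidine"
    raise ValueError(f"unknown one-letter code: {code!r}")


def class_from_atoms(atom_names):
    """Simplified test: a purine must have all of N7, C8 and N9."""
    return "purine" if PURINE_ONLY_ATOMS <= set(atom_names) else "pyrimidine"


def is_consistent(code, atom_names):
    return expected_class(code) == class_from_atoms(atom_names)


# Base atoms actually present for U.A6 in 1pns (from the listing above)
u_a6_atoms = ["C5", "C6", "N6", "N1", "C2", "N3", "C4"]
print(is_consistent("A", u_a6_atoms))
```

The check reports False for U.A6, matching the inconsistency described above: the residue is named as a purine but lacks the atoms that would make it one.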



Three citations to 3DNA in the November 2013-41(21) issue of NAR

While browsing through the November 2013-41(21) issue of NAR, I was pleased to find the following three citations to 3DNA, all under the Section of ‘Structural Biology’.

Such citations illustrate the prominent status of 3DNA for DNA structural analysis. I firmly believe that DSSR will make 3DNA a top player for RNA structural analysis in the not-too-distant future.



3DNA/DSSR runs just fine under Mac OS X Mavericks

I recently upgraded my Macs to OS X Mavericks to check if 3DNA/DSSR works in the new operating system. I am glad to report that both run without a hitch, as expected.

Since OS X Mavericks is free from the Mac App Store, it will quickly become the de facto version virtually all Mac users would use. I also noticed that Ruby on Mavericks has been upgraded to ruby 2.0.0p247 (2013-06-27 revision 41674), a major step forward from the now retiring Ruby 1.8.7 distributed in previous versions of Mac OS X.

As a rule, I’d ensure that 3DNA/DSSR executes properly in major releases of the commonly used operating systems — Mac, Windows, and Linux.



DSSR works perfectly under DOS (in native Windows)

Although I have not used DOS for ages, I am glad to find that the DSSR version compiled for MinGW/MSYS on Windows works perfectly under this operating system (see screenshot below). The DSSR DOS command-line interface functions exactly the same as for Linux, Mac OS X, MinGW/MSYS, and CygWin. Among other possible usages, it allows for batch files to take advantage of DSSR.

Screenshot of a DSSR run in DOS

Implementing DSSR in strict ANSI C as a self-contained and zero-dependent command-line program pays off enormously: it simplifies code maintenance and ensures that the program is applicable wherever a C compiler exists. The easy web interface to DSSR makes the program universally accessible.



DSSR command-line processing

Aside from its extensive functionality for RNA structural analyses, DSSR also introduces a consistent and flexible way to process command-line options. Here, each option can be specified via a --key[=value] pair (or -key[=value] or key[=value]; i.e., two/one/zero preceding dashes are all accepted), key can be in either lower, UPPER or MiXed case, and value is optional for Boolean switches. Furthermore, options can be put in any order; if the same key is repeated more than once, the value specified last overwrites corresponding previous settings.

As always, the rules are best illustrated with concrete examples. Some typical use-cases are given below:

#1 analyze PDB entry '1msy', with default output to stdout
x3dna-dssr --input=1msy.pdb

#2 same as #1, with output directed to file '1msy.out'
x3dna-dssr --input=1msy.pdb --output=1msy.out

#3-6, same as #2
x3dna-dssr --output=1msy.out --input=1msy.pdb
x3dna-dssr --OUTPUT=1msy.out --Input=1msy.pdb
x3dna-dssr -output=1msy.out input=1msy.pdb
x3dna-dssr output=1msy.out --input=1msy.pdb

#7 the value '1ehz.pdb' overwrites '1msy.pdb'
x3dna-dssr --input=1msy.pdb input=1ehz.pdb

#8-12 with the switch --more set to true
x3dna-dssr -input=1msy.pdb --more
x3dna-dssr -input=1msy.pdb --more=true
x3dna-dssr -input=1msy.pdb --more=yes
x3dna-dssr -input=1msy.pdb --more=on
x3dna-dssr -input=1msy.pdb --more=1

#13 same as without specifying --more,
#      or with values set to false/no/0
x3dna-dssr -input=1msy.pdb --more=off

#14 shorthand forms for --input and --output
x3dna-dssr -i=1msy.pdb -o=1msy.out

#15 it can also be more verbose
x3dna-dssr --input-pdb-file=1msy.pdb

#16-18 within a key, separator dash(-) and underscore (_)
#      are treated the same, and can be omitted
x3dna-dssr -i=1msy.pdb -non-pair
x3dna-dssr -i=1msy.pdb -non_pair
x3dna-dssr -i=1msy.pdb -nonpair

By allowing for 2/1/0 dashes to precede each key and a dash/underscore character or none to separate words within the key, DSSR provides users with great flexibility in specifying command-line options to fit into their preferred styles. Not surprisingly, new programs to be added into 3DNA, or the version 3 release of the software will all follow the same convention.
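To make the rules concrete, here is a minimal Python sketch of an option parser that follows them. It is not DSSR's ANSI C implementation, and the function names are hypothetical. It covers only the rules stated above: dashes and underscores in a key are ignored, keys are case-insensitive, a bare key is a true switch, and a repeated key takes the last value. It does not attempt to resolve the shorthand (-i, -o) or verbose (--input-pdb-file) forms, because how DSSR matches them is not described here.

```python
TRUE_WORDS = {"true", "yes", "on", "1"}
FALSE_WORDS = {"false", "no", "off", "0"}


def normalize_key(raw_key):
    """Drop leading dashes and inner '-'/'_' separators, and ignore case."""
    return raw_key.replace("-", "").replace("_", "").lower()


def parse_options(args):
    """Turn a list of key[=value] arguments into a dict; the last value wins."""
    options = {}
    for arg in args:
        raw_key, has_value, value = arg.partition("=")
        key = normalize_key(raw_key)
        if not has_value:
            options[key] = True  # bare switch, e.g. --more
        elif value.lower() in TRUE_WORDS:
            options[key] = True
        elif value.lower() in FALSE_WORDS:
            options[key] = False
        else:
            options[key] = value  # keep file names etc. untouched
    return options


print(parse_options(["--OUTPUT=1msy.out", "--Input=1msy.pdb"]))
print(parse_options(["--input=1msy.pdb", "input=1ehz.pdb"]))
print(parse_options(["-input=1msy.pdb", "--more=off", "-non_pair"]))
```

One limitation of this sketch is that a value such as "on" or "1" is always read as a Boolean, so it would misread a file literally named "on".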



Modified nucleotides in the PDB

In addition to the five canonical bases (A, C, G, T, and U), nucleic acid structures in the PDB contain numerous modified variants (natural or engineered) in the nucleobase, sugar, or the phosphate. For instance, the 76-nt (nucleotide) long yeast phenylalanine tRNA (1ehz) contains 14 modified bases: 2MG10, H2U16, H2U17, M2G26, OMC32, OMG34, YYG37, PSU39, 5MC40, 7MG46, 5MC49, 5MU54, PSU55, and 1MA58. Among these, the most prevalent and best-known example is pseudouridine. Note that in the PDB, each residue (including modified nt) is named with an up to three-letter identifier, e.g., PSU for pseudouridine. For a comprehensive list (with chemical and structural information) of small molecules, including modified nts, please refer to the Ligand Expo website hosted by the RCSB PDB.

Given the widespread occurrences of modified bases in nucleic acid structures, any practical structural bioinformatics software should be able to treat them effectively, as with the canonical bases. In 3DNA, from the very beginning, modified bases are mapped to standard counterparts, e.g. 5‐iodouracil (5IU) to uracil (U) and 1‐methyladenine (1MA) to adenine (A), allowing for easy analysis of unusual DNA and RNA structures (see the NAR03 reference). Specifically, in the 3DNA distribution the file baselist.dat contains the mappings explicitly.

As of v2.1, 3DNA automatically maps a new modified base not available in the file baselist.dat. Yet, I have continuously updated the list in line with new DNA/RNA entries released by the PDB. The process is automated with a Ruby script which calls find_pair -s on each nucleic-acid-containing structure to output unknown bases. As an extreme, the baselist.dat file below comprises only canonical bases:

  A   A
  C   C
  G   G
  T   T
  U   U
 DA   A
 DC   C
 DG   G
 DT   T

With the above minimum mapping list, running the command find_pair -s on 1ehz.pdb identifies all the 14 modified bases. A sample case for 2MG is shown below:

Match '2MG' to 'g' for residue 2MG   10  on chain A [#10]
    check it & consider to add line '2MG     g' to file <baselist.dat>

By parsing the output of a batch run on all DNA/RNA-containing entries in the PDB as of October 18, 2013, I identified a total of 596 modified bases. The top portion is as below:

02I     a
08Q     c
08T     a
0AD     g
 0C     c
0DC     c
0DG     g
0DT     t
 0G     g
0KL     u
0KX     c
0KZ     t

An explicit list of base mapping makes the correspondence transparent, and helps avoid ambiguous cases as to which canonical base a modified nt matches to. DSSR uses the same list internally. Hopefully, the information would also be useful to other related projects.
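For other projects that want to reuse such a list, here is a minimal Python sketch that reads the two-column mapping shown above and collects suggestions from the find_pair -s message format shown above. The batch processing described in this post was done with a Ruby script, so this is not that script. The function names are hypothetical, and the reading of lower-case codes as "modified" is inferred from the examples in this post.

```python
import re

# Matches messages such as: Match '2MG' to 'g' for residue 2MG   10  on chain A [#10]
MATCH_LINE = re.compile(r"Match '(\S+)' to '(\w)'")


def parse_baselist(text):
    """Read 'residue-name one-letter-code' pairs, one per line."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            mapping[fields[0]] = fields[1]
    return mapping


def suggested_mappings(find_pair_output):
    """Collect name-to-code suggestions from find_pair -s messages."""
    return dict(MATCH_LINE.findall(find_pair_output))


def is_modified(code):
    """Assumption: a lower-case one-letter code marks a modified nucleotide."""
    return code.islower()


baselist = parse_baselist("  A   A\n DA   A\n02I     a\n 0C     c\n")
message = "Match '2MG' to 'g' for residue 2MG   10  on chain A [#10]"
print(baselist)
print(suggested_mappings(message))
```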



Different names for the methyl group in DNA and RNA structures

Recently I was a bit surprised to find that the methyl group is named differently in the PDB: C7 in DT8 (thymine) of B-DNA 355d, CM5 in 5MC40 (5-methylated C) of tRNA 1ehz, and C5M in 5MU54 (5-methylated U, i.e., T) of the same tRNA 1ehz. See the three figures below for details.

I know that the previously named C5M of thymine in DNA has been renamed C7 as a result of the 2007 remediation effort (PDB v3). However, browsing through the wwPDB Remediation website and reading carefully the article Remediation of the protein data bank archive, I failed to see explanations of the obvious inconsistency of CM5 (5MC40) vs C5M (5MU54) in the nomenclature of the 5-methyl group in the same tRNA entry 1ehz, except for the following note:

As with the Chemical Component Dictionary, names for standard amino acids and nucleotides follow IUPAC recommendations (10) with the exception of the well-established convention for C-terminal atoms OXT and HXT. These nomenclature changes have been applied to standard polymeric chemical components only.

5-methyl is named C7 in DT8 of the DNA entry 355d

5-methyl in DT8 is named C7 in DNA (355d)

5-methyl is named CM5 in 5MC40 of the RNA entry 1ehz

5-methyl in 5MC40 is named CM5 in RNA (1ehz)

5-methyl is named C5M in 5MU54 of the RNA entry 1ehz

5-methyl in 5MU54 named C5M in RNA (1ehz)

Am I missing something obvious? If you have any further information, please leave a comment. Whatever the case, it helps (at least won’t hurt) to know the naming discrepancy for those who care about the small methyl group in nucleic acid structures.



Citations to the 3DNA homepage

Recently I came across the following two direct citations to the 3DNA homepage, 3DNA: Suite of software programs for the analysis, rebuilding, and visualization of three-dimensional nucleic acid structures.

  • The review article Molecular Modeling of Nucleic Acid Structure by Galindo-Murillo et al. in Current Protocols in Nucleic Acid Chemistry in the section of “Model Building and Analysis Tools and Nucleic Acid Nomenclature” under INTERNET RESOURCES:
The 3DNA program for calculating helicoidal parameters in a consistent manner using a local helical axis definition.

As time goes by, I have every reason to believe that the website will become more noticeable in the literature. If you notice other such citations, please leave a comment.



Compiling ViennaRNA on Mac OS X

Recently, I upgraded my local ViennaRNA package installation from v2.0.7 to v2.1.3 on my Mac. Following Quickstart in the INSTALL file, I ran ./configure successfully, but make aborted with error messages. Since I previously had a working copy of the software, the failure had to be a configuration issue with this new version. After a few iterations of checking the error messages and reading through the INSTALL file, I came up with the following settings:

./configure --disable-openmp --without-perl
sudo make install

Apart from some warning messages, the above make command ran successfully.

This post serves mainly as a note for my own reference. Hopefully, the information may prove useful to others who try to install the versatile ViennaRNA package on a Mac OS X machine.



Web-interface to DSSR

I’ve come up with a preliminary web-interface to DSSR, currently accessible at URL The DSSR web-interface has been tested on Safari, Firefox, Chrome, and IE, with satisfying results. A screenshot of the home page is given below, using 1msy as an example:

Screenshot of the Web-DSSR homepage

After clicking the Submit button, users will be presented with the result page of a DSSR run. The beginning portion of the above example is as follows:

Screenshot of a DSSR-run

Note that the DSSR web-interface is being provided via a shared web hosting service, thus it has limited resources. Specifically, the uploaded file cannot be larger than two megabytes (2MB), and the process could be slow. Additionally, the file must have an extension of .pdb or .cif. To take full advantage of what DSSR has to offer, please install and run the software locally.
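Before uploading, the two stated limits can be checked locally. The Python sketch below is only an illustration. It assumes that "2MB" means 2 × 1024 × 1024 bytes and that the extension test is case-insensitive, neither of which is specified above. The function name is hypothetical.

```python
import os

MAX_BYTES = 2 * 1024 * 1024  # assumption: "2MB" read as 2 MiB
ALLOWED_EXTENSIONS = {".pdb", ".cif"}


def upload_problems(path):
    """Return a list of reasons the file would likely be rejected (empty if none)."""
    problems = []
    extension = os.path.splitext(path)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"extension {extension!r} is not .pdb or .cif")
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than two megabytes")
    return problems
```

For example, upload_problems("1msy.pdb") returns an empty list when the file exists and is small enough.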

By design, DSSR is self-contained, command-line driven, with zero dependence on third-party libraries. Such features make it straightforward to build a GUI- or web-interface to DSSR, or integrate the program into other structural bioinformatics tools. As the need arises, I will refine the DSSR web-interface to better serve the community. The current simple, yet exploratory, web interface should make DSSR accessible to a much wider audience.



UNR- and GNRA-type U-turns

As of beta-r20-on-20130830, DSSR is able to detect two types of U-turns (see the figure below), the UNR-type (left) originally identified by Quigley and Rich [1976] in yeast phenylalanine tRNA, and the GNRA-type (right) later established by Jucker and Pardi [1995] in GNRA tetraloops. See the Gutell et al. paper Predicting U-turns in Ribosomal RNA with Comparative Sequence Analysis for a more extensive account of U-turns.

As its name implies, a U-turn is characterized by a reversal of the RNA backbone direction within a few nucleotides. Among other factors, the U-turn is stabilized by two key H-bonding interactions, illustrated in dotted lines in the figure below.

UNR-type U-turn in yeast phenylalanine tRNA (1ehz) GNRA-type U-turn in a GUAA tetraloop (1msy)
UNR-type (1ehz) GNRA-type (1msy)

Applying DSSR to 1jj2 (the crystal structure of the Haloarcula marismortui large ribosomal subunit) led to the identification of over 30 cases. In addition to the well-documented UNR- and GNRA-type U-turns, the program also finds other variants. An example is shown below, where the U-turn is within a GCA triloop instead of a GNRA tetraloop. Here, the N1 (not N2) atom of G1809 forms an H-bond with OP2 of G1812. The G1809 N2 atom is H-bonded to G1812 O5′ to further stabilize the U-turn.

U-turn formed by the GCA-triloop (1jj2)

An examination of the chemical structure of the nitrogenous bases (see figure below) shows clearly other possibilities to connect RNA base donors to the phosphate oxygen acceptors. DSSR allows for the exploration of such variations, and more.

Chemical diagram of RNA bases



Restraint optimization of DNA backbone geometry using PHENIX

3DNA can build DNA/RNA structures with a precise base but approximate sugar-phosphate backbone geometry. In the 2003 3DNA-NAR paper, Table 3 of the section “Structures built with sugar–phosphate backbone” lists “root mean square deviation (in Å) between rebuilt 3DNA models and experimental DNA structures” for three representative DNA structures (in A-form, B-form, and a protein-DNA complex). It was noted that “The RMSD of reconstructed versus observed base positions is virtually zero and that for both base and backbone coordinates is <0.85 Å, even for the 146 bp nucleosomal DNA structure.”

The backbone geometry is approximate because 3DNA uses a fixed sugar-phosphate conformation (in A-DNA, B-DNA or RNA) that is attached to the corresponding bases in the model building process. The most noticeable effect is the long O3′(i)···P(i+1) bond that connects consecutive nucleotides along a chain. The imprecise structure was intended as a starting point for other objectives (e.g., all-atom molecular dynamics simulations) that are out of the design scope of 3DNA. Nevertheless, over the years, I have been concerned with the overlong O3′—P distance issue. I tried but failed to find a satisfying third-party (command-line driven) tool that can perform restraint optimization of the sugar-phosphate backbone geometry while keeping base atoms fixed.
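A quick way to see which linkages are affected is to measure the O3′(i)···P(i+1) distance along each chain. The Python sketch below does this for a list of residues. The ideal O3′—P bond length is about 1.6 Å. The 2.0 Å cutoff, the function name, and the coordinates in the example are all assumptions made for illustration; they are not taken from 3DNA or from 355d.

```python
import math


def overlong_links(residues, cutoff=2.0):
    """Flag consecutive residues whose O3'(i)...P(i+1) distance exceeds cutoff.

    residues: list of (label, o3_xyz, p_xyz) in 5'-to-3' chain order;
    a coordinate may be None when the atom is absent (e.g. P at the 5' end).
    """
    flagged = []
    for (label_i, o3, _), (label_j, _, p) in zip(residues, residues[1:]):
        if o3 is None or p is None:
            continue
        distance = math.dist(o3, p)
        if distance > cutoff:
            flagged.append((label_i, label_j, round(distance, 3)))
    return flagged


# Made-up coordinates, for illustration only
chain = [
    ("A.DC1", (0.0, 0.0, 0.0), None),
    ("A.DG2", (5.0, 0.0, 0.0), (1.6, 0.0, 0.0)),  # 1.6 A from O3' of DC1
    ("A.DC3", (9.0, 0.0, 0.0), (8.0, 0.0, 0.0)),  # 3.0 A from O3' of DG2
]
print(overlong_links(chain))
```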

The problem was finally solved after I attended the 43rd Mid-Atlantic Macromolecular Crystallography Meeting held at Duke University a few months ago. At the meeting, I had the opportunity to talk to several members of the PHENIX team. In particular, Jeff Headd revised the geometry_minimization component of PHENIX to do the trick. Here is the mail reply from Jeff, using a 3DNA-generated DNA duplex (355d-3dna.pdb) as an example (see full details below):

Here’s a first go at refining just the backbone atoms of you input DNA model. You’ll need the most recently nightly build of Phenix (dev-1395 would work) and then run:

phenix.geometry_minimization 355d-3dna.pdb min.params

using the attached min.params file.

What I specify in the params file is to only move the backbone atoms, which I’ve done with a selection. You can modify the atoms that are allowed to move to your liking.

The only other change was to allow longer distance linkages, as some of the backbone linkages start quite far apart.

The content of file min.params is:

pdb_interpretation {
  link_distance_cutoff = 7.0
}
selection = name " P  " or name " OP1" or name " OP2" or \
            name " O5'" or name " C5'" or name " C4'" or \
            name " O4'" or name " C3'" or name " O3'" or \
            name " C2'"

To make the story complete, given below is the step-by-step procedure, using 355d, a B-DNA dodecamer at 1.4 Å resolution as an example. The corresponding PDB file is named 355d.pdb.

find_pair 355d.pdb stdout | analyze stdin
x3dna_utils cp_std bdna
rebuild -atomic bp_step.par 355d-3dna.pdb
# the rebuilt structure is called '355d-3dna.pdb'

# with Phenix dev-1395 and above
phenix.geometry_minimization 355d-3dna.pdb min.params
# the optimized structure is called '355d-3dna_minimized.pdb'

# to verify:
find_pair 355d-3dna.pdb stdout | analyze stdin
find_pair 355d-3dna_minimized.pdb stdout | analyze stdin
# check files '355d-3dna.out' and '355d-3dna_minimized.out'

The three key files mentioned above are provided here for your verification:

Finally, the following figure illustrates the B-DNA dodecamer duplex in experimental (left), 3DNA-generated (middle) and PHENIX-optimized (right) coordinates. Note that disconnected O3′—P linkages (marked by red dots for two cases, see bottom of the middle image) due to overlong distances in 3DNA-rebuilt structure are fixed following the restraint PHENIX optimization.

355d-experimental 3DNA-rebuilt PHENIX-optimized
355d, experimental structure 3DNA-rebuilt structure PHENIX-optimized structure

Note added on 2016-11-11: In the min.params file, the selection is in one long line. For illustration purposes, the selection section (see below) is split into several short lines in the blog post. However, PHENIX requires ending backslashes (\) to combine the split lines into a single grammatical unit. I was not aware of this strict rule, and failed to add the ending \s in the original post. Thanks to Oleg Sobolev from the PHENIX team for bringing this omission to my attention. Note that the content of min.params did not have a problem, and thus no change is made.

pdb_interpretation {
  link_distance_cutoff = 7.0
}
selection = name " P  " or name " OP1" or name " OP2" or \
            name " O5'" or name " C5'" or name " C4'" or \
            name " O4'" or name " C3'" or name " O3'" or \
            name " C2'"
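One way to sidestep the line-continuation issue is to generate the file with the selection on a single line, as in the actual min.params. The Python sketch below does that. The atom names and the cutoff value are those listed above. The function name is hypothetical, and placing a closing brace right after the cutoff line is my assumption about the intended layout of the parameter file.

```python
BACKBONE_ATOMS = [
    " P  ", " OP1", " OP2", " O5'", " C5'",
    " C4'", " O4'", " C3'", " O3'", " C2'",
]


def build_min_params(atom_names=BACKBONE_ATOMS, link_distance_cutoff=7.0):
    """Return parameter-file text with the whole selection on one line."""
    selection = " or ".join(f'name "{name}"' for name in atom_names)
    return (
        "pdb_interpretation {\n"
        f"  link_distance_cutoff = {link_distance_cutoff}\n"
        "}\n"  # assumed position of the closing brace
        f"selection = {selection}\n"
    )


print(build_min_params())
```

Writing the returned text to min.params gives a file with no backslashes to get wrong, and the list of movable atoms can be changed in one place.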



Detection of helical junctions in nucleic acid structures

One of DSSR’s noteworthy features is the auto-detection of helical junctions in nucleic acid structures, be it RNA, DNA, or chimeric DNA/RNA, consisting of one or multiple chains. Helical junctions are created at the interface of three or more stems composed of canonical pairs (Watson-Crick A—T/U and G—C, or wobble G—U). A three-way junction model is illustrated below (copied from Figure 1 of the Bindewald et al. RNAJunction paper). Note that the three chains are each continuous (i.e., consecutive nts are covalently connected), and together with the three inner bps, form a loop in the middle. Here, the three-way junction is of type [3×2×3], and the loop is composed of a total of 3×2+3+2+3 = 14 nts.

definition of a three-way junction

DSSR automatically detects all existing helical junctions in a nucleic acid structure, as illustrated by the following examples.

1l6b [all DNA Holliday junction structure of d(CCGGTACm5CGG)]

This is a simple four-way junction of type [0×0×0×0], where all bases are paired, leaving no connecting nts. The related portion of DSSR output is:

List of 1 junction(s)
   1 4-way junctions: 8 nts; [0x0x0x0]; linked by [#1, #2, #4, #3]
       1:A.DA6+1:A.DC7+2:B.DG14+2:B.DT15+2:A.DA6+2:A.DC7+1:B.DG14+1:B.DT15 [ACGTACGT]
       0 nts junction ; 1:A.DA6-->1:A.DC7 [AC]
       0 nts junction ; 2:B.DG14-->2:B.DT15 [GT]
       0 nts junction ; 2:A.DA6-->2:A.DC7 [AC]
       0 nts junction ; 1:B.DG14-->1:B.DT15 [GT]

1L6B: all DNA Holliday junction

Technically, note the following points:

  • The four-way junction is derived from the biological assembly 1 (PDB file 1l6b.pdb1), which contains two copies of the asymmetric unit, delineated by MODEL/ENDMDL. By default, DSSR/3DNA works on one structure at a time, corresponding to the first structure/model in a given PDB or mmCIF file. To take the biological assembly as a whole, and to avoid confusion with MODEL/ENDMDL delineated NMR entries, the ENDMDL record of the first model is commented out in the file (1l6b.pdb1), as below:

MODEL        2                                                                  
  • With the modified PDB file 1l6b.pdb1, the DSSR command can be run as x3dna-dssr -i=1l6b.pdb1, with the output going to stdout.
  • The simplified schematic block png image was generated with the command below to create the Raster3D .r3d file (1l6b.r3d), which was then ray-traced using PyMOL.
blocview -r 1l6b.r3d 1l6b.pdb1

1egk [a four-way DNA/RNA junction]

This four-way junction consists of both DNA and RNA chains. Here the helical junction may not be that obvious by directly looking at the 3D image.

List of 1 junction(s)
   1 4-way junctions: 10 nts; [0x0x1x1]; linked by [#3, #-1, #4, #5]
       B.DC37+B.DT38+B.DA45+B.DC46+C.G109+C.A110+C.U111+D.DA130+D.DG131+D.DG132 [CTACGAUAGG]
       0 nts junction ; B.DC37-->B.DT38 [CT]
       0 nts junction ; B.DA45-->B.DC46 [AC]
       1 nts junction C.A110 [A]; C.G109-->C.U111 [GAU]
       1 nts junction D.DG131 [G]; D.DA130-->D.DG132 [AGG]

1EGK: four-way DNA/RNA junction

1ehz [yeast phenylalanine tRNA]

As shown below, DSSR correctly detects the classic L-shaped 3D structure and the cloverleaf 2D structure of a tRNA.

List of 1 junction(s)
   1 4-way junctions: 16 nts; [2x1x5x0]; linked by [#1, #2, #3, #4]
       A.U7+A.U8+A.A9+A.2MG10+A.C25+A.M2G26+A.C27+A.G43+A.A44+A.G45+A.7MG46+A.U47+A.C48+A.5MC49+A.G65+A.A66 [UUAgCgCGAGgUCcGA]
       2 nts junction A.U8+A.A9 [UA]; A.U7-->A.2MG10 [UUAg]
       1 nts junction A.M2G26 [g]; A.C25-->A.C27 [CgC]
       5 nts junction A.A44+A.G45+A.7MG46+A.U47+A.C48 [AGgUC]; A.G43-->A.5MC49 [GAGgUCc]
       0 nts junction ; A.G65-->A.A66 [GA]

1EHZ: yeast phenylalanine tRNA

2fk6 [RNAse Z/tRNA(Thr) complex]

In a recent paper Predicting Helical Topologies in RNA Junctions as Tree Graphs by Laing et al., this PDB entry was selected in Table 1 as containing a three-way junction. However, DSSR fails to detect any junction in this structure, even though the program does find co-axial stacks. It turns out that the PDB entry 2fk6 does not possess the anti-codon stem/loop, thus nts C25 and G46 are not covalently connected. While three-way junctions may be defined differently, the DSSR result follows the above-mentioned chain-continuity requirement.

2FK6: RNAse Z/tRNA(Thr) complex with chain break

Overall, DSSR can consistently find all helical junctions in a given nucleic acid structure. Try DSSR on a ribosomal structure, and you may well appreciate what it reveals. Moreover, it is straightforward to apply the program to all RNA/DNA-containing entries in the PDB via a script.
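When scripting over many entries, the junction summary lines shown above are easy to parse. The Python sketch below extracts the number of branches, the total nucleotide count, and the linker lengths from such a line. It then checks them against the counting rule used for the [3×2×3] example: two nucleotides for each closing base pair, plus the linker nucleotides. The regular expression is based only on the three outputs listed in this post, so it may need adjusting for other DSSR versions. The function names are hypothetical.

```python
import re

SUMMARY = re.compile(r"(\d+)-way junctions?: (\d+) nts; \[([0-9x]+)\]")


def parse_junction_summary(line):
    """Return (number of branches, total nts, list of linker lengths)."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError("not a junction summary line")
    branches = int(match.group(1))
    total_nts = int(match.group(2))
    linkers = [int(n) for n in match.group(3).split("x")]
    return branches, total_nts, linkers


def expected_loop_size(linkers):
    """Two nts per closing base pair, plus all linker nts."""
    return 2 * len(linkers) + sum(linkers)


line = "   1 4-way junctions: 16 nts; [2x1x5x0]; linked by [#1, #2, #3, #4]"
branches, total_nts, linkers = parse_junction_summary(line)
print(branches, total_nts, linkers, expected_loop_size(linkers))
```

The same rule reproduces the 8 nts of the [0x0x0x0] Holliday junction and the 10 nts of the [0x0x1x1] DNA/RNA junction.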



Drawing an RNA secondary structure from its 3D coordinates

Given the primary sequence of an RNA molecule, there are numerous methods for predicting its secondary (2D) structures. To judge their accuracy, three-dimensional (3D) RNA structures solved experimentally by X-ray or NMR as deposited in the PDB are often used as benchmarks. DSSR is a handy tool to derive an RNA 2D structure from its 3D coordinates in PDB or mmCIF format. The 2D structure is specified in the dot-bracket notation (dbn), which can be fed directly into drawing programs such as VARNA for interactive display and easy generation of publication quality 2D diagrams.

Over the past few months, I’ve been asked a few times about the details of how the diagrams in the DSSR post were created. The answer is really simple, and has already been mentioned above and in the post. Here are two concrete examples to show how the process works.

1zc5 (structure of the RNA signal essential for translational frame shifting in HIV-1)

This is the structure used in the VARNA paper. With the PDB file named 1zc5.pdb, the DSSR program can be run like this:

x3dna-dssr -i=1zc5.pdb

The output is sent to stdout by default, with the following three lines towards the end:

>1zc5-A #1 RNA with 41 nts

Simply copy and paste the last two lines (sequence and the 2D structure in dbn notation) into the Seq: and Str: fields of the VARNA demo page, and the diagram will be updated automatically, as shown in the screenshot:


1ehz (crystal structure of yeast phenylalanine tRNA at 1.93 Å resolution)

This example (1ehz.pdb) is used to illustrate tRNA’s classic cloverleaf 2D structure. The related command and result are:

x3dna-dssr -i=1ehz.pdb -o=1ehz.out

# the output is sent to file '1ehz.out'
# towards its end are the following 3 lines

>1ehz-A #1 RNA with 76 nts

I’ve used a local copy of the JAVA web start version of VARNA (VARNA-WebStart.jnlp) to generate the following 2D diagram. Here, in addition to the customized title, I have set the number period to 5 nts, adopted the simple base-pair style, and manually adjusted the T arm (upper right corner) to make the long line connecting G19 and C56 a bit more unobtrusive. Right-click to see the context menu.

Note that the G19—C56 pair creates a pseudo-knot (specified by the matching [] pair in the dbn notation above) in tRNA. I was not aware of this salient feature from previous knowledge of relevant literature. It was indeed a surprise when I first saw it in the 2D diagram.
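For readers who want to work with the dbn string directly, here is a minimal Python sketch that converts it into a list of base pairs. Round brackets and square brackets are matched on separate stacks, which is how a pseudoknotted pair such as G19—C56 can be written without clashing with the nested pairs. The function name is hypothetical, and the input string in the example is a made-up toy case, not the actual dbn of 1ehz.

```python
def dbn_to_pairs(dbn):
    """Return sorted (i, j) pairs, 1-based, from a dot-bracket string.

    '()' and '[]' are matched independently; any other character
    (such as '.') counts as an unpaired position.
    """
    closer_to_opener = {")": "(", "]": "["}
    stacks = {"(": [], "[": []}
    pairs = []
    for position, symbol in enumerate(dbn, start=1):
        if symbol in stacks:
            stacks[symbol].append(position)
        elif symbol in closer_to_opener:
            stack = stacks[closer_to_opener[symbol]]
            if not stack:
                raise ValueError(f"unmatched {symbol!r} at position {position}")
            pairs.append((stack.pop(), position))
    if any(stacks.values()):
        raise ValueError("unclosed opening bracket")
    return sorted(pairs)


# Toy example: two nested pairs plus one pseudoknotted '[' ... ']' pair
print(dbn_to_pairs("(([..))]"))
```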


As illustrated above, DSSR serves well as a bridge from RNA 3D to 2D structures. Give DSSR a try, and you will find the program actually has much more to offer!



3DNA forum registrations pass 1000

As of June 24, 2013, the number of 3DNA Forum registrations has passed the 1000 mark. On September 16, 2012, I wrote the post The number of 3DNA forum registrations has reached 500. Thus, in slightly over 9 months, the number has doubled, with approximately 2 registrations per day.

I am glad to see the steady increase of the 3DNA user base. Over time, I have strived to be responsive to user questions, and made every effort to keep the forum spam free. By and large, employing simple 3DNA-related questions has turned out to be an effective anti-spam strategy. Since the launch of the new 3DNA homepage/forum in March 2012, I’ve received fewer than five requests (to the best of my memory) asking for help on registrations. As a recent example, a potential user got stuck with the question about what ‘w’ means in w3DNA. Based on user feedback, I have added hints to some questions to make their answers more obvious. Whatever the reasons, each reported issue has been promptly resolved.

With the release of DSSR and the continuous support of an enthusiastic user community, I have every reason to believe that 3DNA will gain more popularity in the years to come.



DSSR identifies kink-turns!

As of the beta-r14-on-20130626 release, DSSR has the functionality to identify kink-turns and reverse k-turns given an RNA structure in PDB format.

The k-turn motif was first described by Klein et al. (2001) in the paper The kink-turn: a new RNA secondary structure motif, based on analyses of the H. marismortui large ribosomal subunit. It turns out to be a widespread structural motif, now with a dedicated k-turn database hosted by the Lilley laboratory.

Geometrically, a k-turn is composed of an asymmetric internal loop, with a sharp kink between the two framing helices and characteristic loop features (including at least one sheared G-A pair and A-minor interactions). Overall, the k-turn is a complicated motif, and I am not aware of any published method or available software for its auto-detection.

Previous releases of DSSR have built up all the necessary components to detect key features of a k-turn. Over the past few weeks, I have been focusing on connecting the dots to implement an algorithm for its auto-identification. As of beta-r14-on-20130626, DSSR can locate ‘simple’ k-turns or reverse k-turns from an RNA structure in PDB format. I understand the subtleties and variations of k-turns, and will refine the algorithm in future releases of DSSR.

Without putting k-turns under its umbrella, DSSR appears incomplete in its functionality. Hopefully, detection of k-turns will help DSSR gain more attention from the RNA structure community.



3DNA JoVE paper published

A new paper titled Analyzing and Building Nucleic Acid Structures with 3DNA has been published in JoVE (Journal of Visualized Experiments). Specifically, the article illustrates 3DNA’s unique capability to characterize and modify DNA structures at the level of the constituent base-pair steps, and highlights a new feature in v2.1 to analyze and align an ensemble of related structures determined with NMR or generated by MD simulations.

Here is the abstract:

The 3DNA software package is a popular and versatile bioinformatics tool with capabilities to analyze, construct, and visualize three-dimensional nucleic acid structures. This article presents detailed protocols for a subset of new and popular features available in 3DNA, applicable to both individual structures and ensembles of related structures. Protocol 1 lists the set of instructions needed to download and install the software. This is followed, in Protocol 2, by the analysis of a nucleic acid structure, including the assignment of base pairs and the determination of rigid-body parameters that describe the structure and, in Protocol 3, by a description of the reconstruction of an atomic model of a structure from its rigid-body parameters. The most recent version of 3DNA, version 2.1, has new features for the analysis and manipulation of ensembles of structures, such as those deduced from nuclear magnetic resonance (NMR) measurements and molecular dynamic (MD) simulations; these features are presented in Protocols 4 and 5. In addition to the 3DNA stand-alone software package, the w3DNA web server, located at, provides a user-friendly interface to selected features of the software. Protocol 6 demonstrates a novel feature of the site for building models of long DNA molecules decorated with bound proteins at user-specified locations.

A new section dedicated to the JoVE paper will be set up on the 3DNA Forum soon. It will contain all the data files and scripts so our published results can be strictly reproduced. The section should also serve as a platform for open discussions of related protocols.



DSSR, what is it and why bother?

Over the past six months or so1, I’ve been focusing mostly on developing DSSR, a new addition to the 3DNA suite of programs. So what is DSSR, specifically? Why did I bother to create it? How would it be relevant to the nucleic acid structure community?

Literally, DSSR stands for Defining the (Secondary) Structures of RNA2. Starting from an RNA structure in PDB format, DSSR employs a set of simple criteria to identify all existent base pairs (bp): both canonical Watson–Crick (WC) pairs and non-canonical pairs with at least one H-bond, made up of normal or modified bases, regardless of tautomeric or protonation state. The classification is based on the six standard rigid-body bp parameters (shear, stretch, stagger, propeller, buckle, and opening), which together rigorously quantify the spatial disposition of any two interacting bases. Moreover, the program characterizes each bp by commonly used names (WC, reverse WC, Hoogsteen, reverse Hoogsteen, wobble, sheared, imino, Calcutta, and dinucleotide platform), the Saenger classification scheme of 28 types, and the Leontis-Westhof nomenclature of 12 basic geometric classes. DSSR also checks for non-pairing interactions (H-bonds or base stacking).
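As a small illustration of where the number 12 comes from in the Leontis-Westhof scheme: each class combines one of two glycosidic bond orientations (cis or trans) with an unordered pair of interacting edges chosen from Watson-Crick (W), Hoogsteen (H), and Sugar (S). The Python sketch below enumerates them. The short labels (e.g. tWH) and helper names are my own shorthand for illustration, not DSSR identifiers.

```python
from itertools import combinations_with_replacement

EDGES = "WHS"  # Watson-Crick, Hoogsteen, Sugar edge


def lw_classes():
    """Enumerate the 12 basic classes: 2 orientations x 6 unordered edge pairs."""
    return [o + a + b
            for o in "ct"
            for a, b in combinations_with_replacement(EDGES, 2)]


def lw_class(label):
    """Map an ordered label such as 'tHW' to its class ('tWH' here)."""
    orientation, e1, e2 = label[0], label[1], label[2]
    a, b = sorted((e1, e2), key=EDGES.index)
    return orientation + a + b


print(lw_classes())
print(lw_class("tHW"))
```

For a specific pair, the order of the two edges follows the order of the two bases, so tHW and tWH describe the same geometric class seen from either partner.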

DSSR detects triplets and even higher-order base associations by searching horizontally in the plane of the associated bp for further H-bonding interactions. The program determines helical regions by exploring each bp’s neighborhood vertically for base-stacking interactions, regardless of backbone connection (e.g., coaxial stacking of helices or pseudo helices). Moreover, each helix/stem is characterized by a least-squares fitted helical axis to allow for easy quantification of relative helical geometry. DSSR calculates commonly used backbone (including the virtual η/θ) torsion angles, classifies the main chain backbone into BI/BII conformation and the sugar into C2’/C3’-endo like pucker, identifies A-minor interactions (types I and II), ribose zippers, G quartets, hairpin loops, kissing loops, bulges, internal loops and multi-branch loops (junctions). It also detects the existence of pseudo-knots, and outputs RNA secondary structure in the dot-bracket notation.
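Once each helix or stem has a fitted axis, relative helical geometry can be quantified very simply, for example by the angle between two axis direction vectors. The Python sketch below shows that one calculation. The vectors are hypothetical inputs, and the function is an illustration rather than DSSR's own implementation.

```python
import math


def axis_angle_deg(u, v):
    """Angle (in degrees) between two 3D direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding errors before acos
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_angle))


# Hypothetical axes of two helices
print(axis_angle_deg((0.0, 0.0, 1.0), (0.0, 1.0, 0.0)))  # perpendicular helices
print(axis_angle_deg((0.0, 0.0, 1.0), (0.0, 0.0, 2.0)))  # coaxial helices
```

An angle near 0 or 180 degrees suggests roughly coaxial helices, whereas intermediate values indicate a bend or kink between them.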

Experienced 3DNA users may notice that some of the functionality outlined above (e.g., calculation of torsion angles, identification of all pairs, higher-order base associations, and helices) has existed for over a decade. Over the years, I have written several posts (see What can 3DNA do for RNA structures?, and links therein) to advocate 3DNA’s applications in RNA structural analysis. Nevertheless, 3DNA has never been widely used in the RNA structure community, for various possible reasons: (1) the misconception that 3DNA is only for DNA (but not RNA); (2) the basic functionality is split into two programs (find_pair and analyze), and needs to be run several times with different options (default find_pair, and with -s, or -p). Thus even though 3DNA is applicable to RNA structures, it is unnecessarily complicated and confusing (especially to new 3DNA users); (3) 3DNA is command-line driven, consisting of many C programs and scripts, with different styles in specifying options. It has the ‘reputation’ of being powerful, but cryptic and hard to use.

I’ve created DSSR from scratch to take these factors into consideration, drawing on my extensive experience in supporting 3DNA, an increased knowledge of RNA structures, and refined C programming skills. Implemented in ANSI C as a stand-alone command-line program, DSSR is self-contained. Its executables (on MacOS X, Linux and Windows) have zero runtime dependencies. No setup is necessary; simply put the program into a folder of your choice (preferably one on your command PATH), and it should work. DSSR has sensible default settings and an intuitive output, making it directly accessible to a much broader audience than 3DNA per se. Since its initial release on March 3, 2013, I’ve yet to hear of any installation or usage problem. So far, all reported bugs have been verified and fixed promptly. The latest beta release has been checked against all nucleic-acid-containing entries in the PDB, without any known issues.

Overall, DSSR consolidates, refines, and significantly extends 3DNA’s functionality for RNA structural analysis. There is more to DSSR than its simple interface suggests. Piece by piece, DSSR may appear to offer nothing new, yet taken together, it has unique features not available anywhere else. Its value will be gradually appreciated as DSSR becomes more widely used by the community. Want to know if your structure contains any Hoogsteen pair, sheared G•A pair, or a dinucleotide platform? DSSR can check it for you, easily.

DSSR-beta already possesses all the basic functionality and has been well tested to serve as a handy tool for RNA structural analysis. I stand firmly behind DSSR, and strive to continuously improve the program. Give it a try, and report back on the 3DNA Forum any issues you have. As always, I respond quickly and concretely to all questions posted there. I hope you enjoy using DSSR as much as I enjoy creating and supporting it!

1 This post was published on March 29, 2013, shortly after the beta releases of DSSR [note added on March 15, 2014].

2 DSSR also works for DNA, or DNA-protein complexes, as far as the basic functionality is concerned. Moreover, the acronym could have two other possible interpretations, as would be obvious when the program gains a wider recognition.



Number of base pairs with at least two inter-base H-bonds: 28 or 29?

Early on when I started on DNA structures, I read Saenger’s book Principles of Nucleic Acid Structure and became familiar with his classification of the 28 possible base-pairs (bps) for A, G, U(T), and C involving at least two (cyclic) hydrogen bonds (see figure below).

The 28 possible base-pairs for A, G, U(T), and C involving at least two (cyclic) hydrogen bonds.

Later on, I read from the 2nd edition of The RNA World book a list of 29 bps compiled by Burkard, Turner & Tinoco. While the one bp discrepancy (28 vs 29) has been in my mind for quite a long while, I had never paid much attention to the issue until recently while adding classifications of RNA bps (among many other functionalities) to 3DNA. A Google search did not help solve the puzzle, so I decided to dig it out by comparing the two lists.

The Burkard et al. list is titled Structures of Base Pairs Involving at Least Two Hydrogen Bonds and it mentions specifically Saenger’s list:

The structures of 29 possible base pairs that involve at least two hydrogen bonds are given in Figures 1–5 (for further descriptions, see Saenger, in Principles of nucleic acid structure, p. 120. Springer-Verlag [1984]).

However, in the five figures, Burkard et al. do not provide the corresponding Saenger numbers (I to XXVIII, 1—28) for the 28 common bps; thus it is not immediately obvious which one (i.e., the new addition by Burkard et al.) is missing from Saenger’s list. Under careful scrutiny, the absent bp turns out to be the “G•C N3-amino, amino-N3” pair in Figure 3: “Six possible flipped purine-pyrimidine mismatches.” One example of such G+C pair is found in the 5S ribosomal RNA (chain 9, G3022—C3026) of Haloarcula marismortui in PDB entry 1vq8.

The G+C pair missing from Saenger's list

The above figure shows clearly that the G+C bp does indeed have two canonical H-bonds between base atoms, and it is difficult to speculate how it escaped Saenger’s selection criteria. In the upcoming new 3DNA component, I am listing this bp as number XXIX (29), along with the other 28 base pairs.



The Calcutta U-U base pair

Recently, I came across the so-called Calcutta U-U base pair (bp) [see figure below] while reading articles on C-H…O contacts in nucleic acid structures. Not having been familiar with this named pair before, I was curious to find out what it’s about. After some searching, I traced the origin of the Calcutta U-U bp to the following two papers published by Sundaralingam’s group during the mid-1990s:

We have called the novel U•U base pair, where the Hoogsteen face of one of the pyrimidines is involved in a C5-H—O4 hydrogen bond, the ‘Calcutta Base Pair’, since it was announced at the International Seminar-cum-School on Macromolecular Crystallographic Data held in Calcutta, November 16-20, 1995.

We recently discovered a novel U•U base pair, referred to as the Calcutta base pair, in the crystal structure of an RNA hexamer UUCGCG (Ref. 18). The two uracil bases form a conventional N(3)-H…O(4) and an unconventional C(5)-H…O(2) hydrogen bond (Fig. 3a). The C-H…O interaction is entirely ‘voluntary’ and not ‘forced’, underlining its importance in base mispairing.

3DNA has no problem identifying the Calcutta U-U bps (or any pair for that matter); an example is shown below based on the RNA hexamer UUCGCG structure (PDB entry: 1osu) solved by Sundaralingam and colleagues.

Calcutta U-U pair

In the new 3DNA component I’ve been working on (and to be released soon), the Calcutta U-U pair is characterized as below:

1/A.U1 3/A.U2 [U-U] Calcutta 00-n/a tHW -MW
  anti C3'-endo 8.9 --- anti C3'-endo 30.3
  dcc=11.18  dnn=8.48  dmm=7.58  tor=-174.1
  H-bonds[2]: "O4(carbonyl)-N3(imino)[2.76]; C5-O4(carbonyl)[3.27]"

  Shear=-3.67   Stretch=-0.52     Stagger=-0.89
  Buckle=-1.41  Propeller=-16.03  Opening=-90.67

The Calcutta pair is explicitly named, along with other named base pairs (e.g., Watson-Crick [WC], Wobble, and Hoogsteen bps). It is classified as type tHW (trans with Hoogsteen/WC interacting edges), following the commonly used Leontis-Westhof nomenclature. It does not belong to any of the 28 bps (00-n/a) with at least two conventional H-bonds, as categorized by Saenger. In 3DNA, the Calcutta U-U pair is of M-N type, designated as -MW.
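For downstream scripting, the H-bonds line in the listing above is easy to break into its parts. The Python sketch below does so with regular expressions. The pattern is inferred solely from the single example line shown here, so it is an assumption about the format rather than a documented specification, and the function name is hypothetical.

```python
import re

# One H-bond item looks like: atom1(type1)-atom2(type2)[distance]
# where the "(type)" annotation is optional, as in "C5-O4(carbonyl)[3.27]".
ITEM = re.compile(
    r"([\w']+)(?:\(([^)]*)\))?-([\w']+)(?:\(([^)]*)\))?\[([\d.]+)\]"
)


def parse_hbonds(line):
    """Parse an 'H-bonds[n]: "..."' line into a list of dictionaries."""
    match = re.search(r'H-bonds\[(\d+)\]:\s*"([^"]*)"', line)
    if match is None:
        raise ValueError("not an H-bonds line")
    expected = int(match.group(1))
    bonds = [
        {"atom1": a1, "type1": t1, "atom2": a2, "type2": t2, "dist": float(d)}
        for a1, t1, a2, t2, d in ITEM.findall(match.group(2))
    ]
    if len(bonds) != expected:
        raise ValueError(f"expected {expected} H-bonds, parsed {len(bonds)}")
    return bonds


line = 'H-bonds[2]: "O4(carbonyl)-N3(imino)[2.76]; C5-O4(carbonyl)[3.27]"'
for bond in parse_hbonds(line):
    print(bond)
```

The count in square brackets is used as a sanity check, so a line that does not fit the assumed pattern fails loudly instead of silently dropping an H-bond.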

Among the well-known named base pairs, some are after the scientists who discovered them (e.g., WC and Hoogsteen bps), while others are based on chemical/geometrical features (e.g., Wobble and Sheared G-A bps), or a combination of both (e.g., reversed WC/Hoogsteen bps). The Calcutta U-U pair is unique in that it is named after a place in India:

Kolkata, or Calcutta, is the capital of the Indian state of West Bengal. … While the city’s name has always been pronounced Kolkata or Kolikata in Bengali, the anglicized form Calcutta was the official name until 2001, when it was changed to Kolkata in order to match Bengali pronunciation.



Analysis of molecular dynamics simulations trajectories

Prior to v2.1, 3DNA did not provide any direct support for the analysis of molecular dynamics (MD) simulations trajectories of nucleic acid structures. Nevertheless, over the years, I noticed some significant applications of 3DNA in the active MD field; see my blog post (December 6, 2009) titled 3DNA in the PCCP nucleic acid simulations themed issue. In January 2011, I released a set of two Ruby scripts specifically aimed to facilitate the analysis of MD simulations trajectories. Thereafter (as of 3DNA v2.1), I have significantly refined and expanded the Ruby scripts, and consolidated the functionality under one umbrella, x3dna_ensemble with multiple sub-commands (analyze, block_image, extract, and reorient). I believe x3dna_ensemble would make it straightforward to analyze ensembles (NMR or MD simulations trajectories) of nucleic acid structures.

Against this background, I was glad to recently read an article titled Structure, Stiffness and Substates of the Dickerson-Drew Dodecamer in J. Chem. Theory Comput. where 3DNA was used extensively. This work represents a re-visit of the classic Dickerson−Drew B-DNA dodecamer d-[CGCGAATTCGCG]2 using state-of-the-art MD simulations with different ionic conditions and solvation models, and compares the MD trajectories with modern crystallographic and NMR data. Among the author list (Tomas Drsata, Alberto Perez, Modesto Orozco, Alexandre Morozov, Jiri Sponer, and Filip Lankas) are some well-known figures in the MD field of nucleic acid structures.

Reading through the text, I am not sure if the newly available functionality of x3dna_ensemble was used. From the excerpts of the citations given below, however, it seems obvious that 3DNA is now well-accepted by the MD community.

Snapshots taken in 10 ps intervals were analyzed using the 3DNA program.43 From 3DNA outputs, time series of conformational parameters were extracted. These included the intra-base-pair coordinates (buckle, propeller, opening, shear, stretch, and stagger), inter-base-pair or step coordinates (tilt, roll, twist, shift, slide, and rise) as well as groove widths (based on P−P distances), backbone torsions, and sugar puckers.

Contrary to the original work of Lankas et al.,31 the intra-base-pair and step coordinates used here are those defined by 3DNA.43

Here, we apply this model together with the 3DNA definitions of the intra-base-pair and step coordinates.43

However, important differences remain, and non- negligible differences are in fact observed between individual experimental structures also in the central part of DD, even though the intra-base-pair and step coordinates are computed using the same coordinate definitions64 (we consistently use the 3DNA coordinates in this work).



Application of the mutate_bases program in inferring statistical protein-DNA potentials

Thanks to Google Scholar, I recently became aware of the article by Mohammed AlQuraishi & Harley McAdams (2012), “Three enhancements to the inference of statistical protein-DNA potentials”, in Proteins: Structure, Function, and Bioinformatics. Reading through the text, I like it quite a bit. The abstract summarizes the work well:

The energetics of protein-DNA interactions are often modeled using so-called statistical potentials, that is, energy models derived from the atomic structures of protein-DNA complexes. Many statistical protein-DNA potentials based on differing theoretical assumptions have been investigated, but little attention has been paid to the types of data and the parameter estimation process used in deriving the statistical potentials. We describe three enhancements to statistical potential inference that significantly improve the accuracy of predicted protein-DNA interactions: (i) incorporation of binding energy data of protein-DNA complexes, in conjunction with their X-ray crystal structures, (ii) use of spatially-aware parameter fitting, and (iii) use of ensemble-based parameter fitting. We apply these enhancements to three widely-used statistical potentials and use the resulting enhanced potentials in a structure-based prediction of the DNA binding sites of proteins. These enhancements are directly applicable to all statistical potentials used in protein-DNA modeling, and we show that they can improve the accuracy of predicted DNA binding sites by up to 21%.

I’m glad to find that the 3DNA mutate_bases program was used in deriving the statistical potentials of protein-DNA interactions:

The relative binding affinity of a protein to two different DNA sequences can be evaluated by computing the binding energy of the protein to those two sequences. This is done by mutating the DNA sequence in silico while keeping the protein fixed. We used the 3DNA software package for mutating DNA23,24, which maintains the backbone atoms of the DNA molecule but replaces the basepair atoms in a way that is consistent with the backbone orientation of the DNA.

For each base position, in silicon structural mutants are generated using 3DNA23,24 to mutate the basepair to include all four possibilities.

This is exactly one of the use cases I have in mind while creating the program:

Overall, mutate_bases has been designed to solve the in silico base mutation problem in a practical sense: robust and efficient, getting its job done and then out of the way. The program can have many possible applications: in addition to performing base-pair mutations in DNA-protein complexes, it should also prove handy in RNA modeling and in providing initial structures for QM/MM/MD energy calculations, and in DNA/RNA modeling studies.

With the recent refinement to allow for 3-letter nucleotide name in the standard base-reference frame file, mutate_bases now makes it exceedingly easy to mutate cytosine to 5-methylcytosine.

As more people get to know this 3DNA functionality, I am confident that mutate_bases will be more widely used.



Quantification of base-stacking interactions using overlap area

Base-stacking interactions stabilize nucleic acid structures. Many ways exist to account for such interactions, including quantum chemical calculations (see for example the review by Sponer et al. [2008] on Nature and magnitude of aromatic stacking of nucleic acid bases). In 3DNA, base-stacking interactions are assessed from planar projections of the ring and exocyclic atoms in consecutive bases or base pairs; the larger the overlap area, the stronger the stacking interactions, and vice versa.
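The idea of an overlap area between two projected polygons can be sketched in a few lines of Python. The example below uses the shoelace formula for polygon area and Sutherland-Hodgman clipping for the intersection. It is a simplified stand-in, not 3DNA's implementation: it assumes both polygons are convex, that the clipping polygon is listed counter-clockwise, and that the coordinates are already projected onto a common plane. The two squares are hypothetical test shapes, not base outlines.

```python
def polygon_area(pts):
    """Unsigned area of a simple polygon via the shoelace formula."""
    total = 0.0
    for i in range(len(pts)):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % len(pts)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def clip_convex(subject, clipper):
    """Sutherland-Hodgman: part of `subject` inside convex, CCW `clipper`."""
    def inside(p, a, b):
        # p lies on the left of (or on) the directed edge a -> b
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # Intersection of line p-q with line a-b (not parallel here)
        den = (p[0] - q[0]) * (a[1] - b[1]) - (p[1] - q[1]) * (a[0] - b[0])
        t = ((p[0] - a[0]) * (a[1] - b[1]) - (p[1] - a[1]) * (a[0] - b[0])) / den
        return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))

    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        points, output = output, []
        if not points:
            break
        s = points[-1]
        for e in points:
            if inside(e, a, b):
                if not inside(s, a, b):
                    output.append(intersect(s, e, a, b))
                output.append(e)
            elif inside(s, a, b):
                output.append(intersect(s, e, a, b))
            s = e
    return output


def overlap_area(poly1, poly2):
    return polygon_area(clip_convex(poly1, poly2))


square_a = [(0, 0), (1, 0), (1, 1), (0, 1)]
square_b = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5)]
print(overlap_area(square_a, square_b))  # the squares share a 0.5 x 0.5 corner
```

Real base outlines that include exocyclic atoms need not be convex, in which case a general polygon-clipping routine would be required instead.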

Over the years, I’ve seen a few publications taking advantage of this 3DNA parameter. Here are two recent ones:

To analyze the role of the sequence regularity for the double-helical structure, we calculated the overall overlapping of base pairs (stacking) at every step of the two duplexes of 20mer pG(CUG)6C and the duplex of 19mer pGG(CGG)3(CUG)2CC using the program 3DNA (Lu & Olson, 2003).

Basepair overlap values are calculated by 3DNA software.35

Hopefully, more 3DNA users will notice this ‘little’ feature and make good use of it.



Named base pairs

In the field of nucleic acid structures, especially in the ‘RNA world’, we often hear of named base pairs (bp). Among those, the Watson-Crick (WC) A–U and G–C bps (see figure below) are by far the most common.

Watson-Crick base pairs

Reversed WC (rWC) base pairs

Closely related to the WC bps are the so-called reversed WC (rWC) bps, where the relative orientations of the glycosidic bonds are reversed; instead of being on the same side of the bases as in WC bps shown above, they are now on opposite sides in rWC bps as shown below. According to the Leontis-Westhof (LW) bp classification scheme, the rWC bps belong to trans WC/WC. Following Saenger’s numbering, the rWC A+U bp corresponds to XXI, and the rWC G+C bp XXII.

In the figures below, the name of each type of bp and its LW & Saenger designations (separated by ‘;’) are noted under the corresponding image. All images are generated with 3DNA; for easy comparison, each bp is oriented in the reference frame of the leading base.

Reversed WC A+U pair: trans WC/WC; XXI
Reversed WC G+C pair: trans WC/WC; XXII

Hoogsteen and reversed Hoogsteen base pairs

The next most famous one is the Hoogsteen A+U bp, which also has a reverse variant, i.e., the rHoogsteen A–U bp (see figure below). Now the major groove edge of A, termed the Hoogsteen edge by LW, is used for pairing with U.

Hoogsteen A+U pair: cis Hoogsteen/WC; XXIII
Reversed Hoogsteen A–U pair: trans Hoogsteen/WC; XXIV

The G–U Wobble base pair

First proposed by Crick in 1966 to account for the degeneracy in codon–anticodon pairing, the Wobble bp is an essential component (in addition to the WC bps) in forming double helical RNA secondary structures.

Wobble G–U pair

The sheared G–A base pair

Sheared G–A is a commonly found non-WC bp in both DNA and RNA structures. Noticeably, tandem sheared G–A bps introduce distinct stacking geometry. Here G uses its minor groove edge, termed the sugar edge by LW, to pair with the Hoogsteen edge of A.

Sheared G–A pair: trans Sugar/Hoogsteen; XI

Dinucleotide platforms

Dinucleotide platforms are formed via side-by-side pairing of adjacent bases; the most common of which are GpU and ApA. Here the sugar (minor-groove) edge of the 5′ base interacts with the Hoogsteen (major-groove) edge of the 3′ base. Since there is only one base-base H-bond in dinucleotide platforms, no Saenger classification is available. In 3DNA output, the GpU dinucleotide platform is designated as G+U, and ApA as A+A.

GpU dinucleotide platform: cis Sugar/Hoogsteen; n/a
ApA dinucleotide platform: cis Sugar/Hoogsteen; n/a

Other named base pairs

There exist other named bps in the RNA literature, e.g., G⋅A imino, A⋅C reverse Hoogsteen, G⋅U reverse Wobble, etc. In my experience, they are (much) less commonly used than the ones illustrated above.



Unusual glycosidic bond in nucleic acid structures in the PDB/NDB

Glycosidic bond “is a type of covalent bond that joins a carbohydrate (sugar) molecule to another group, which may or may not be another carbohydrate.” In nucleic acid structures, the other group is a nucleobase, and the predominant type is the N-glycosidic bond where the purine (A/G) N9 or pyrimidine (C/T/U) N1 atom connects to the C1′ atom of the five-membered (deoxy) ribose sugar ring. Another well-known type is the C-glycosidic bond in pseudouridine, the most common modified base in RNA structures, where the C5 atom instead of N1 is linked to the C1′ atom of the sugar ring.

N-glycosidic bond in U vs C-glycosidic bond in pseudoU

Recently, I performed a survey of all nucleic-acid-containing structures in the PDB/NDB database to see how many types of glycosidic bond there are. As always, I noticed some inconsistencies in the data: nucleotides with disconnected base/sugar, or a base labeled as U but with a pseudoU-type C-glycosidic bond. Shown below are a few unusual types of glycosidic bond in otherwise seemingly “normal” structures:

  • The residue GN7 (number 28 on chain A) in PDB entry 1gn7 contains a N7-glycosylated guanine.

N7-glycosylated guanine

  • The residue UPG (number 501 on chain A) in PDB entry 1y6f has the sugar C1C (instead of C1′) atom connected to N1 of U.

C1C links to N1 of U

  • The residue XAE (number 11 on chain B) in PDB entry 2icz contains a benzo-homologous adenine.

xA in the benzo-homologous xDNA

  • The residue F5H (number 206 on chain B) in PDB entry 3v06 has N1 of U connected to C2′ of a six-membered sugar ring.

N1(U) connects to C2′

An unusual glycosidic bond has implications for 3DNA-calculated parameters, for example the chi torsion angle. Identifying such cases would help refine 3DNA to provide sensible parameters and to avoid possible misinterpretations.
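To see why the linkage matters for chi, recall that the torsion is conventionally defined over O4′-C1′-N9-C4 for purines and O4′-C1′-N1-C2 for pyrimidines. The Python sketch below computes a dihedral angle from four points and looks up the chi atoms by name. A residue whose glycosidic bond involves a different atom simply lacks one of the expected names. The function names and the coordinates are hypothetical, and this is not 3DNA's code.

```python
import math

# Conventional chi definitions for the standard N-glycosidic linkage
CHI_ATOMS = {
    "purine": ("O4'", "C1'", "N9", "C4"),
    "pyrimidine": ("O4'", "C1'", "N1", "C2"),
}


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p1, p2, p3, p4):
    """Signed dihedral angle p1-p2-p3-p4 in degrees, in (-180, 180]."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n2 = _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(_cross(b1, b2), n2)
    return math.degrees(math.atan2(y, x))


def chi_deg(coords, base_kind):
    """chi from a dict of atom name -> (x, y, z); KeyError if an atom is absent."""
    return dihedral_deg(*(coords[name] for name in CHI_ATOMS[base_kind]))


# Hypothetical coordinates chosen to give an easy-to-check angle
toy = {"O4'": (1, 0, 0), "C1'": (0, 0, 0), "N9": (0, 0, 1), "C4": (0, 1, 1)}
print(chi_deg(toy, "purine"))
```

With the toy coordinates, asking for the pyrimidine definition raises a KeyError because N1 and C2 are absent, which mirrors how an unexpected glycosidic linkage can leave chi undefined unless it is handled explicitly.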



FRETmatrix: a methodological platform for the simulation and analysis of FRET in nucleic acids

In the ‘Advance Access’ section of Nucleic Acids Research, published on September 12, 2012 (DOI: 10.1093/nar/gks856), I came across the paper FRETmatrix: a general methodology for the simulation and analysis of FRET in nucleic acids by Søren Preus et al. In this work, the authors developed a methodological platform (implemented in the Matlab package ‘FRETmatrix’) to simulate base-base FRET in order to elucidate the structure and dynamics of nucleic acids.

Reading through the text, I am pleased to find that the authors take advantage of the matrix-based Calladine and El Hassan Scheme (CEHS) for ‘building nucleic acid geometrical models’, and kindly cite SCHNArP, 3DNA, and the standard base-reference frame paper. They provide a succinct description of the model building process, and also note the connection between CEHS and SCHNArP. From the very beginning, I appreciated the elegance of the CEHS method: it is simple, mathematically rigorous, and generally applicable for quantifying the relative position and orientation between any two rigid bodies. SCHNAaP/SCHNArP implements the analysis/rebuilding components of CEHS in an expanded form, and CEHS further serves as a cornerstone of 3DNA.

Another point worth noting is Figure 3 (see below) where the authors present (a–c) Representative examples of output geometries produced by FRETmatrix (right) along with the block representation of the corresponding structures produced by 3DNA (28) (left). To the best of my memory, this is one of the very few times where 3DNA’s blocview functionality is explicitly cited.

geometrical model building combined with FRET simulations in three model structures



The number of 3DNA forum registrations has reached 500

As of today (2012-09-16), the number of 3DNA forum registrations has reached 500! A quick browse of the ‘Statistics Center’ shows that over 80% of the registrations (400+) are after March 2012, when the new 3DNA homepage/forum were launched.

The sharp increase in registrations is mostly due to the streamlined, web-based way to distribute the 3DNA software package. As far as I know, the number of 3DNA registrations/downloads in the past six months is significantly higher than that of 3DNA v2.0 over more than three years. Equally importantly, I have been able to fix every reported bug, address each feature request, and update the 3DNA v2.1 distribution promptly.

I also feel confident to declare that up to now, the 3DNA Forum is spam free (at least to the extent I am aware). To this end, I’ve taken the following three measures:

  • Installation of the SMF “Mod Stop Spammer”; as of this writing, it shows “3920 Spammers blocked up until today”.
  • Use of 3DNA-related verification questions. At its current setting, a user must correctly answer three of the ‘simple’ yet effective verification questions. Early on, I decided deliberately not to use CAPTCHA as an anti-spam means, based on my past experience.
  • Continuous monitoring of (new) registrations, with immediate action taken against any suspicious registration. Due to the effectiveness of the above two steps, so far I have had to manually handle only a few spam registrations. Nevertheless, it does illustrate the fact that no automatic method is perfect, and expert inspection is required to ensure desired results.

Overall, the new simplified way to distribute the 3DNA software package is working as intended; now users can easily access all distributed versions of 3DNA, and I can focus on support and further development of the software.



Classification of dinucleotide steps into A-, B- and TA-DNA

Since v1.5 or even earlier, 3DNA has provided an automatic classification of a dinucleotide step into A-, B- or TA-DNA conformation. Figure 5 of the 2003 3DNA Nucleic Acids Research paper (NAR03) shows three sets of scatter plots (helical inclination and x‐displacement, dimer step Roll and Slide, and the projected phosphorus z coordinates Zp and Zp(h)) to differentiate the A-, B- and TA-DNA dinucleotide steps.

Classification of A-, B- and TA-DNA dinucleotide steps

Among the criteria tested, the most discriminative ones are the projected phosphorus z coordinates, Zp in the middle step frame (see figure below), and Zp(h) defined similarly but in the middle helical frame.

definition of the Zp parameter
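As a rough numerical illustration of Zp, the Python sketch below takes the coordinates of the two phosphorus atoms of a step, already expressed in the middle step frame, and averages their z-coordinates. It assumes that the strand II value is measured in that strand's own direction, i.e. with the sign of z reversed. Otherwise the two contributions would roughly cancel by the approximate dyad symmetry of the step. The function name and the coordinates are hypothetical, and the sign handling is my reading of the definition rather than code taken from 3DNA.

```python
def zp_from_middle_frame(p_strand1, p_strand2):
    """Mean projected z of the two P atoms of a dinucleotide step.

    Both points are (x, y, z) tuples in the middle step frame.
    Assumption: the strand II phosphorus is counted with its z-coordinate
    negated, since strand II runs in the opposite direction.
    """
    z1 = p_strand1[2]
    z2 = -p_strand2[2]
    return 0.5 * (z1 + z2)


# Hypothetical, roughly dyad-related phosphorus positions (in angstroms)
print(zp_from_middle_frame((-3.0, 9.0, 2.2), (-3.0, -9.0, -2.2)))
```

Zp(h) would follow the same pattern with coordinates expressed in the middle helical frame instead of the middle step frame.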

Over the years, I have received many questions regarding the datasets used in generating Figure 5 of NAR03. Back in August 2006, a user asked for IDs of the TA-DNA structures (see DNA standards/statistics using 3DNA). In April 2007, another user requested the same TA-DNA dataset. Early this year, a user asked for 3DNA’s A-DNA definition. More recently, yet another user asked about the DNA set used for the analysis presented in Figure 5 of the NAR03 paper.

I am glad to see that nearly a decade after the NAR03 publication, the user community is still interested in knowing the details of the work. So I decided to dig into my archive for the original data files and scripts used to generate Figure 5 of NAR03. It was not an easy journey: just releasing the data files and scripts is not enough; I also wanted to verify that they work together as intended in today’s computing environment. Luckily, I was finally able to get to the bottom of the issues. The details are in the post Datasets and scripts for reproducing Figure 5 of the 3DNA NAR03 paper. The tarball file named 3DNA-NAR03-Fig5.tar.gz is available by clicking the link.



Four papers in the August 2012 [40(14)] issue of NAR cite 3DNA

While browsing the August 2012 40(14) issue of Nucleic Acids Research (NAR), I noticed the following four papers that cite 3DNA:

The local base pair step parameters as calculated by x3dna (37,38) are represented in the Supplementary Figure S2.

The initial extended single-stranded DNA structure was obtained using the 3DNA program (15).

DNA structures were analyzed using 3DNA (31).

Each of these DNA structural models consists of values for all base-pair step parameters (roll, twist, tilt, rise, shift and slide) for each dinucleotide or trinucleotide. This enabled us to convert DNA sequences into 3D coordinates by using the rebuilding part of 3DNA (39), a program for analysis, rebuilding and visualization of 3D nucleic acid structures.

The above four NAR papers appear in the sections “Nucleic Acid Enzymes” (1), “Structural Biology” (2) and “Methods Online” (1), and cover research areas of DNA-protein interactions (3) and G-quadruplex structures (1). As quoted above, two papers employ the analyzing components of 3DNA, while the other two take advantage of its rebuilding facilities.

Between the two primary 3DNA publications, the 2003 NAR paper (NAR03) is cited twice, while the 2008 Nature Protocols paper (NP08) is cited three times. Apparently, after some time lag, NP08 has gradually overtaken NAR03 to become the community’s favorite citation for 3DNA.



Rectangular block expressed in PDB format

As noted in the post Rectangular block expressed in MDL molfile format, I added the -mol option (in v2.1) to convert 3DNA’s native alchemy to the better-supported MDL molfile format, to make the characteristic schematic representations more widely accessible. Along the same line, I have recently further augmented alc2img with the -pdb option to transform alchemy to the PDB format.

While the macromolecular PDB format is certainly not convenient for specifying linkage details of small molecules, it is nevertheless the best documented, and by far more widely supported than molfile or alchemy in currently available molecular viewers. For example, the PDB format is consistently supported in Jmol, PyMOL, RasMol, DeepView, and UCSF Chimera. Moreover, the PDB format does have the CONECT section to provide information on atomic connectivity:

The CONECT records specify connectivity between atoms for which coordinates are supplied. The connectivity is described using the atom serial number as shown in the entry. CONECT records are mandatory for HET groups (excluding water) and for other bonds not specified in the standard residue connectivity table.

The alc2img -pdb option takes advantage of the CONECT records and specifies all ‘bond’ linkages explicitly. The usage is very simple. Taking the standard base-pair rectangular block file (‘Block_BP.alc’) as an example, the conversion can be performed as below:

alc2img -pdb Block_BP.alc Block_BP.pdb

Content of ‘Block_BP.alc’

   12 ATOMS,    12 BONDS
    1 N      -2.2500   5.0000   0.2500
    2 N      -2.2500  -5.0000   0.2500
    3 N      -2.2500  -5.0000  -0.2500
    4 N      -2.2500   5.0000  -0.2500
    5 C       2.2500   5.0000   0.2500
    6 C       2.2500  -5.0000   0.2500
    7 C       2.2500  -5.0000  -0.2500
    8 C       2.2500   5.0000  -0.2500
    9 C      -2.2500   5.0000   0.2500
   10 C      -2.2500  -5.0000   0.2500
   11 C      -2.2500  -5.0000  -0.2500
   12 C      -2.2500   5.0000  -0.2500
    1     1     2
    2     2     3
    3     3     4
    4     4     1
    5     5     6
    6     6     7
    7     7     8
    8     5     8
    9     9     5
   10    10     6
   11    11     7
   12    12     8

Content of ‘Block_BP.pdb’

REMARK    3DNA v2.1 (c) 2012 Dr. Xiang-Jun Lu (
HETATM    1  N   ALC A   1      -2.250   5.000   0.250  1.00  1.00           N  
HETATM    2  N   ALC A   1      -2.250  -5.000   0.250  1.00  1.00           N  
HETATM    3  N   ALC A   1      -2.250  -5.000  -0.250  1.00  1.00           N  
HETATM    4  N   ALC A   1      -2.250   5.000  -0.250  1.00  1.00           N  
HETATM    5  C   ALC A   1       2.250   5.000   0.250  1.00  1.00           C  
HETATM    6  C   ALC A   1       2.250  -5.000   0.250  1.00  1.00           C  
HETATM    7  C   ALC A   1       2.250  -5.000  -0.250  1.00  1.00           C  
HETATM    8  C   ALC A   1       2.250   5.000  -0.250  1.00  1.00           C  
HETATM    9  C   ALC A   1      -2.250   5.000   0.250  1.00  1.00           C  
HETATM   10  C   ALC A   1      -2.250  -5.000   0.250  1.00  1.00           C  
HETATM   11  C   ALC A   1      -2.250  -5.000  -0.250  1.00  1.00           C  
HETATM   12  C   ALC A   1      -2.250   5.000  -0.250  1.00  1.00           C  
CONECT    1    2    4                                                  
CONECT    2    1    3                                                  
CONECT    3    2    4                                                  
CONECT    4    1    3                                                  
CONECT    5    6    8    9                                             
CONECT    6    5    7   10                                             
CONECT    7    6    8   11                                             
CONECT    8    5    7   12                                             
CONECT    9    5                                                       
CONECT   10    6                                                       
CONECT   11    7                                                       
CONECT   12    8                                                       
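
To make the mapping between the two listings concrete, here is a minimal Python sketch of the same kind of conversion. It is not the alc2img source code: the function name alchemy_to_pdb is hypothetical, the parser only handles the bare serial/name/x/y/z and serial/atom/atom columns shown above, and the fixed ‘ALC A 1’ residue fields and 1.00 occupancy/B-factor values are simply copied from the sample output.

```python
BLOCK_BP = """\
   12 ATOMS,    12 BONDS
    1 N      -2.2500   5.0000   0.2500
    2 N      -2.2500  -5.0000   0.2500
    3 N      -2.2500  -5.0000  -0.2500
    4 N      -2.2500   5.0000  -0.2500
    5 C       2.2500   5.0000   0.2500
    6 C       2.2500  -5.0000   0.2500
    7 C       2.2500  -5.0000  -0.2500
    8 C       2.2500   5.0000  -0.2500
    9 C      -2.2500   5.0000   0.2500
   10 C      -2.2500  -5.0000   0.2500
   11 C      -2.2500  -5.0000  -0.2500
   12 C      -2.2500   5.0000  -0.2500
    1     1     2
    2     2     3
    3     3     4
    4     4     1
    5     5     6
    6     6     7
    7     7     8
    8     5     8
    9     9     5
   10    10     6
   11    11     7
   12    12     8
"""


def alchemy_to_pdb(alc_text):
    """Convert a minimal alchemy block (as listed above) to PDB-style lines."""
    lines = [ln for ln in alc_text.splitlines() if ln.strip()]
    header = lines[0].split()
    n_atoms, n_bonds = int(header[0]), int(header[2])

    atoms = []
    for ln in lines[1:1 + n_atoms]:
        serial, name, x, y, z = ln.split()[:5]
        atoms.append((int(serial), name, float(x), float(y), float(z)))

    # Each bond is recorded for both of its atoms, as CONECT expects.
    partners = {serial: [] for serial, *_ in atoms}
    for ln in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        _, i, j = (int(tok) for tok in ln.split()[:3])
        partners[i].append(j)
        partners[j].append(i)

    out = []
    for serial, name, x, y, z in atoms:
        # Residue name/chain/number and the 1.00 values mimic the sample output.
        out.append(
            f"HETATM{serial:5d}  {name:<3s} ALC A   1    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{1.0:6.2f}          {name:>2s}"
        )
    for serial in sorted(partners):
        bonded = "".join(f"{p:5d}" for p in sorted(partners[serial]))
        out.append(f"CONECT{serial:5d}{bonded}")
    return out


pdb_lines = alchemy_to_pdb(BLOCK_BP)
print("\n".join(pdb_lines))
```

The point of the sketch is the bookkeeping: alchemy lists each bond once, whereas the CONECT section lists, for every atom, all of its bonded partners.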



Effect of reversing strands of a DNA duplex on 3DNA calculated parameters

From a pure structural perspective, the designation of the two strands in an anti-parallel DNA duplex is sort of arbitrary. Thus, for a given PDB file, let’s assume that the atomic coordinates of chain A (strand I) come before those of chain B (strand II). We can swap the order of the two chains as they appear in the PDB file, i.e., list first the atomic coordinates of chain B and then those of chain A.

Structurally, the two settings correspond to exactly the same DNA molecule. As far as 3DNA goes, however, the different orderings do make a difference in calculated parameters. Using the Dickerson B-DNA dodecamer CGCGAATTCGCG solved at high resolution (PDB entry 355d) as an example, running 3DNA find_pair and analyze on ‘355d.pdb’ gives the results (abbreviated) below:

find_pair 355d.pdb 355d.bps
    # contents of file '355d.bps':
    2         # duplex
   12         # number of base-pairs
    1    1    # explicit bp numbering/hetero atoms
    1   24  0 #    1 | ....>A:...1_:[.DC]C-----G[.DG]:..24_:B<....
    2   23  0 #    2 | ....>A:...2_:[.DG]G-----C[.DC]:..23_:B<....
    3   22  0 #    3 | ....>A:...3_:[.DC]C-----G[.DG]:..22_:B<....
    4   21  0 #    4 | ....>A:...4_:[.DG]G-----C[.DC]:..21_:B<....
    5   20  0 #    5 | ....>A:...5_:[.DA]A-----T[.DT]:..20_:B<....
    6   19  0 #    6 | ....>A:...6_:[.DA]A-----T[.DT]:..19_:B<....
    7   18  0 #    7 | ....>A:...7_:[.DT]T-----A[.DA]:..18_:B<....
    8   17  0 #    8 | ....>A:...8_:[.DT]T-----A[.DA]:..17_:B<....
    9   16  0 #    9 | ....>A:...9_:[.DC]C-----G[.DG]:..16_:B<....
   10   15  0 #   10 | ....>A:..10_:[.DG]G-----C[.DC]:..15_:B<....
   11   14  0 #   11 | ....>A:..11_:[.DC]C-----G[.DG]:..14_:B<....
   12   13  0 #   12 | ....>A:..12_:[.DG]G-----C[.DC]:..13_:B<....

analyze 355d.bps
    # generate output file '355d.out', with base-pair step parameters:
    step       Shift     Slide      Rise      Tilt      Roll     Twist
   1 CG/CG      0.09      0.04      3.20     -3.22      8.52     32.73
   2 GC/GC      0.50      0.67      3.69      2.85     -9.06     43.88
   3 CG/CG     -0.14      0.59      3.00      0.97     11.30     25.11
   4 GA/TC     -0.45     -0.14      3.39     -1.59      1.37     37.50
   5 AA/TT      0.17     -0.33      3.30     -0.33      0.46     37.52
   6 AT/AT     -0.01     -0.60      3.22     -0.31     -2.67     32.40
   7 TT/AA     -0.08     -0.40      3.22      1.68     -0.97     33.74
   8 TC/GA     -0.27     -0.23      3.47      0.68     -1.69     42.14
   9 CG/CG      0.70      0.78      3.07     -3.66      4.18     26.58
  10 GC/GC     -1.31      0.36      3.37     -2.85     -9.37     41.60
  11 CG/CG     -0.31      0.21      3.17     -0.68      6.69     33.31

Reversing the order of chains A and B in ‘355d.pdb’ as ‘355d-reversed.pdb’ and repeating the above procedure, we have the following results:

find_pair 355d-reversed.pdb 355d-reversed.bps
    # contents of file '355d-reversed.bps':
    2         # duplex
   12         # number of base-pairs
    1    1    # explicit bp numbering/hetero atoms
    1   24  0 #    1 | ....>B:..13_:[.DC]C-----G[.DG]:..12_:A<....
    2   23  0 #    2 | ....>B:..14_:[.DG]G-----C[.DC]:..11_:A<....
    3   22  0 #    3 | ....>B:..15_:[.DC]C-----G[.DG]:..10_:A<....
    4   21  0 #    4 | ....>B:..16_:[.DG]G-----C[.DC]:...9_:A<....
    5   20  0 #    5 | ....>B:..17_:[.DA]A-----T[.DT]:...8_:A<....
    6   19  0 #    6 | ....>B:..18_:[.DA]A-----T[.DT]:...7_:A<....
    7   18  0 #    7 | ....>B:..19_:[.DT]T-----A[.DA]:...6_:A<....
    8   17  0 #    8 | ....>B:..20_:[.DT]T-----A[.DA]:...5_:A<....
    9   16  0 #    9 | ....>B:..21_:[.DC]C-----G[.DG]:...4_:A<....
   10   15  0 #   10 | ....>B:..22_:[.DG]G-----C[.DC]:...3_:A<....
   11   14  0 #   11 | ....>B:..23_:[.DC]C-----G[.DG]:...2_:A<....
   12   13  0 #   12 | ....>B:..24_:[.DG]G-----C[.DC]:...1_:A<....

analyze 355d-reversed.bps
    # generate output file '355d-reversed.out', with base-pair step parameters:
    step       Shift     Slide      Rise      Tilt      Roll     Twist
   1 CG/CG      0.31      0.21      3.17      0.68      6.69     33.31
   2 GC/GC      1.31      0.36      3.37      2.85     -9.37     41.60
   3 CG/CG     -0.70      0.78      3.07      3.66      4.18     26.58
   4 GA/TC      0.27     -0.23      3.47     -0.68     -1.69     42.14
   5 AA/TT      0.08     -0.40      3.22     -1.68     -0.97     33.74
   6 AT/AT      0.01     -0.60      3.22      0.31     -2.67     32.40
   7 TT/AA     -0.17     -0.33      3.30      0.33      0.46     37.52
   8 TC/GA      0.45     -0.14      3.39      1.59      1.37     37.50
   9 CG/CG      0.14      0.59      3.00     -0.97     11.30     25.11
  10 GC/GC     -0.50      0.67      3.69     -2.85     -9.06     43.88
  11 CG/CG     -0.09      0.04      3.20      3.22      8.52     32.73

Comparing the base-pair step parameters between ‘355d.out’ and ‘355d-reversed.out’, one notices that while slide/rise/roll/twist simply switch order, shift/tilt (the x-axis parameters) also flip their signs. On the other hand, the nucleotide serial numbers specifying base pairs (the left two columns) are identical in ‘355d.bps’ and ‘355d-reversed.bps’.
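
The relationship between the two tables can be checked with a few lines of Python. This is only an illustrative sketch working on the numbers printed above (it does not call 3DNA); the names STEPS_355D and reverse_strand_steps are made up for this example.

```python
# Base-pair step parameters copied from the '355d.out' table above:
# (Shift, Slide, Rise, Tilt, Roll, Twist)
STEPS_355D = [
    (0.09, 0.04, 3.20, -3.22, 8.52, 32.73),
    (0.50, 0.67, 3.69, 2.85, -9.06, 43.88),
    (-0.14, 0.59, 3.00, 0.97, 11.30, 25.11),
    (-0.45, -0.14, 3.39, -1.59, 1.37, 37.50),
    (0.17, -0.33, 3.30, -0.33, 0.46, 37.52),
    (-0.01, -0.60, 3.22, -0.31, -2.67, 32.40),
    (-0.08, -0.40, 3.22, 1.68, -0.97, 33.74),
    (-0.27, -0.23, 3.47, 0.68, -1.69, 42.14),
    (0.70, 0.78, 3.07, -3.66, 4.18, 26.58),
    (-1.31, 0.36, 3.37, -2.85, -9.37, 41.60),
    (-0.31, 0.21, 3.17, -0.68, 6.69, 33.31),
]


def reverse_strand_steps(steps):
    """Predict step parameters after swapping the two strands.

    The step order is reversed, and the x-axis parameters (Shift, Tilt)
    change sign; Slide, Rise, Roll and Twist are carried over unchanged.
    """
    return [(-shift, slide, rise, -tilt, roll, twist)
            for shift, slide, rise, tilt, roll, twist in reversed(steps)]


predicted = reverse_strand_steps(STEPS_355D)
for row in predicted:
    print("".join(f"{value:10.2f}" for value in row))
```

The printed rows can be compared line by line with the ‘355d-reversed.out’ table.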

Apart from explicitly swapping the two strands in the PDB data file, one can simply switch around the nucleotide serial numbers generated with find_pair in order to analyze a DNA duplex based on its complementary sequence instead of the primary one. For example, starting from the same PDB file ‘355d.pdb’, we change ‘355d.bps’ to ‘355d-cs.bps’ as below,

    2         # duplex
   12         # number of base-pairs
    1    1    # explicit bp numbering/hetero atoms
   13   12
   14   11
   15   10
   16    9
   17    8
   18    7
   19    6
   20    5
   21    4
   22    3
   23    2
   24    1

Running analyze 355d-cs.bps, one gets exactly the same parameters in the output file ‘355d-cs.out’ as in ‘355d-reversed.out’.
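
For longer duplexes, editing the serial numbers by hand is error-prone. The reshuffling can be sketched in Python as follows; the helper name complementary_pairs is hypothetical, and the sketch only produces the two-column pair list, not the header lines of the .bps file.

```python
def complementary_pairs(pairs):
    """Re-express base pairs from the viewpoint of the complementary strand.

    Each (i, j) pair becomes (j, i), and the list order is reversed so that
    the new leading strand is again listed in increasing serial number.
    """
    return [(j, i) for i, j in reversed(pairs)]


# The 12 pairs from '355d.bps': (1, 24), (2, 23), ..., (12, 13)
pairs_355d = [(i, 25 - i) for i in range(1, 13)]

pairs_cs = complementary_pairs(pairs_355d)
for i, j in pairs_cs:
    print(f"{i:5d}{j:5d}")
```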



Schematic diagrams of base-pair parameters

Ever since the 2003 publication of the initial 3DNA Nucleic Acids Research paper (NAR03), the schematic diagrams of base-pair parameters (see figure below) have become quite popular. Over the years, we have received numerous requests for permission to use the figure, or a portion thereof; as an example, the figure has been adopted into a structural biology textbook. In the 2008 3DNA Nature Protocols paper (NP08), we devoted the very first protocol to “create a schematic image for propeller of 45°”.

Schematic diagram of rigid body parameters

Figure legend taken from Figure 1 of NAR03: Pictorial definitions of rigid body parameters used to describe the geometry of complementary (or non‐complementary) base pairs and sequential base pair steps (19). The base pair reference frame (lower left) is constructed such that the x‐axis points away from the (shaded) minor groove edge of a base or base pair and the y‐axis points toward the sequence strand (I). The relative position and orientation of successive base pair planes are described with respect to both a dimer reference frame (upper right) and a local helical frame (lower right). Images illustrate positive values of the designated parameters. For illustration purposes, helical twist (Ωh) is the same as Twist (ω), formerly denoted by Ω (19,20) and helical rise (h) is the same as Rise (Dz).

I recall spending around two weeks to produce the above figure. Content-wise, the figure was constructed in only a short while; it was the little details that took me most of the time.

Over time, I’ve witnessed numerous versions of such schematic images in publications related to DNA/RNA structures. While looking similar, the schematics differ subtly in the magnitude, orientation and relative scale of illustrated parameters. To the best of my knowledge, only 3DNA provides a pragmatic approach to generate the base-pair schematic diagrams consistently.

To make the schematics more readily accessible, I’ve reproduced a high resolution image (in png format) for each of the 14 parameters shown above. You are welcome to pick and match the diagrams as necessary. If you use any of them in your publications, please cite the 3DNA NAR03 and/or NP08 paper(s).

Note that in the schematic diagrams below, the shaded edge (facing the viewer) denotes the minor-groove side of a base or base pair.

Shear (Sx) Stretch (Sy) Stagger (Sz)
Shear Stretch Stagger
Buckle (κ) Propeller (π) Opening (σ)
Buckle Propeller Opening
Shift (Dx) Slide (Dy) Rise (Dz)
Shift Slide Rise
Tilt (τ) Roll (ρ) Twist (ω)
Tilt Roll Twist
x-displacement (dx) y-displacement (dy) Helical Rise (h)
x-displacement y-displacement As for Rise above
(for illustration purpose)
Inclination (η) Tip (θ) Helical Twist (Ωh)
Inclination Tip As for Twist above
(for illustration purpose)



Perl scripts are obsolete but still available

As of v2.1, I’ve switched from Perl to Ruby as the scripting language for 3DNA. Consequently, the Perl scripts in previous versions of 3DNA (v1.5 and v2.0) are now obsolete. I’ll only correct bugs in existing Perl scripts, but will not add any new features.

For back reference, the scripts are still available from a separate directory $X3DNA/perl_scripts, with the following contents:

OP_Mxyz*          dcmnfile*         nmr_strs*
README            del_ms*           pdb_frag*
block_atom*       expand_ids*       x3dna2charmm_pdb*
blocview.pl*      manalyze*         x3dna_r3d2png*
bp_mutation*      mstack2img*       x3dna_setup.pl*
cp_std*           nmr_ensemble*

Among them, x3dna_setup.pl and blocview.pl have corresponding Ruby versions: x3dna_setup and blocview. Actually, the .pl file extension (for Perl) was added to avoid confusion with the new Ruby scripts.

Some of the functionalities have been incorporated into the Ruby script x3dna_utils:

A miscellaneous collection of 3DNA utilities
    Usage: x3dna_utils [-h|-v] sub-command [-h] [options]
    where sub-command must be one of: 
        block_atom -- generate a base block schematic representation
        cp_std -- select standard PDB datasets for analyze/rebuild
        dcmnfile -- remove fixed-name files generated with 3DNA
        x3dna_r3d2png -- convert .r3d to image with Raster3D or PyMOL
  --version, -v:   Print version and exit
     --help, -h:   Show this message

Along the same line, ensemble-related functionalities (for NMR or molecular dynamics simulations) have been consolidated and extended into the new Ruby script x3dna_ensemble:

Utilities for the analysis and visualization of an ensemble
    Usage: x3dna_ensemble [-h|-v] sub-command [-h] [options]
    where sub-command must be one of: 
        analyze -- analyze MODEL/ENDMDL delineated ensemble (NMR or MD)
        block_image -- generate a base block schematic image
        extract -- extract structural parameters after running 'analyze'
        reorient -- reorient models to a particular frame/orientation
  --version, -v:   Print version and exit
     --help, -h:   Show this message

Conceivably, C programs in 3DNA can also be consolidated. For backward compatibility, however, all existing C programs will be kept — and refined as necessary — in the current 3DNA v2.x series. As of v3.x, I’ll completely re-organize 3DNA incorporating my years of experience in programming languages and knowledge of macromolecular structures.



Specification of base pairs in 3DNA

In 3DNA, each base pair (bp) is specified by the identity of its two comprising nucleotides (nts), and their interactions. Some examples are shown below based on the PDB entry 1ehz (the crystal structure of yeast phenylalanine tRNA at 1.93 Å resolution), with the shorthand form on the right:

....>A:...1_:[..G]G-----C[..C]:..72_:A<....  G-C
....>A:...4_:[..G]G-*---U[..U]:..69_:A<....  G-U
....>A:...9_:[..A]A-**+-A[..A]:..23_:A<....  A+A
....>A:..15_:[..G]G-**+-C[..C]:..48_:A<....  G+C
....>A:..26_:[M2G]g-**--A[..A]:..44_:A<....  g-A

Specification of a nucleotide

The nt specification string consists of 6 fields and follows the pattern below, with the number of characters in each field inside the parentheses:


  1. modelNum(4) — the model number is up to 4 digits, right-justified, with each leading space replaced by a dot. If no model number is available, as is the case for 1ehz (and virtually all other x-ray crystal structures in the PDB), it is written as .... (4 dots).
  2. chainId(1) — the chain id is 1-char long, with space replaced by underscore.
  3. ntNum(4) — the nt residue number, handled as for the model number.
  4. insCode(1) — insertion code, handled as for the chain id.
  5. ntName(3) — the nt residue name is up to 3-char long, right-justified, with each leading space replaced by a dot.
  6. baseName(1) — the base name is 1-char long, mapped from ntName(3) following $X3DNA/config/baselist.dat. Note that modified nucleotides are put in lower case to distinguish them from the canonical ones — for example, M2G to g.

For the complementary base in a bp, the order of the 6 fields is reversed — see examples above. To see the full list of nts in a PDB data file, run: find_pair -s 1ehz.pdb stdout (here using 1ehz as an example).
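
As an illustration of the six-field layout, the following Python sketch assembles an nt string from its parts. It is not 3DNA code: the function names are hypothetical, the delimiters (>, :, [ and ]) are read off the 1ehz examples above, and the base name is passed in directly rather than looked up in baselist.dat.

```python
def dotted(value, width):
    """Right-justify value in width characters, with padding spaces as dots."""
    return str(value).rjust(width).replace(" ", ".")


def nt_spec(chain, num, name, base, model=None, ins=" ", reverse=False):
    """Build a nucleotide specification string in the style shown above.

    With reverse=True, the six fields are written in the opposite order,
    as used for the complementary base of a pair.
    """
    m = dotted("" if model is None else model, 4)  # modelNum(4)
    c = chain.replace(" ", "_")                    # chainId(1)
    n = dotted(num, 4)                             # ntNum(4)
    i = ins.replace(" ", "_")                      # insCode(1)
    r = dotted(name, 3)                            # ntName(3)
    b = base                                       # baseName(1)
    if reverse:
        return f"{b}[{r}]:{n}{i}:{c}<{m}"
    return f"{m}>{c}:{n}{i}:[{r}]{b}"


left = nt_spec("A", 1, "G", "G")
right = nt_spec("A", 72, "C", "C", reverse=True)
print(left + "-----" + right)
```

Joining the two halves with the five-character pairing pattern reproduces the first 1ehz example line.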

Specification of a base pair

The pattern of a bp is M-xyz-N, where M and N are 1-char base names (as in aforesaid field #6), and the three characters xyz have the following meaning:

  • z — the sign of the dot product of the z-axes of the M and N base reference frames. It is positive (+) if the two z-axes point in similar directions, as in Hoogsteen or reverse Watson-Crick bps. Conversely, it is negative (-) when the two z-axes point in opposite directions, as in the canonical Watson-Crick and Wobble bps. See figure below:

Watson-Crick (M-N) vs Hoogsteen base pairs

  • y — it is - if M and N are in a so-called Watson-Crick geometry (the two y-axes of the M and N base reference frames are anti-parallel, so are the two z-axes, whilst the two x-axes are parallel), e.g., the G-U Wobble pair; otherwise, *.
  • x — it is - for Watson-Crick bps, otherwise, *.

By design, Watson-Crick bps would be of the pattern M-----N, Wobble bps M-*---N, and non-canonical bps M-**+-N or M-**--N. Thus by browsing through the 3DNA output, users can readily identify these three bp types.
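
The three categories can also be picked out programmatically. Below is a small Python sketch that extracts the M-xyz-N pattern from a line of find_pair-style output and reports the shorthand form together with the bp category; the regular expression and the function name describe_bp are assumptions made for this example, based only on the sample lines shown earlier.

```python
import re

# Matches "]M-xyz-N[" in the middle of a base-pair line.
BP_PATTERN = re.compile(r"\](.)-(.)(.)(.)-(.)\[")


def describe_bp(line):
    """Return (shorthand, category) for a 3DNA base-pair line."""
    match = BP_PATTERN.search(line)
    if match is None:
        raise ValueError(f"no base-pair pattern found in: {line!r}")
    m, x, y, z, n = match.groups()
    if (x, y, z) == ("-", "-", "-"):
        category = "Watson-Crick"
    elif (x, y, z) == ("*", "-", "-"):
        category = "Wobble"
    else:
        category = "non-canonical"
    return f"{m}{z}{n}", category


EXAMPLES = [
    "....>A:...1_:[..G]G-----C[..C]:..72_:A<....",
    "....>A:...4_:[..G]G-*---U[..U]:..69_:A<....",
    "....>A:...9_:[..A]A-**+-A[..A]:..23_:A<....",
    "....>A:..15_:[..G]G-**+-C[..C]:..48_:A<....",
    "....>A:..26_:[M2G]g-**--A[..A]:..44_:A<....",
]
results = [describe_bp(line) for line in EXAMPLES]
for shorthand, category in results:
    print(shorthand, category)
```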

The shortened form is represented as MzN; following the aforementioned notation, it can be either M-N or M+N. The relative direction of the two z-axes is critical in affecting 3DNA-calculated bp (and step) parameters, as detailed in the 2003 3DNA NAR paper:

To calculate the six complementary base pair parameters of an M–N pair (Shear, Stretch, Stagger, Buckle, Propeller and Opening), where the two z‐axes run in opposite directions, the reference frame of the complementary base N is rotated about the x2‐axis by 180°, i.e. reversing the y2‐ and z2‐axes in Figure 2a. Under this convention, if the base pair is reckoned as an N–M pair, rather than an M–N pair, the x‐axis parameters (Shear and Buckle) reverse their signs. For an M+N pair, e.g. the Hoogsteen A+U in Figure 2b, the x2‐, y2‐ and z2‐axes do not change sign; thus all six parameters for an N+M pair are of opposite sign(s) from those for an M+N pair.

The M-N and M+N bp designation is unique to 3DNA. In combination with the corresponding 6 bp parameters (shear, stretch, stagger, buckle, propeller, and opening), 3DNA provides a rigorous description of all possible bps. This contrasts with, and complements, the conventional Saenger scheme and the 3-edge based Leontis/Westhof notation.

The 3DNA M-N vs M+N bp designation is base-centric, without concerning the sugar-phosphate backbone. The chi (χ) torsion angle, which characterizes base/sugar relative orientation, can be in either anti or syn conformation; thus similar backbones can accommodate either M-N or M+N.



Is the O2′(G)...O2P(U) H-bond in GpU platforms real?

Among the findings of our 2010 Nucleic Acids Research (NAR) article titled The RNA backbone plays a crucial role in mediating the intrinsic stability of the GpU dinucleotide platform and the GpUpA/GpA miniduplex, the key is identifying the O2′(G)…O2P(U) H-bond (see figure below). As noted in a previous post What’s special about the GpU dinucleotide platform?, it was an accidental observation while I was preparing a figure for our 2008 3DNA Nature Protocols paper. Trained as a chemist, after scrutinizing the many occurrences of the GpU platforms in the large ribosomal subunit of Haloarcula marismortui (PDB entry 1jj2), I had no doubt that it is an H-bond. Yet, behind the scenes, things were never that straightforward: if it is indeed an H-bond as we’ve claimed, how could it have been missed altogether by the RNA structural biology community?

Structural characterization of the GpU dinucleotide platform, depicting the little-noticed O2′(G)...O2P(U) H-bond in the sugar–phosphate backbone.

Anticipating the potential questions that could be raised by the reviewers, we were extremely careful in characterizing the O2′(G)…O2P(U) H-bond:

  • It is formed between the hydroxyl group (donor) of G and a non-bridging phosphate oxygen atom (O2P, acceptor) of U.
  • The distance between O2′(G) and O2P(U), 2.68 ± 0.14 Å, is perfect for an H-bond.
  • I queried the Cambridge Structural Database for hydroxyl-phosphate H-bonds with similar relative geometry and chemical identity. We found a case in the phospholipid lysophosphatidyl-ethanolamine, where this type of H-bond is highlighted in the abstract: The free glycerol hydroxyl group forms an intramolecular hydrogen bond with a phosphate oxygen and thus affects the conformation and orientation of the head group.
  • I also performed a survey of potential O2′(i)…O2P(i+1) H-bonds within dinucleotides regardless of platform configuration, and detected 1186 such pairwise interactions within a distance cutoff of 3.3 Å in RNA crystal structures of 2.5 Å or better resolution.
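
The distance criterion used in such a survey is easy to express in code. The Python sketch below is not the script used for the paper: the function name, the data layout (one dict of atom coordinates per nucleotide, in chain order) and the coordinates are all hypothetical, and it applies only the 3.3 Å distance cutoff without any check of H-bond geometry or chain continuity.

```python
import math


def o2prime_o2p_contacts(residues, cutoff=3.3):
    """Return (i, i+1, distance) for sequential O2'(i)...O2P(i+1) contacts.

    residues is a list of dicts mapping atom names to (x, y, z) tuples,
    ordered along the chain; nucleotides missing either atom are skipped.
    """
    hits = []
    for idx in range(len(residues) - 1):
        donor = residues[idx].get("O2'")
        acceptor = residues[idx + 1].get("O2P")
        if donor is None or acceptor is None:
            continue
        dist = math.dist(donor, acceptor)
        if dist <= cutoff:
            hits.append((idx, idx + 1, round(dist, 2)))
    return hits


# Made-up coordinates, for illustration only.
toy_chain = [
    {"O2'": (0.0, 0.0, 0.0), "O2P": (5.0, 5.0, 5.0)},
    {"O2'": (10.0, 0.0, 0.0), "O2P": (2.68, 0.0, 0.0)},
    {"O2'": (0.0, 0.0, 0.0), "O2P": (10.0, 4.0, 0.0)},
]
contacts = o2prime_o2p_contacts(toy_chain)
print(contacts)
```

In a real survey, the coordinates would come from parsed PDB entries filtered by resolution, and each hit would still need to be inspected in its structural context.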

Careful as we were, we still failed to convince reviewer #3 of our manuscript, which was originally submitted to the RNA journal and finally rejected following the second round of review. Here is an excerpt related to the O2′(G)…O2P(U) H-bond from reviewer #3’s comment:

The first main concern is that the “new” H-bond interaction that the authors propose as an explanation for the greater occurrence of GU platforms versus di-nucleotide combinations does not make much sense on a fundamental chemical and stereo-chemical point of view. Unless the whole community of chemists and biochemists agree to redefine what an H-bond is, the fact that the 2’OH (i) atom is at 2.68 Å from the O2P atom cannot be the only criteria for an H-bond. In fact, if the authors are the first to mention this H-bond, it is because none of the scientists working in RNA structural biology would have considered this to be an H-bond interaction at the first place! H-bonds are known to be very directional. The O2’-H bond should be aligned with one of the electron doublets of O2P to be able to form a proper H-bond. Acceptable variation could be 20° to 30° degree with respect of a straight H-bond interaction, not 90°! The unique paper that the authors cite for justifying their claim cannot be used as a reference. If the authors want to justify that the close proximity of the 2’OH(i) and O2P is the important factor that contributes to preference of GU platforms versus other platforms, they should undergo quantum mechanics calculations to demonstrate it.

This review is so critical that I saw no point in arguing with it: I certainly have neither the power to “redefine what an H-bond is” nor the expertise to perform quantum mechanics (QM) calculations to validate the O2′(G)…O2P(U) H-bond or otherwise. What has been compelling to me about the GpU story from the very beginning is that once this sugar-phosphate H-bond is acknowledged, all the other parts of our NAR paper follow naturally and logically. Leaving the chicken-or-egg issue alone, our work provides a novel perspective on the GpU platform’s predominance, the formation of the bulged-G or loop-E motif, the evolutionary co-occurrence of GpUpA and GpA in the GpUpA/GpA miniduplex, and the extreme conservation of GpU observed at most 5′-splice sites. Put another way, we connect the dots to form a coherent picture that is easily understandable to biologists and chemists.

Luckily, after being re-submitted to NAR, the paper was quickly accepted for publication and even selected as a featured article! As another nice surprise, shortly after it was available online as an Advance Access paper, I received an email from Jiri Sponer. Thereafter, we collaborated on a follow-up paper titled Understanding the Sequence Preference of Recurrent RNA Building Blocks Using Quantum Chemistry: The Intrastrand RNA Dinucleotide Platform. While not unexpected, the results of the state-of-the-art QM calculations were nevertheless reassuring:

The mixed-pucker sugar–phosphate backbone conformation found in most GpU platforms, in which the 5′-ribose sugar (G) is in the C2′-endo form and the 3′-sugar (U) in the C3′-endo form, is intrinsically more stable than the standard A-RNA backbone arrangement, partially as a result of a favorable O2′···O2P intraplatform interaction. Our results thus validate the hypothesis of Lu et al. (Lu, X.-J.; et al. Nucleic Acids Res. 2010, 38, 4868–4876) that the superior stability of GpU platforms is partially mediated by the strong O2′···O2P hydrogen bond. …… In contrast, we show that the dinucleotide platform is not properly described in the course of atomistic explicit-solvent simulations. Our work also gives methodological insights into QM calculations of experimental RNA backbone geometries. Such calculations are inherently complicated by rather large data and refinement uncertainties in the available RNA experimental structures, which often preclude reliable energy computations.

So, the O2′(G)…O2P(U) H-bond is more than likely to be real; at least some other scientists working in RNA structural biology do share our view.

See also: What’s special about the GpU dinucleotide platform?



Reverse Watson-Crick base pairs

While the Watson-Crick (WC) base pairs (bps) are best-known and most abundant in nucleic acid structures (including RNA), the so-called reverse WC bp variants have received little attention. In the well-established Saenger scheme (see figure below), there are 28 possible bps for A, G, U(T), and C in their canonical (keto- and amino-) tautomeric forms and involving at least two H-bonds. The reverse A·T/U and G·C WC pairs are asymmetric, and are numbered XXI and XXII respectively (middle of right-hand side in the figure below).

The 28 possible base-pairs for A, G, U(T), and C involving at least two (cyclic) hydrogen bonds.

In 3DNA, the WC bps are of type M–N and listed as A–T and G–C, consistent with the conventional notation. The reverse WC bps, on the other hand, are of type M+N and listed as A+T and G+C; the ‘+’ signifies the parallel z-axes of the two base reference frames, therefore their dot product is positive (see figure 2 in post Hoogsteen and reverse Hoogsteen base pairs).
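
The sign convention itself takes one line of arithmetic. As a minimal Python sketch (with a hypothetical function name and made-up z-axis vectors, not values taken from any structure):

```python
def bp_type_sign(z_m, z_n):
    """Return '+' for an M+N pair and '-' for an M-N pair.

    z_m and z_n are the z-axes of the two base reference frames. A zero
    dot product is reported as '-' here, which is an arbitrary choice.
    """
    dot = sum(a * b for a, b in zip(z_m, z_n))
    return "+" if dot > 0 else "-"


# z-axes pointing in opposite directions, as in a Watson-Crick pair
print(bp_type_sign((0.0, 0.0, 1.0), (0.0, 0.0, -1.0)))
# z-axes pointing in similar directions, as in a reverse Watson-Crick pair
print(bp_type_sign((0.0, 0.0, 1.0), (0.1, 0.0, 0.995)))
```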

As of this writing, a Google search of the phrase “reverse Watson Crick base pair” does not come up with anything informative — the top hit is the Jena Library page titled Nucleic Acid Nomenclature and Structure showing the same set of 28 possible bps only with explicit base chemical structures, as compiled by Tinoco Jr. et al. (1993).

However, once I looked into this special type of bp, a quick search in PDB entry 1jj2, the Haloarcula marismortui large ribosomal subunit solved at 2.4 Å resolution, revealed nine reverse WC bps as shown below:

 __U.U..0.205._   __A.A..0.437._   [U+A]
 __C.C..0.1186._  __G.G..0.1190._  [C+G]
 __C.C..0.1377._  __G.G..0.1683._  [C+G]
 __C.C..0.1856._  __G.G..0.1873._  [C+G]
 __A.A..0.2054._  __U.U..0.2648._  [A+U]
 __U.U..0.2109._  __A.A..0.2467._  [U+A]
 __A.A..0.2301._  __U.U..0.2306._  [A+U]
 __A.A..0.2321._  __U.U..0.2378._  [A+U]
 __C.C..0.2510._  __G.G..0.2564._  [C+G] 

The following figure shows a representative reverse WC A+U bp (0.A437 with 0.U205, top), and a representative reverse WC G+C bp (0.G1683 with 0.C1377, bottom). For easy comparison, the two reverse WC bps are orientated in the reference frames of A and G, respectively.

reverse Watson-Crick A+U pair
reverse Watson-Crick G+C pair

In future releases of 3DNA, presumably starting from v2.2, we plan to provide a new component to classify bps according to the Saenger scheme, the Leontis/Westhof notation, and the geometric parameter-based strategy. Overall, the three bp classification methods are complementary in functionality, but with increased sophistication and applicability.



Hoogsteen and reverse Hoogsteen base pairs

The A·U (or A·T) Hoogsteen pair is a well-known type of base pair (bp), named after the scientist who discovered it. As shown in the Figure below (left), in the Hoogsteen bp scheme, adenine uses its N7 (acceptor) and N6 (donor) atoms at the major groove edge to form two H-bonds with the N3 (donor) and O4 (acceptor) atoms from uracil, respectively. Interestingly, if the uracil base ring is flipped around the N7(A)…N3(U) H-bond by 180 degrees, N6(A) now forms an H-bond with O2(U), i.e., N6(A)…O2(U): this pairing scheme is called the reverse Hoogsteen bp (right).

Hoogsteen and reverse Hoogsteen base pairs

I first learned about the Hoogsteen bp from Saenger’s book titled “Principles of Nucleic Acid Structure”. My knowledge of the Hoogsteen bp deepened as I tried to categorize different types of bps, especially in RNA-containing structures, in a consistent and rigorous computational framework. Thus, in the 3DNA NAR03 publication, we specifically discussed the Hoogsteen A+U bp (M+N type) and compared it with the A·U Watson-Crick bp (M–N type), as shown in the Figure below:

Watson-Crick vs Hoogsteen base pairs

Antiparallel and parallel combinations of adenine (A) and uracil (U) base pair ‘faces’: (a) the antiparallel Watson–Crick A–U pair with opposing faces (shaded versus unshaded) and a 1.5 Å Stretch introduced to separate the two base reference frames; (b) the parallel Hoogsteen A+U pair with base pair faces of the same sense. Black dots on bases denote the C1′ atoms on the attached sugars.

However, only recently did I read the two original publications by Hoogsteen:

  • The two-page long preliminary report, titled The structure of crystals containing a hydrogen-bonded complex of 1-methylthymine and 9-methyladenine [Acta Cryst. (1959). 12, pp.822-3]. It contains only a single reference, i.e. the 1953 Watson-Crick DNA structure Nature paper. Reading carefully through the two pages, I know why Hoogsteen used the methylated derivatives of thymine and adenine, and how the failed initial interpretation of the experimental “vector-density map” based on the Watson-Crick A-T bp led to the discovery of the new base-pairing scheme:

The fact that the first trial structure could not be refined led to a more critical scrutiny of the generalized projection and a greater emphasis on the significance of certain spurious peaks and on relatively large variations in the heights of peaks that were assumed to represent atoms. The correct structure was finally discovered by changing the positions of a few atoms in the 9-methyladenine portion of the asymmetric unit.

I enjoyed reading these two papers a lot. More generally, I like such focused articles, where the authors get directly to the point and address it thoroughly and clearly.

As a side note, the term Hoogsteen “edge” appears frequently in today’s publications on RNA structures: in the Leontis-Westhof bp classification scheme, this term simply means the major groove edge in what would be a Watson-Crick bp geometry.



Rectangular block expressed in MDL molfile format

3DNA, following SCHNArP, uses the alchemy file format for the schematic base-pair rectangular block representation. Alchemy is a simple molecular file format, suitable for chemical compounds, that specifies atom positions and bond linkages explicitly. By checking a sample alchemy file (here for the drug aspirin), scientists with chemistry knowledge should have little problem figuring out what each field means. As it happens, the 3DNA alchemy representation of the base-pair rectangular block is much simpler than that of a typical chemical compound (e.g., aspirin). With no differing partial atomic charges or atom types, and no distinction among single, double, or aromatic bond types, the base-pair block can be specified with uniform pseudo-atoms (nodes) and pseudo-bonds (edges). Apart from being simple, alchemy was one of the common file formats supported by RasMol; that is the pragmatic reason why I adopted the format in SCHNArP and 3DNA.

Over the years, 3DNA has continued to use the alchemy format for base and base-pair rectangular blocks. It forms the basis of the Calladine-Drew style schematic representation images in PostScript (.eps), Xfig (.fig) and Raster3D (.r3d) formats. However, outside 3DNA, the alchemy format is not well supported by popular molecular graphics programs, including RasMol, Jmol and PyMOL:

  • RasMol v2.6.4, from Roger Sayle (the original author of RasMol), is mostly fine, except that the -noconnect option should be specified. As noted in the 3DNA Nature Protocols (2008) paper, The option ‘-noconnect’ makes sure that RasMol uses only the linkage information specified in the Alchemy file (by setting the CalcBondsFlag to false). … The … Alchemy files [can] contain explicitly specified coordinate axes, which would interfere with the default bond-calculation algorithm in RasMol.
  • RasMol v2.7.x has a bug in displaying alchemy files.
  • Jmol has supported the alchemy format since 11.7.18 (December 2008), following my request [see initial discussion and follow-up].
  • PyMOL does not recognize the alchemy format.

To make the schematic base-pair rectangular block representation more broadly accessible, I have recently added the -mol option to alc2img in 3DNA v2.1 to readily convert an alchemy file to the well-documented and widely supported MDL molfile format. The usage is very simple. Taking the standard base-pair rectangular block file (Block_BP.alc) as an example, the conversion can be performed as below:

alc2img -mol Block_BP.alc Block_BP.mol
alc2img -molv3000 Block_BP.alc Block_BP_v3000.mol

Note the following:

  • By default, the -mol option converts alchemy to V2000 molfile format. However, if the number of atoms/bonds is greater than 999, the extended V3000 molfile format is used.
  • The V3000 molfile format can be explicitly specified with -molv3000 (or -mol3), as shown above.
  • Only V2000 molfile is consistently supported by RasMol, Jmol and PyMOL. On the other hand, while Jmol recognizes V3000 molfile, RasMol and PyMOL do not.
  • For reference, the three files — Block_BP.alc, Block_BP.mol, and Block_BP_v3000.mol — are enclosed below.

Content of ‘Block_BP.alc’

   12 ATOMS,    12 BONDS
    1 N      -2.2500   5.0000   0.2500
    2 N      -2.2500  -5.0000   0.2500
    3 N      -2.2500  -5.0000  -0.2500
    4 N      -2.2500   5.0000  -0.2500
    5 C       2.2500   5.0000   0.2500
    6 C       2.2500  -5.0000   0.2500
    7 C       2.2500  -5.0000  -0.2500
    8 C       2.2500   5.0000  -0.2500
    9 C      -2.2500   5.0000   0.2500
   10 C      -2.2500  -5.0000   0.2500
   11 C      -2.2500  -5.0000  -0.2500
   12 C      -2.2500   5.0000  -0.2500
    1     1     2
    2     2     3
    3     3     4
    4     4     1
    5     5     6
    6     6     7
    7     7     8
    8     5     8
    9     9     5
   10    10     6
   11    11     7
   12    12     8

Content of ‘Block_BP.mol’ (V2000)

XL 3DNAv2 
Converted from Alchemy format: Thu May  3 23:35:20 2012
 12 12  0     0  0              1 V2000
   -2.2500    5.0000    0.2500 N   0  0  0  0  0  0  0  0  0  0  0  0
   -2.2500   -5.0000    0.2500 N   0  0  0  0  0  0  0  0  0  0  0  0
   -2.2500   -5.0000   -0.2500 N   0  0  0  0  0  0  0  0  0  0  0  0
   -2.2500    5.0000   -0.2500 N   0  0  0  0  0  0  0  0  0  0  0  0
    2.2500    5.0000    0.2500 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.2500   -5.0000    0.2500 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.2500   -5.0000   -0.2500 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.2500    5.0000   -0.2500 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.2500    5.0000    0.2500 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.2500   -5.0000    0.2500 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.2500   -5.0000   -0.2500 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.2500    5.0000   -0.2500 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0
  2  3  1  0  0  0
  3  4  1  0  0  0
  4  1  1  0  0  0
  5  6  1  0  0  0
  6  7  1  0  0  0
  7  8  1  0  0  0
  5  8  1  0  0  0
  9  5  1  0  0  0
 10  6  1  0  0  0
 11  7  1  0  0  0
 12  8  1  0  0  0

Content of ‘Block_BP_v3000.mol’ (V3000)

XL 3DNAv2 
Converted from Alchemy format: Thu May  3 23:22:04 2012
  0  0  0     0  0            999 V3000
M  V30 COUNTS    12    12 0 0 0
M  V30     1 N     -2.2500     5.0000     0.2500 0
M  V30     2 N     -2.2500    -5.0000     0.2500 0
M  V30     3 N     -2.2500    -5.0000    -0.2500 0
M  V30     4 N     -2.2500     5.0000    -0.2500 0
M  V30     5 C      2.2500     5.0000     0.2500 0
M  V30     6 C      2.2500    -5.0000     0.2500 0
M  V30     7 C      2.2500    -5.0000    -0.2500 0
M  V30     8 C      2.2500     5.0000    -0.2500 0
M  V30     9 C     -2.2500     5.0000     0.2500 0
M  V30    10 C     -2.2500    -5.0000     0.2500 0
M  V30    11 C     -2.2500    -5.0000    -0.2500 0
M  V30    12 C     -2.2500     5.0000    -0.2500 0
M  V30     1 1    1    2
M  V30     2 1    2    3
M  V30     3 1    3    4
M  V30     4 1    4    1
M  V30     5 1    5    6
M  V30     6 1    6    7
M  V30     7 1    7    8
M  V30     8 1    5    8
M  V30     9 1    9    5
M  V30    10 1   10    6
M  V30    11 1   11    7
M  V30    12 1   12    8
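As a rough illustration of what such a conversion involves, here is a minimal Python sketch that reads the alchemy layout shown above and writes a V2000 connection table. It is not the alc2img source: the function names parse_alchemy and to_molfile_v2000 are made up for this example, the three header lines are placeholders, the counts line uses generic V2000 defaults rather than 3DNA's exact output, and every bond is written as a single bond on the assumption that the block contains only uniform pseudo-bonds.

```python
# Sketch only: hypothetical helpers, not the alc2img implementation.
BLOCK_BP_ALC = """\
   12 ATOMS,    12 BONDS
    1 N      -2.2500   5.0000   0.2500
    2 N      -2.2500  -5.0000   0.2500
    3 N      -2.2500  -5.0000  -0.2500
    4 N      -2.2500   5.0000  -0.2500
    5 C       2.2500   5.0000   0.2500
    6 C       2.2500  -5.0000   0.2500
    7 C       2.2500  -5.0000  -0.2500
    8 C       2.2500   5.0000  -0.2500
    9 C      -2.2500   5.0000   0.2500
   10 C      -2.2500  -5.0000   0.2500
   11 C      -2.2500  -5.0000  -0.2500
   12 C      -2.2500   5.0000  -0.2500
    1     1     2
    2     2     3
    3     3     4
    4     4     1
    5     5     6
    6     6     7
    7     7     8
    8     5     8
    9     9     5
   10    10     6
   11    11     7
   12    12     8
"""


def parse_alchemy(text):
    """Return (atoms, bonds) from alchemy text laid out as in Block_BP.alc."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].replace(",", " ").split()  # ['12', 'ATOMS', '12', 'BONDS']
    n_atoms, n_bonds = int(header[0]), int(header[2])
    atoms = []
    for ln in lines[1:1 + n_atoms]:
        _, elem, x, y, z = ln.split()[:5]
        atoms.append((elem, float(x), float(y), float(z)))
    bonds = []
    for ln in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        _, a, b = ln.split()[:3]
        bonds.append((int(a), int(b)))
    return atoms, bonds


def to_molfile_v2000(atoms, bonds, title="Block_BP"):
    """Format atoms/bonds as V2000 molfile text (single bonds only)."""
    if len(atoms) > 999 or len(bonds) > 999:
        # The 3-character count fields cannot hold more; V3000 is needed.
        raise ValueError("more than 999 atoms/bonds: use V3000 instead")
    out = [title, "  sketch (placeholder program line)", ""]
    out.append(f"{len(atoms):3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000")
    for elem, x, y, z in atoms:
        out.append(f"{x:10.4f}{y:10.4f}{z:10.4f} {elem:<3s}" + " 0" + "  0" * 11)
    for a, b in bonds:
        out.append(f"{a:3d}{b:3d}  1  0  0  0")
    out.append("M  END")
    return "\n".join(out) + "\n"


print(to_molfile_v2000(*parse_alchemy(BLOCK_BP_ALC)))
```

The 999 check mirrors the limit mentioned in the notes above; the sketch simply refuses larger inputs instead of switching to V3000.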



Difference in shear of neighboring base pairs affects twist angle

In the standard base reference frame report, a whole section is devoted to the discussion of intrinsic correlations between base-pair and dimer step parameters (see figure below). Among the four sets of associations, the effect of Δbuckle (difference in consecutive base-pair buckles) on rise is most noticeable and easiest to understand. The Δshear vs. twist relationship is similarly significant, due to its close connection to the wobble G–T/G–U pair; yet the concept is less comprehensible, especially to occasional 3DNA users. This post aims to address the issue of how Δshear affects twist.

intrinsic correlations of base-pair and dimer step parameters

Under the standard base reference frame used in 3DNA, the wobble base-pair has a ~2.0 Å shear: the displacement is positive for U–G, and negative for G–U [see figure below, examples selected from 5S rRNA (chain 9) U82–G100 and G83–U99 of the Haloarcula marismortui large ribosomal subunit, PDB id: 1jj2].

U-G wobble pair with a positive shear
G-U wobble pair with a negative shear

As noted in the section “treatment of non-Watson–Crick base pairing motifs” of the 3DNA Nucleic Acids Research paper (2003), “Large Shear of the G–U wobble base pair influences the calculated but not the ‘observed’ Twist. The 3DNA numerical values of Twist [of the C7G8·U12G13 and G8C9·G11U12 dimer steps of the Escherichia coli tRNAAsp x-ray crystal structure (PDB id: 485d)], 20° (top) and 43° (bottom), differ from the visualization of nearly equivalent Twist suggested by the angle between successive C1′···C1′ vectors (finely dotted lines).”

Large shear of the G–U wobble base pair influences 3DNA-calculated twist angle.

To make it clear why that’s the case, the figure below shows a G–U wobble pair in atomic representation (top), and a schematic base pair rectangular block of dimension 10×5 (Å, bottom). A shear of –2 Å moves U upwards, as outlined by the dashed rectangle, and causes a ‘misalignment’ of 11.3° between the C1′···C1′ vector (red dotted line) and the base-centered mean y-axis (horizontal line):

atan2(2, 10) * 180 / pi = 11.3°

To a first-order approximation, that is the difference in twist angle. So whenever a wobble pair is next to a normal Watson-Crick pair, there would be a ~11° discrepancy between the “observed” and the 3DNA-calculated twist angle. Moreover, when a G–U wobble is next to a U–G wobble pair or vice versa, the difference would be doubled to ~22°.

how large shear affects 3DNA-calculated twist angle.
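The arithmetic behind the ~11° and ~22° figures can be checked with a few lines of Python. This is only a sketch of the first-order estimate described above; the function name shear_misalignment_deg and its block_length parameter (the 10 Å long edge of the schematic block) are chosen here for illustration.

```python
import math


def shear_misalignment_deg(shear, block_length=10.0):
    """First-order angle (degrees) between the C1'...C1' vector and the
    mean y-axis caused by a shear of the given size (both in angstroms)."""
    return math.degrees(math.atan2(abs(shear), block_length))


single = shear_misalignment_deg(-2.0)  # wobble pair next to a Watson-Crick pair
double = 2 * single                    # G-U wobble next to a U-G wobble
print(f"one wobble pair:     ~{single:.1f} degrees")
print(f"two opposite shears: ~{double:.1f} degrees")
```

Only the magnitude of the shear enters the estimate; its sign decides in which direction the calculated twist deviates.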



blocview: a simple, effective visualization tool for DNA/RNA structures

The blocview script in 3DNA has been created as a handy tool to effectively reveal key features of small to medium-sized nucleic acid structures. Specifically, the bloc part of the name means ‘block’, i.e., the rectangular block in Calladine-Drew style schematic representation to distinguish bases by size (larger purine vs. smaller pyrimidine), identity (red for A, yellow for C, green for G, and blue for T), and groove (minor edge in black). The view part stands for the most extended view, as defined by the principal axes of inertia. Implementation-wise, blocview calls several 3DNA utility programs and MolScript (for protein ribbons and nucleic acid backbone rods) to prepare the scenes, and then uses Raster3D (specifically, render) or PyMOL to generate a PNG image.

The blocview script was originally written in Perl. As of 3DNA v2.1, I decided to switch the scripting language to Ruby for its consistent object-oriented style and its succinct, flexible syntax. Previously available Perl scripts have been moved out of the default 3DNA executable directory $X3DNA/bin/ into $X3DNA/perl_scripts/. The blocview script has been rewritten in Ruby and set as the default (at $X3DNA/bin/blocview); the original Perl version has been renamed (and now lives under $X3DNA/perl_scripts/) to avoid confusion. The command line help message, available via blocview -h, is as below:

Generate a schematic image which combines base block representation
with protein ribbon. The image has informative color coding for the
nucleic acid part and is set in the "best view" by default. Raster3D
(or PyMOL) and ImageMagick must be installed.

        blocview [options] PDBFile
        blocview -i 355d.png 355d.pdb
             # generate image '355d.png'; display 355d.png
     --imgfile, -i <s>:   name of image file (default: blocview.png)
     --r3dfile, -r <s>:   name of .r3d file (default: blocview.r3d)
   --dpi-pymol, -d <i>:   create PyMOL ray-traced image at specific DPI
       --scale, -s <f>:   set scale factor (for 'render' of Raster3D)
        --xrot, -x <f>:   rotation angle about x-axis
        --yrot, -y <f>:   rotation angle about y-axis
        --zrot, -z <f>:   rotation angle about z-axis
        --original, -o:   use original coordinates
  --ball-and-stick, -b:   get a ball-and-stick image
     --p-base-ring, -c:   use only P and base ring atoms
           --no-ds, -n:   do not show double-helix ribbon
         --protein, -p:   set best view based on protein atoms
             --all, -a:   set best view based on all atoms
         --version, -v:   Print version and exit
            --help, -h:   Show this message

Using the x-ray crystal structure of d(GGCCAATTGG) complexed with netropsin (1z8v) in the minor groove as an example, the command to run is as follows:

blocview -i 1z8v.png 1z8v.pdb
# The following two forms are also fine
#     blocview --imgfile 1z8v.png 1z8v.pdb
#     blocview --imgfile=1z8v.png 1z8v.pdb
# The Perl version can be run like this:
#     $X3DNA/perl_scripts/ -i=1z8v.png 1z8v.pdb 

The image, named 1z8v.png, is shown below. Note that it is generated automatically from the PDB-formatted data file 1z8v.pdb. In this representation, one can see clearly that there are two unpaired Gs (green block) at each 5′-end of the two DNA chains (red and yellow rods), and a drug molecule (ball-and-stick) binds in the minor groove (black edge of the rectangular blocks). Moreover, the deformation in propeller and buckle is obvious in this schematic presentation.

blocview-generated schematic image for 1z8v
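For processing many structures at once, the same command can be wrapped in a small script. The Python sketch below is one possible way to do so; it uses only the -i option documented in the help message above, and the helper names blocview_command and run_blocview are hypothetical. It assumes blocview is on the PATH and that each image should be written next to its PDB file.

```python
import shutil
import subprocess
from pathlib import Path


def blocview_command(pdb_file):
    """Build the argument list 'blocview -i <name>.png <name>.pdb'."""
    pdb = Path(pdb_file)
    return ["blocview", "-i", pdb.with_suffix(".png").name, pdb.name]


def run_blocview(pdb_file):
    """Run blocview inside the directory holding pdb_file; return the image path."""
    pdb = Path(pdb_file)
    if shutil.which("blocview") is None:
        raise RuntimeError("blocview not found on PATH; is 3DNA set up?")
    subprocess.run(blocview_command(pdb), cwd=pdb.parent, check=True)
    return pdb.with_suffix(".png")


# Show the command that would be run for the example in this post.
print(" ".join(blocview_command("1z8v.pdb")))
```

A batch run would then be a loop such as `for pdb in sorted(Path(".").glob("*.pdb")): run_blocview(pdb)`.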

Over the years, blocview-generated images have been used in NDB for virtually all nucleic acid structures (see for example, the NDB atlas gallery for x-ray drug-DNA complexes). It’s worth noting that such simple images have also been adopted by the RCSB PDB, prominently at the summary page, for nucleic acid containing structures (see PDB entry 1z8v). Given the effectiveness of blocview-generated schematic representation and its adoption by the NDB and PDB, I’m hopeful that blocview will be more widely used by the general DNA/RNA structure community. As always, I value user feedback in continuously refining the script.



Rectangular block representation of base-pairs in 3DNA

One of 3DNA’s unique features is the simplified rectangular block representation of bases and base-pairs, as shown in the figure below. This type of schematic depiction was first made popular by Calladine and Drew (see their book titled Understanding DNA — The Molecule & How It Works), thus I usually call it the Calladine-Drew style representation.

rectangular blocks used in 3DNA

By default, a base-pair [BP, (a)] has dimensions of 10×4.5×0.5 (Å); a purine [R, (b) left] 4.5×4.5×0.5 (Å); a pyrimidine [Y, (b) right] 3×4.5×0.5 (Å); and a mean base [M, (c)], which is exactly half of the base-pair, 5×4.5×0.5 (Å).

The blocks are stored in separate files: Block_BP.alc, Block_R.alc, and Block_Y.alc for BP, R and Y respectively. To use M for R and Y (i.e., set R and Y to be of equal size), simply copy the file Block_M.alc over Block_R.alc and Block_Y.alc, either in the current working directory for local effect or in the 3DNA installation directory ($X3DNA/config/) for global impact. These blocks are used in the rebuilding and visualization components of 3DNA.

Following SCHNArP, 3DNA uses alchemy, a simple chemical file format, to specify explicitly the nodes (atoms) and edges (bonds) of a rectangular block. Three file formats (alchemy, MDL molfile, and Tripos mol2), all supported by RasMol v2.6 (the most popular molecular graphics visualization program in the 1990s), could serve the purpose of specifying the rectangular block. I cannot recall exactly why I picked -alchemy over -mdl and -mol2, perhaps because of its simplicity: I played around with sample alchemy files and came up with the alchemy rectangular block files used by SCHNArP, without much difficulty.

As an example, Block_BP.alc has the following content:

   12 ATOMS,    12 BONDS
    1 N      -2.2500   5.0000   0.2500
    2 N      -2.2500  -5.0000   0.2500
    3 N      -2.2500  -5.0000  -0.2500
    4 N      -2.2500   5.0000  -0.2500
    5 C       2.2500   5.0000   0.2500
    6 C       2.2500  -5.0000   0.2500
    7 C       2.2500  -5.0000  -0.2500
    8 C       2.2500   5.0000  -0.2500
    9 C      -2.2500   5.0000   0.2500
   10 C      -2.2500  -5.0000   0.2500
   11 C      -2.2500  -5.0000  -0.2500
   12 C      -2.2500   5.0000  -0.2500
    1     1     2
    2     2     3
    3     3     4
    4     4     1
    5     5     6
    6     6     7
    7     7     8
    8     5     8
    9     9     5
   10    10     6
   11    11     7
   12    12     8

Observant readers may notice that nodes 1-4 are specified as nitrogens (N) and have exactly the same coordinates as nodes 9-12 (carbons, C). This is a little trick to make RasMol display the minor groove edge in a different color (blue for N) from the other five sides of the rectangular block (gray for C), as shown in the following figure:

base-pair rectangular block

Note that the rectangular block is preset in the standard base reference frame. Thus the nodes have y-coordinates of +5 Å and -5 Å along the long edge of the base pair, and x-coordinates of +2.25 Å and -2.25 Å along the short edge.

As an extra bonus of storing the rectangular blocks in external alchemy text files, the dimensions of the blocks can be readily changed. For example, the thickness of a block (z-coordinates) can be easily increased from 0.5 to 1.0 Å to make it thicker. Moreover, the blocks do not need to be rectangular either — they can appear to be triangular blocks.
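To make the point about adjustable dimensions concrete, the following Python sketch regenerates a block file of the same layout for arbitrary sizes. It is an illustration rather than part of 3DNA: make_block_alchemy is a hypothetical helper, and the column widths are inferred from the Block_BP.alc listing above, so they are an assumption about what readers of the format expect.

```python
def make_block_alchemy(x_size=4.5, y_size=10.0, z_size=0.5):
    """Return alchemy text for a rectangular block centred at the origin.

    Nodes 1-4 (N) mark the minor groove edge at x = -x_size/2 and are
    duplicated as nodes 9-12 (C), following the layout of Block_BP.alc.
    """
    hx, hy, hz = x_size / 2, y_size / 2, z_size / 2
    corners = [(hy, hz), (-hy, hz), (-hy, -hz), (hy, -hz)]  # (y, z) pairs
    atoms = [("N", -hx, y, z) for y, z in corners]   # nodes 1-4
    atoms += [("C", hx, y, z) for y, z in corners]   # nodes 5-8
    atoms += [("C", -hx, y, z) for y, z in corners]  # nodes 9-12
    bonds = [(1, 2), (2, 3), (3, 4), (4, 1),
             (5, 6), (6, 7), (7, 8), (5, 8),
             (9, 5), (10, 6), (11, 7), (12, 8)]
    lines = [f"{len(atoms):5d} ATOMS, {len(bonds):5d} BONDS"]
    for i, (elem, x, y, z) in enumerate(atoms, start=1):
        lines.append(f"{i:5d} {elem:<4s} {x:9.4f}{y:9.4f}{z:9.4f}")
    for i, (a, b) in enumerate(bonds, start=1):
        lines.append(f"{i:5d}{a:6d}{b:6d}")
    return "\n".join(lines) + "\n"


# A base-pair block twice as thick as the default (1.0 instead of 0.5).
print(make_block_alchemy(z_size=1.0))
```

With the default arguments the function reproduces the node coordinates and bond list of Block_BP.alc shown above.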

It’s worth noting that while extensively used in 3DNA for schematic representations, the alchemy format has largely become a legacy format in cheminformatics/bioinformatics nowadays. Searching the internet, I cannot find a specification of the format. Moreover, the support of alchemy is quite limited and buggy in the molecular graphics visualization programs most widely used today: PyMOL does not understand this format at all; RasMol v2.7 has a bug in interpreting it; only Jmol can properly read 3DNA base-pair rectangular block files in alchemy [see initial discussion and follow-up]. To resolve the issues associated with the alchemy format, and thus to make 3DNA base-pair block schematics more widely available, I have recently added a converter in v2.1 to readily transform alchemy to MDL molfile, a format consistently supported by PyMOL, Jmol and RasMol. I’ll talk about this feature in another post.



Two slightly different definitions of sugar pucker

The conformation of the five-membered sugar ring in DNA/RNA structures can be characterized by the five consecutive endocyclic torsion angles (see Figure below), i.e.,

ν0: C4′-O4′-C1′-C2′
ν1: O4′-C1′-C2′-C3′
ν2: C1′-C2′-C3′-C4′
ν3: C2′-C3′-C4′-O4′
ν4: C3′-C4′-O4′-C1′

Due to the ring constraint, the conformation can be characterized approximately by 5-3=2 parameters. Using the concept of pseudorotation of the sugar ring, the two parameters are the amplitude (τm) and phase angle (P).

One widely used set of formulas for converting the five torsion angles to the pseudorotation parameters is due to Altona & Sundaralingam (1972): Conformational Analysis of the Sugar Ring in Nucleosides and Nucleotides. A New Description Using the Concept of Pseudorotation [J. Am. Chem. Soc., 94(23), pp. 8205–8212]. The concept is easily illustrated with an example, here the G4 sugar ring on chain A of the Dickerson dodecamer (1bna), using Octave/Matlab code:

# xyz coordinates of the G4 sugar ring on chain A of 1bna
# ATOM     63  C4'  DG A   4      21.393  16.960  18.505  1.00 53.00
# ATOM     64  O4'  DG A   4      20.353  17.952  18.496  1.00 38.79
# ATOM     65  C3'  DG A   4      21.264  16.229  17.176  1.00 56.72
# ATOM     67  C2'  DG A   4      20.793  17.368  16.288  1.00 40.81
# ATOM     68  C1'  DG A   4      19.716  17.901  17.218  1.00 30.52

# endocyclic torsion angles:
v0 = -26.7; v1 = 46.3; v2 = -47.1; v3 = 33.4; v4 = -4.4;
Pconst = sin(pi/5) + sin(pi/2.5);  # 1.5388
P0 = atan2(v4 + v1 - v3 - v0, 2.0 * v2 * Pconst); # 2.9034
tm = v2 / cos(P0)  # amplitude: 48.469
P = 180/pi * P0  # phase angle: 166.35 [P + 360 if P0 < 0]
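The snippet above takes the five endocyclic torsion angles as given. As a hedged illustration of where they come from, the Python sketch below derives them from the coordinates listed in the comments, using the common atan2 formulation of a torsion angle. The helper names are invented for this example, and the formulation is not necessarily the one 3DNA uses internally.

```python
import math

# Coordinates of the G4 sugar ring (chain A of 1bna), as listed above.
G4_SUGAR = {
    "C4'": (21.393, 16.960, 18.505),
    "O4'": (20.353, 17.952, 18.496),
    "C3'": (21.264, 16.229, 17.176),
    "C2'": (20.793, 17.368, 16.288),
    "C1'": (19.716, 17.901, 17.218),
}
RING_ORDER = ["C4'", "O4'", "C1'", "C2'", "C3'"]  # v0 starts at C4'


def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))


def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def torsion_deg(a, b, c, d):
    """Signed torsion angle a-b-c-d in degrees, in the range (-180, 180]."""
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def endocyclic_torsions(atoms):
    """Return [v0, v1, v2, v3, v4] by walking around the five-membered ring."""
    torsions = []
    for k in range(5):
        names = [RING_ORDER[(k + j) % 5] for j in range(4)]
        torsions.append(torsion_deg(*(atoms[n] for n in names)))
    return torsions


for k, value in enumerate(endocyclic_torsions(G4_SUGAR)):
    print(f"v{k} = {value:6.1f}")
```

The values printed for v0 and v2 can be compared directly with the -26.7 and -47.1 used in the Octave/Matlab snippet.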

The Altona & Sundaralingam (1972) pseudorotation parameters are the ones adopted in 3DNA. The Curves+ program, on the other hand, uses another set of formulas due to Westhof & Sundaralingam (1983): A Method for the Analysis of Puckering Disorder in Five-Membered Rings: The Relative Mobilities of Furanose and Proline Rings and Their Effects on Polynucleotide and Polypeptide Backbone Flexibility. [J. Am. Chem. Soc., 105(4), pp. 970–976]. The two sets of formulas, Altona & Sundaralingam (1972) and Westhof & Sundaralingam (1983), give slightly different numerical values for the two pseudorotation parameters (see below).

Since Curves+ and 3DNA are currently the most commonly used programs for conformational analysis of nucleic acid structures, the subtle differences in these two pseudorotation parameters may cause confusion for users who use both programs. With the same G4 (on chain A of 1bna) sugar ring, here is the Octave/Matlab script showing how Curves+ calculates the pseudorotation parameters:

# xyz coordinates of the G4 sugar ring on chain A of 1bna
# endocyclic torsion angles, same as above
v0 = -26.7; v1 = 46.3; v2 = -47.1; v3 = 33.4; v4 = -4.4;

v = [v2, v3, v4, v0, v1]; # reorder them into vector v[]
A = 0; B = 0;
for i = 1:5
    t = 0.8 * pi * (i - 1);
    A += v(i) * cos(t);
    B += v(i) * sin(t);
end
A *= 0.4;   # -48.476
B *= -0.4;  # 11.516

tm = sqrt(A * A + B * B);  # 49.825

c = A/tm; s = B/tm;
P = atan2(s, c) * 180 / pi;  # 166.64

For this specific example, i.e., the G4 sugar ring on chain A of 1bna, the pseudorotation parameters as calculated by 3DNA following Altona & Sundaralingam (1972) and Curves+ following Westhof & Sundaralingam (1983) are as follows:

           amplitude (τm)    phase angle (P)
3DNA           48.469            166.35
Curves+        49.825            166.64

Needless to say, for the majority of cases like the one shown here, the differences are subtle; very few people would notice them or be bothered at all. For those who do care about such little details, however, this post shows where the discrepancies really come from.
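For readers who prefer to compare the two conventions side by side on their own data, here is a Python transcription of the two Octave/Matlab snippets above as reusable functions. The function names are made up for this sketch, and the phase angle is wrapped into the range [0, 360), which corresponds to the "P + 360" remark in the first snippet.

```python
import math


def pucker_altona_sundaralingam(v0, v1, v2, v3, v4):
    """(tm, P) following the Altona & Sundaralingam (1972) snippet above."""
    pconst = math.sin(math.pi / 5) + math.sin(math.pi / 2.5)
    p0 = math.atan2(v4 + v1 - v3 - v0, 2.0 * v2 * pconst)
    tm = v2 / math.cos(p0)  # undefined when P is exactly 90 or 270 degrees
    return tm, math.degrees(p0) % 360.0


def pucker_westhof_sundaralingam(v0, v1, v2, v3, v4):
    """(tm, P) following the Westhof & Sundaralingam (1983) snippet above."""
    v = [v2, v3, v4, v0, v1]  # same reordering as in the Octave code
    a = 0.4 * sum(vi * math.cos(0.8 * math.pi * i) for i, vi in enumerate(v))
    b = -0.4 * sum(vi * math.sin(0.8 * math.pi * i) for i, vi in enumerate(v))
    return math.hypot(a, b), math.degrees(math.atan2(b, a)) % 360.0


g4 = (-26.7, 46.3, -47.1, 33.4, -4.4)  # G4 on chain A of 1bna
for label, func in [("Altona & Sundaralingam", pucker_altona_sundaralingam),
                    ("Westhof & Sundaralingam", pucker_westhof_sundaralingam)]:
    tm, phase = func(*g4)
    print(f"{label:24s} tm = {tm:7.3f}  P = {phase:7.2f}")
```

Running both functions over all sugars of a structure gives a quick sense of how large the differences between the two conventions are in practice.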



Building a bridge between Curves+ and 3DNA

In the field of nucleic acid structural analysis, it seems fair to say that Curves+ and 3DNA are nowadays the top two choices. To the best of my knowledge, these two programs are also the only ones that conform to the standard base reference frame. Moreover, as noted in my previous post, Curves+ and 3DNA are “constructive competitors” with complementary functionality: Curves+ is unique in providing a curvilinear helical axis, a bending analysis, a full description of groove widths and depths, and seamless integration with the analysis of molecular dynamics trajectories, while 3DNA’s strength lies in its coherent approach combining analysis, rebuilding, and visualization into one package.

Given the complementarity between Curves+ and 3DNA, it makes sense to build a ‘bridge’ between the two so users can easily take advantage of both programs. Starting from 3DNA v1.5, find_pair has the -c option to generate input for Curves directly from a PDB file. Over the years, this option appears to have received little attention; at least, I am not aware of any literature reference to it. Now, the updated Curves+ program has introduced the new lib namelist variable, among other changes. I have thus added the -curves+ option (abbreviation -c+) to find_pair to make its output compatible with Curves+.

As always, the point/process is best illustrated with an example — here with the Dickerson B-DNA dodecamer solved at high resolution by Williams et al. (PDB entry 355d).

find_pair -c+ 355d.pdb 355d-curves+.inp

The generated file 355d-curves+.inp has the following content:

&inp file=355d.pdb,
    2    1   -1    0    0
    1    2    3    4    5    6    7    8    9   10   11   12
   24   23   22   21   20   19   18   17   16   15   14   13

which can be fed into Curves+ as below,

Cur+ < 355d-curves+.inp

The four output files are: 355d.cda, 355d.lis, 355d_X.pdb and 355d_b.pdb.

Please note the following:

  • The environment variable CURVES_PLUS_STDLIB should be set, pointing to the directory where Curves+ is installed (containing files standard_b.lib and standard_s.lib). In the example above, CURVES_PLUS_STDLIB is set to /Users/xiangjun/Curves+.
  • The find_pair -c+ option (currently) is applicable only to double helical DNA/RNA structures, the most common application scenario.
  • The -c+ option ignores HETATM records, in accordance with Curves+, where proteins, water and HETATM are automatically removed at input (see the Curves+ user manual, section Input data).
  • To run Cur+ < 355d-curves+.inp again in the same folder, the four output files must first be deleted (e.g., rm -f 355d.cda 355d.lis 355d_[Xb].pdb). This is best taken care of via a script.
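The clean-up step in the last note is easy to script. The Python sketch below is one way to do it, under stated assumptions: the four output names follow the pattern of the 355d example above, the output stem is passed in explicitly rather than parsed from the input file, and the helper names (curves_output_files, remove_stale_outputs, run_curves) are hypothetical. The small demonstration at the end works on dummy files in a temporary folder and does not call Cur+.

```python
import subprocess
import tempfile
from pathlib import Path


def curves_output_files(stem, folder="."):
    """Paths of the four Curves+ output files for a given output stem."""
    folder = Path(folder)
    return [folder / f"{stem}.cda", folder / f"{stem}.lis",
            folder / f"{stem}_X.pdb", folder / f"{stem}_b.pdb"]


def remove_stale_outputs(stem, folder="."):
    """Delete any existing output files; return the names that were removed."""
    removed = []
    for path in curves_output_files(stem, folder):
        if path.exists():
            path.unlink()
            removed.append(path.name)
    return removed


def run_curves(inp_file, stem, folder="."):
    """Clean up old outputs, then run 'Cur+ < inp_file' inside folder."""
    remove_stale_outputs(stem, folder)
    with open(Path(folder) / inp_file) as inp:
        subprocess.run(["Cur+"], stdin=inp, cwd=folder, check=True)


# Demonstration of the clean-up step only, on empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for path in curves_output_files("355d", tmp):
        path.touch()
    demo_removed = remove_stale_outputs("355d", tmp)
    demo_left = sorted(p.name for p in Path(tmp).iterdir())
print(demo_removed)
```

A real run would then be `run_curves("355d-curves+.inp", "355d")`, with CURVES_PLUS_STDLIB set as described above.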

Obviously, the nucleic acid structure community benefits most from having both Curves+ and 3DNA at its disposal and being able to switch easily between them. Hopefully, the find_pair -c+ option will serve as such a ‘bridge’.



Torsion angles of nucleic acid structures

From the very beginning, 3DNA has calculated a set of nucleic acid backbone parameters, including the six main chain torsion angles (α, β, γ, δ, ε, and ζ) around the covalent bonds, χ about the glycosidic bond, and the sugar pucker (see figure below). For double helical structures, the standard analyze output (.out file) has a section for “Main chain and chi torsion angles,” and another dedicated to “Sugar conformational parameters”. Based on my experience/understanding, these two parts are well recognized and utilized by 3DNA users. What has received little attention (in spite of the several posts I’ve written on the topic), though, is 3DNA’s applicability to single-stranded (ss) RNA structures for the backbone torsions, among other parameters. Using the fully refined crystal structure of the Haloarcula marismortui large ribosomal subunit (PDB entry 1jj2) as an example, the procedure is below:

find_pair -s 1jj2.pdb 1jj2.nts
analyze 1jj2.nts
    # or the above two steps can be combined:
find_pair -s 1jj2.pdb stdout | analyze stdin
    # see output file '1jj2.outs'

In retrospect, the fact that 3DNA has been little used for RNA backbone conformational analysis is no surprise:

  • While base-pair parameters have different (oftentimes confusing) definitions, these backbone parameters are pretty “standard” — thus, for example, any program for DNA/RNA structural analysis would give the same numerical values for α or χ torsion angles.
  • The two-step process as illustrated above is a bit awkward, and the torsions are “buried” among many other parameters.
  • 3DNA is more directly “linked” (conceivably) to DNA base pairs than to RNA backbone.

So while adapting the Zp parameter for ss DNA/RNA structures in 3DNA v2.1, I also took the opportunity to add the -torsion option to analyze, with the following handy features:

  • Streamline the calculation by starting directly from a PDB file and outputting only backbone parameters. So the above example can be shortened to analyze -t=1jj2.tor 1jj2.pdb; the output file is named 1jj2.tor.
  • Classify backbone into BI/BII conformation, and base χ into syn / anti.
  • Add pseudo-torsions, and Zp and Dp as defined by Richardson et al.
  • Handle pseudouridine sensibly, and work also for nucleic acid structure with only backbone atoms.
  • Be easy to use, efficient and robust — it takes ~1 second to process the large ribosomal subunit 1jj2 (with 2876 nucleotides consisting of 23S rRNA and 5S rRNA) on my MacBook Air.

Overall, analyze -torsion is designed to be pragmatic and allows for automatic processing of all NDB entries or molecular dynamics trajectories. Given below is an excerpt of the three sections from an analyze -torsion run on 1jj2:

Main chain and chi torsion angles: 

Note: alpha:   O3'(i-1)-P-O5'-C5'
      beta:    P-O5'-C5'-C4'
      gamma:   O5'-C5'-C4'-C3'
      delta:   C5'-C4'-C3'-O3'
      epsilon: C4'-C3'-O3'-P(i+1)
      zeta:    C3'-O3'-P(i+1)-O5'(i+1)

      chi for pyrimidines(Y): O4'-C1'-N1-C2
          chi for purines(R): O4'-C1'-N9-C4

          If chi is in range [-90, +90], syn conformation
              otherwise, it is in anti conformation

          e-z: epsilon - zeta
              BI:  e-z = [-160, +20]
              BII: e-z = [+20, +200]

              base      chi         alpha    beta   gamma   delta  epsilon   zeta     e-z
   1 0:..10_:[..U]U   -62.5(syn)     ---     ---     56.2    74.0   142.2   -87.8  -130.1(BI) 
   2 0:..11_:[..A]A   171.5(anti)   173.2  -161.0   168.5    84.0  -112.1   -65.4   -46.7(BI) 
   3 0:..12_:[..U]U  -167.7(anti)   -70.7   168.4    53.0    78.5  -128.5   -46.4   -82.1(BI) 
   4 0:..13_:[..G]G  -172.5(anti)   -61.8   170.4    67.7    73.5  -166.7   -79.6   -87.1(BI) 
   5 0:..14_:[..C]C  -166.0(anti)   -73.0  -172.5    55.1    83.2  -143.3   -77.7   -65.6(BI) 
   6 0:..15_:[..C]C  -155.5(anti)   -60.9   174.1    47.3    80.3  -154.4   -71.2   -83.2(BI) 

Pseudo (virtual) eta/theta torsion angles:

Note: eta:    C4'(i-1)-P(i)-C4'(i)-P(i+1)
      theta:  P(i)-C4'(i)-P(i+1)-C4'(i+1)

      eta':   C1'(i-1)-P(i)-C1'(i)-P(i+1)
      theta': P(i)-C1'(i)-P(i+1)-C1'(i+1)

      eta":   Borg(i-1)-P(i)-Borg(i)-P(i+1)
      theta": P(i)-Borg(i)-P(i+1)-Borg(i+1)

              base      eta   theta    eta'  theta'    eta"  theta"
   1 0:..10_:[..U]U    ---     ---     ---     ---     ---     --- 
   2 0:..11_:[..A]A  -174.6  -129.7   177.0  -127.7  -157.5   -75.5
   3 0:..12_:[..U]U   149.1  -105.1   174.1  -101.2  -111.0   -69.4
   4 0:..13_:[..G]G   169.0  -172.5  -156.6  -169.2   -93.3  -137.1
   5 0:..14_:[..C]C   176.2  -143.4   179.6  -140.6  -144.6  -120.6
   6 0:..15_:[..C]C   165.0  -147.7   177.4  -146.8  -149.2  -121.7

Sugar conformational parameters: 

Note: v0: C4'-O4'-C1'-C2'
      v1: O4'-C1'-C2'-C3'
      v2: C1'-C2'-C3'-C4'
      v3: C2'-C3'-C4'-O4'
      v4: C3'-C4'-O4'-C1'

      tm: the amplitude of pucker
      P:  the phase angle of pseudorotation

      Zp: z-coordinate of the 3' phosphorus atom (P) expressed in the
            standard base reference frame; it's POSITIVE when P is on
            the +z-axis side (base in anti conformation); NEGATIVE if
            P is on the -z-axis side (base in syn conformation)
      Dp: perpendicular distance of the 3' P atom to the glycosydic bond
            [as per the MolProbity paper of Richardson et al. (2010)]

              base       v0      v1      v2      v3      v4     tm       P    Puckering    Zp      Dp
   1 0:..10_:[..U]U   -11.3   -15.4    34.5   -41.8    33.5    41.6    33.8    C3'-endo   -0.13    3.53
   2 0:..11_:[..A]A    11.4   -30.2    36.9   -31.5    12.6    36.9     1.2    C3'-endo    4.74    4.78
   3 0:..12_:[..U]U     3.6   -29.3    42.4   -41.3    23.8    43.6    13.9    C3'-endo    4.67    4.82
   4 0:..13_:[..G]G   -13.0   -17.8    39.8   -47.9    38.5    47.8    33.7    C3'-endo    4.45    4.46
   5 0:..14_:[..C]C     6.0   -28.4    38.9   -36.5    19.2    39.5    10.1    C3'-endo    4.57    4.70
   6 0:..15_:[..C]C     1.9   -26.4    39.6   -39.6    23.7    41.2    16.0    C3'-endo    4.32    4.61
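The classification rules stated in the notes of the output above are simple enough to restate in code. The Python sketch below is an interpretation, not 3DNA's implementation: the function names are hypothetical, the wrapping of ε−ζ into the window [-160, 200) is inferred from the listed ranges and from the first row (142.2 and -87.8 reported as -130.1), and a value of exactly +20 is arbitrarily assigned to BII.

```python
def classify_chi(chi):
    """'syn' if chi lies in [-90, +90] degrees, otherwise 'anti'."""
    return "syn" if -90.0 <= chi <= 90.0 else "anti"


def epsilon_minus_zeta(epsilon, zeta):
    """epsilon - zeta wrapped into [-160, 200) degrees (inferred window)."""
    diff = epsilon - zeta
    while diff < -160.0:
        diff += 360.0
    while diff >= 200.0:
        diff -= 360.0
    return diff


def classify_backbone(epsilon, zeta):
    """'BI' for e-z in [-160, +20), 'BII' for e-z in [+20, +200)."""
    return "BI" if epsilon_minus_zeta(epsilon, zeta) < 20.0 else "BII"


# First two nucleotides of the 1jj2 excerpt: (chi, epsilon, zeta).
for chi, eps, zeta in [(-62.5, 142.2, -87.8), (171.5, -112.1, -65.4)]:
    print(classify_chi(chi), f"{epsilon_minus_zeta(eps, zeta):7.1f}",
          classify_backbone(eps, zeta))
```

Because the printed torsions are rounded to one decimal, differences recomputed from them can deviate by 0.1 from the e-z column in the excerpt.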



How to calculate torsion angle?

Given the x-, y-, and z-coordinates of four points (a-b-c-d) in 3-dimensional (3D) space, how does one calculate the torsion angle? Overall, this is a well-solved problem in structural biology and chemistry; one can find a description of the torsion angle in many textbooks and online documents. The algorithm for its calculation is implemented in virtually every software package in computational structural biology and chemistry.

As basic as the concept is, however, it is important (based on my experience) to have a clear understanding of how torsion angle is defined in order to really get into the 3D world. Here is a worked example using Octave/Matlab of my simplified, geometry-based implementation of calculating torsion angle, including how to determine its sign. No theory or (complicated) mathematical formula, just a step-by-step illustration of how I solve this problem.

  • Coordinates of four points are given in variable abcd:
abcd = [ 21.350  31.325  22.681
         22.409  31.286  21.483
         22.840  29.751  21.498
         23.543  29.175  22.594 ];
  • Two auxiliary functions: norm_vec() to normalize a vector; get_orth_norm_vec() to get the orthogonal component (normalized) of a vector with reference to another vector, which should have already been normalized.
function ovec = norm_vec(vec)
    ovec = vec / norm(vec);
end
function ovec = get_orth_norm_vec(vec, vref)
    temp = vec - vref * dot(vec, vref);
    ovec = norm_vec(temp);
end
  • Get three vectors: b_c is the normalized vector b→c; b_a_orth is the orthogonal component (normalized) of vector b→a with reference to b→c; c_d_orth is similarly defined, as the orthogonal component (normalized) of vector c→d with reference to b→c.
b_c = norm_vec(abcd(3, :) - abcd(2, :))
    % [0.2703158  -0.9627257   0.0094077]
b_a_orth = get_orth_norm_vec(abcd(1, :) - abcd(2, :), b_c)
    % [-0.62126  -0.16696   0.76561]
c_d_orth = get_orth_norm_vec(abcd(4, :) - abcd(3, :), b_c)
    % [0.41330   0.12486   0.90199]
  • Now the torsion angle is defined as the angle between the two vectors, b_a_orth and c_d_orth, and can be easily calculated from their dot product. The sign of the torsion angle is determined by the orientation of the cross product of the same two vectors relative to the middle vector b→c. Here they point in opposite directions, thus the torsion angle is negative.
angle_deg = acos(dot(b_a_orth, c_d_orth)) * 180 / pi  % 65.609
sgn = dot(cross(b_a_orth, c_d_orth), b_c)  % -0.91075
if (sgn < 0)
    angle_deg = -angle_deg  % -65.609
end

A related concept is the so-called dihedral angle, or more generally the angle between two planes. As long as the normal vectors to the two corresponding planes are defined, the angle between them is easy to work out.
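The plane-normal view can be sketched in a few lines. The following is a minimal Python illustration (standard library only, not the 3DNA implementation): it takes the normals of the planes a-b-c and b-c-d and uses atan2 to obtain the angle and its sign in one step. The function name torsion_deg is made up for this example; the coordinates are the four points used above.

```python
import math

def sub(u, v):
    return [u[k] - v[k] for k in range(3)]

def dot(u, v):
    return sum(u[k] * v[k] for k in range(3))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def torsion_deg(a, b, c, d):
    """Signed torsion angle a-b-c-d in degrees, within [-180, 180]."""
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    n1 = cross(b1, b2)  # normal to the plane a-b-c
    n2 = cross(b2, b3)  # normal to the plane b-c-d
    x = dot(n1, n2)                           # proportional to cos(torsion)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)  # proportional to sin(torsion), same scale
    return math.degrees(math.atan2(y, x))

abcd = [(21.350, 31.325, 22.681),
        (22.409, 31.286, 21.483),
        (22.840, 29.751, 21.498),
        (23.543, 29.175, 22.594)]

print(round(torsion_deg(*abcd), 1))  # -65.6, in line with the step-by-step result
```

Using atan2 avoids the separate sign test and is better behaved than acos when the angle is close to 0° or 180°.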

It’s worth noting that the helical twist angle in SCHNAaP and 3DNA is calculated similarly.



Pseudo-torsions to simplify the representation of DNA/RNA backbone conformation

Backbone conformation of nucleic acid structures is most commonly characterized by a set of six torsion angles (α, β, γ, δ, ε, and ζ) around the consecutive chemical bonds, with chi (χ) quantifying the relative base/sugar orientation, plus the sugar pucker.

This large number of DNA/RNA backbone conformational parameters is in striking contrast to the two torsion angles (φ and ψ) in protein structures, routinely employed in the Ramachandran plot. Over the years, the nucleic acid community has come up with simplified ways to represent DNA/RNA backbone conformation. Thus far, the most widely used one is the pair of pseudo-torsion angles (see figure below) η: C4′(i-1)-P(i)-C4′(i)-P(i+1) and θ: P(i)-C4′(i)-P(i+1)-C4′(i+1).

The history of the P—C4′ virtual-bond concept and its application in RNA structure analysis have recently been reviewed by Pyle et al. in A new way to see RNA [Q Rev Biophys. 2011, 44(4), 433—466], where the following three contributions are highlighted:

  1. Olson (1980). Configurational statistics of polynucleotide chains. An updated virtual bond model to treat effects of base stacking. Macromolecules, 13(3), 721–728.
  2. Malathi & Yathindra (1980). A novel virtual bond scheme to probe ordered and random coil conformations of nucleic acids: Configurational statistics of polynucleotide chains. Current Science, 49, 803–807.
  3. Duarte & Pyle (1998). Stepping through an RNA structure: A novel approach to conformational analysis. Journal of Molecular Biology, 284, 1465–1478.

More recently, Pyle et al. also employed a modified version of the pseudo-torsions, η′: C1′(i-1)-P(i)-C1′(i)-P(i+1) and θ′: P(i)-C1′(i)-P(i+1)-C1′(i+1), i.e., using C1′ instead of C4′, and found that:

The η′ and θ′ torsions are more suitable when interpreting crystallographic density because the C1′ atom is covalently bound to the nucleoside base and therefore can be more easily and accurately located within a low-resolution map.

While implementing the -torsion option of analyze, which makes it more explicit that 3DNA readily calculates conventional backbone torsion angles, I also took the opportunity to add the pseudo-torsion angles (η/θ and η′/θ′), among other new parameters. Moreover, while I was at it, I could not help but also compute yet another set of pseudo-torsion angles: η″/θ″. Here, instead of C1′ or C4′, the origin of the base reference frame is employed; it can be taken as a pseudo-atom more accurately defined by the base plane than any real single atom.
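Since the three variants differ only in the pivot point (C4′, C1′, or the base origin), they can share one routine. Below is a minimal Python sketch under that assumption; it is not how analyze is implemented. The names pseudo_torsions and helix_point are hypothetical, and the coordinates are made-up points on an ideal helix rather than data from a real structure.

```python
import math

def torsion_deg(a, b, c, d):
    """Signed torsion angle a-b-c-d in degrees."""
    def sub(u, v):
        return [u[k] - v[k] for k in range(3)]
    def dot(u, v):
        return sum(u[k] * v[k] for k in range(3))
    def cross(u, v):
        return [u[1] * v[2] - u[2] * v[1],
                u[2] * v[0] - u[0] * v[2],
                u[0] * v[1] - u[1] * v[0]]
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.degrees(math.atan2(math.sqrt(dot(b2, b2)) * dot(b1, n2), dot(n1, n2)))

def pseudo_torsions(p, x):
    """Return (eta, theta) per nucleotide; p[i] is the P atom and x[i] the
    pivot point (C4', C1', or base origin) of nucleotide i. Undefined
    values at the chain ends are reported as None."""
    n = len(p)
    result = []
    for i in range(n):
        # eta:   X(i-1)-P(i)-X(i)-P(i+1);  theta: P(i)-X(i)-P(i+1)-X(i+1)
        eta = torsion_deg(x[i - 1], p[i], x[i], p[i + 1]) if 0 < i < n - 1 else None
        theta = torsion_deg(p[i], x[i], p[i + 1], x[i + 1]) if i < n - 1 else None
        result.append((eta, theta))
    return result

def helix_point(radius, angle_deg, z):
    t = math.radians(angle_deg)
    return (radius * math.cos(t), radius * math.sin(t), z)

# Made-up ideal helix: arbitrary radii, twist and rise, for illustration only.
p = [helix_point(8.7, 32.7 * i, 2.8 * i) for i in range(6)]
x = [helix_point(7.8, 32.7 * i + 20.0, 2.8 * i + 1.0) for i in range(6)]
result = pseudo_torsions(p, x)
for i, (eta, theta) in enumerate(result, start=1):
    print(i, eta, theta)
```

On such a regular helix every internal nucleotide gets the same η and the same θ, which is a quick sanity check; on a real structure the values vary from residue to residue.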

The usefulness of η″/θ″, especially in comparison with η/θ and η′/θ′, remains to be determined. However, only η″/θ″ uniquely takes advantage of the two most accurately determined entities in a nucleic acid structure, the heavy phosphorus atom and the rigid base plane [see discussion (p.16) in the Richardson et al. MolProbity paper, Acta Cryst. (2010). D66, 12–21]. Presumably, η″/θ″ provides a new perspective in RNA structural analysis by combining the backbone and the base.

Here are the pseudo-torsions for the yeast phenylalanine transfer RNA (6tna), obtained by simply running analyze -torsion=6tna.tor 6tna.pdb:

Pseudo (virtual) eta/theta torsion angles:

Note: eta:    C4'(i-1)-P(i)-C4'(i)-P(i+1)
      theta:  P(i)-C4'(i)-P(i+1)-C4'(i+1)

      eta':   C1'(i-1)-P(i)-C1'(i)-P(i+1)
      theta': P(i)-C1'(i)-P(i+1)-C1'(i+1)

      eta":   Borg(i-1)-P(i)-Borg(i)-P(i+1)
      theta": P(i)-Borg(i)-P(i+1)-Borg(i+1)

              base      eta   theta    eta'  theta'    eta"  theta"
   1 A:...1_:[..G]G    ---   -126.6    ---   -141.5    ---   -130.4
   2 A:...2_:[..C]C   167.8  -168.3   174.6  -152.5  -151.4  -115.4
   3 A:...3_:[..G]G   160.4  -119.8  -171.9  -138.9  -123.6  -119.2
   4 A:...4_:[..G]G   148.0  -164.2   162.1  -159.2  -154.4  -124.6
   5 A:...5_:[..A]A   168.7  -137.6  -175.9  -137.8  -129.5  -115.0
   6 A:...6_:[..U]U   171.8  -145.7  -172.5  -140.5  -131.3  -124.7
   7 A:...7_:[..U]U  -151.0   -47.8  -136.0   -58.6  -117.7   -30.2
   8 A:...8_:[..U]U   160.9   159.7  -161.0  -163.6  -144.2   178.0
   9 A:...9_:[..A]A  -137.0   -48.6  -158.1  -108.9   161.5  -104.7
  10 A:..10_:[2MG]g    33.1  -135.8    93.4  -134.6   134.1  -113.0
  11 A:..11_:[..C]C   167.2  -138.3  -179.4  -137.7  -142.4  -118.7
  12 A:..12_:[..U]U   165.5  -120.7  -179.3  -128.0  -145.8  -106.7
  13 A:..13_:[..C]C   174.1  -173.6  -165.5   179.6  -120.9  -180.0
  14 A:..14_:[..A]A   173.0  -144.0   172.7  -132.4   177.6   -72.7
  15 A:..15_:[..G]G   154.7   110.6  -176.2    85.5   -97.7   -76.9
  16 A:..16_:[H2U]u    76.3    94.1    65.3   119.7  -152.8  -123.8
  17 A:..17_:[H2U]u   -36.7   -79.6   -50.7  -136.6  -142.7  -159.0
  18 A:..18_:[..G]G    -9.7  -166.8    41.7  -158.6    28.9  -120.4
  19 A:..19_:[..G]G  -131.6   -35.8  -122.9   -67.8  -104.3   -10.5
  20 A:..20_:[..G]G   160.9   -93.2  -161.6   -98.9  -174.1  -112.3
  21 A:..21_:[..A]A   -83.6   152.5   -72.8   155.7   -59.1   155.4
  22 A:..22_:[..G]G   164.1   169.4   160.0  -178.5   159.1  -157.6
  23 A:..23_:[..A]A   177.6  -148.5  -174.5  -142.7  -154.5  -114.3
  24 A:..24_:[..G]G   167.2   -98.9  -171.7  -128.6  -127.6   -99.1
  25 A:..25_:[..C]C   151.6  -153.5   167.3  -140.8  -137.7   -84.8
  26 A:..26_:[M2G]g   156.2  -137.4  -175.2  -135.2  -100.0  -104.2
  27 A:..27_:[..C]C   166.2  -145.5  -177.9  -140.4  -129.1  -116.8
  28 A:..28_:[..C]C   164.7  -140.5   175.8  -145.3  -152.7  -123.4
  29 A:..29_:[..A]A   161.2  -145.3   175.7  -144.9  -142.0  -126.0
  30 A:..30_:[..G]G  -173.5  -120.3  -158.4  -133.2  -126.6   -94.4
  31 A:..31_:[..A]A   169.8  -153.1   177.7  -140.4  -124.5   -81.5
  32 A:..32_:[OMC]c   154.4  -126.8  -178.7  -131.3  -104.1  -128.0
  33 A:..33_:[..U]U   170.0  -103.9  -179.9  -152.7  -164.6   143.6
  34 A:..34_:[OMG]g    -4.7  -123.7    41.8  -124.8    31.6   -99.6
  35 A:..35_:[..A]A   163.5  -104.3   176.9  -127.9  -137.5  -128.2
  36 A:..36_:[..A]A   175.9   173.6   180.0  -167.7  -156.4  -118.3
  37 A:..37_:[.YG]g   166.8  -131.7  -174.5  -133.0  -115.1   -82.9
  38 A:..38_:[..A]A   167.7  -121.6  -175.7  -114.3  -109.9   -79.9
  39 A:..39_:[PSU]P   168.3  -146.8  -160.2  -146.4   -98.6  -116.5
  40 A:..40_:[5MC]c   160.6  -138.7   174.0  -141.8  -139.7  -126.5
  41 A:..41_:[..U]U   164.8  -161.4   175.9  -152.3  -150.5  -117.6
  42 A:..42_:[..G]G   174.3  -140.9  -170.3  -145.4  -129.1  -121.3
  43 A:..43_:[..G]G   169.6  -159.0  -176.2  -154.9  -133.7  -133.1
  44 A:..44_:[..A]A   174.0  -121.5  -174.2  -122.0  -143.1   -74.9
  45 A:..45_:[..G]G   174.4  -132.5  -166.2  -128.1  -101.8  -128.9
  46 A:..46_:[7MG]g  -112.8  -113.4  -127.2  -138.3  -139.8  -152.1
  47 A:..47_:[..U]U   -63.2   -53.8    -1.1   -92.0    22.8  -124.7
  48 A:..48_:[..C]C   -84.7    59.6   -20.1     8.9    19.3  -104.5
  49 A:..49_:[5MC]c   -56.8  -140.1   -29.9  -143.6    98.1  -125.4
  50 A:..50_:[..U]U   173.6  -146.4  -178.3  -140.6  -147.6  -117.8
  51 A:..51_:[..G]G   160.8  -148.1  -178.6  -150.7  -140.7  -121.9
  52 A:..52_:[..U]U   164.9  -144.0   175.8  -143.5  -139.9  -114.3
  53 A:..53_:[..G]G   168.2  -140.9  -171.1  -144.0  -121.6  -117.3
  54 A:..54_:[5MU]u   167.0  -131.1   178.3  -124.9  -139.9   -77.0
  55 A:..55_:[PSU]P   167.6  -114.2  -172.8  -155.6  -113.0   146.0
  56 A:..56_:[..C]C    35.0  -121.5    52.6  -126.2    26.5   -83.8
  57 A:..57_:[..G]G   168.4  -148.1  -177.1  -131.1  -115.4  -111.7
  58 A:..58_:[1MA]a  -136.3  -133.3  -106.5  -176.7  -105.3   149.6
  59 A:..59_:[..U]U    23.0  -130.9    33.0  -115.4    48.2   -68.2
  60 A:..60_:[..C]C  -163.6   -54.3  -123.2   -76.4   -79.6   -36.4
  61 A:..61_:[..C]C   125.5  -153.3   169.7  -144.7  -153.8  -123.4
  62 A:..62_:[..A]A   172.5  -139.3  -177.0  -137.6  -150.7  -114.6
  63 A:..63_:[..C]C   165.8  -146.6  -178.5  -149.8  -139.2  -127.8
  64 A:..64_:[..A]A   164.7  -144.9   176.5  -145.8  -145.3  -118.1
  65 A:..65_:[..G]G   170.4  -152.3  -175.5  -151.5  -132.3  -122.1
  66 A:..66_:[..A]A   168.0  -152.0  -177.4  -150.2  -133.0  -118.7
  67 A:..67_:[..A]A   170.9  -141.8  -178.4  -140.4  -134.8  -123.1
  68 A:..68_:[..U]U   164.8  -135.1  -178.9  -137.9  -143.7   -95.2
  69 A:..69_:[..U]U   168.2  -154.9  -174.3  -157.1  -112.2  -144.8
  70 A:..70_:[..C]C   160.6  -153.2   170.7  -153.5  -164.4  -125.1
  71 A:..71_:[..G]G   161.8  -144.3   172.1  -143.1  -145.7  -124.2
  72 A:..72_:[..C]C   176.7  -136.4  -169.3  -134.5  -134.9   -87.1
  73 A:..73_:[..A]A   160.6  -142.8  -179.7  -139.7  -112.8  -104.4
  74 A:..74_:[..C]C  -176.9  -115.9  -163.1  -115.4  -117.2   -68.7
  75 A:..75_:[..C]C   169.8    80.9  -170.0    74.9  -108.5   -91.3
  76 A:..76_:[..A]A    ---     ---     ---     ---     ---     --- 



Definition of the chi (χ) torsion angle for pseudouridine

In nucleic acid structures, the chi (χ) torsion angle is about the glycosidic bond (N-C1′) that connects the sugar and the A/C/G/T/U bases (or their modified variants). Specifically, for pyrimidines (C, T and U), χ is defined by O4′-C1′-N1-C2; and for purines (A and G) by O4′-C1′-N9-C4 (see figure below).

Pseudouridine (5-ribosyluracil, PSU) was the first identified modified nucleoside in RNA and is the most abundant. PSU is unique in that it has a C-glycosidic bond (C-C1′) instead of the N-glycosidic bond common to all other nucleosides, canonical or modified. It thus poses a problem as to how to calculate the χ torsion angle: should it be O4′-C1′-C5-C4, reflecting the actual glycosidic bond connection, or should the conventional definition O4′-C1′-N1-C2 still be applied literally? As a concrete example, the figure below shows the (slightly) different numerical values (–162.7° vs. –163.9°), as given by the two definitions, for PSU 6 on chain A of the PDB entry 3cgp (based on the 2009 Biochemistry article by Lin & Kielkopf titled X-ray structures of U2 snRNA-branchpoint duplexes containing conserved pseudouridines).

Needless to say, the specific definition of the χ torsion angle for PSU in RNA structures is a very subtle point, and I am not aware of any discussion on this issue in the literature. In 3DNA, PSU is identified explicitly, and χ is defined by O4′-C1′-C5-C4. In NDB and a couple of other tools I am familiar with, χ for PSU is defined by O4′-C1′-N1-C2. Again using 3cgp (figure below) as an example, 3DNA gives –162.7°, whilst NDB gives –163.9°. Additionally, this distinction in N-C1′ vs. C-C1′ connection also comes into play when calculating the perpendicular distance from the 3′ phosphorus atom to the glycosidic bond, as per Richardson et al.



The chi (χ) torsion angle characterizes base/sugar relative orientation

Except for pseudouridine, a nucleoside in DNA/RNA contains an N-glycosidic bond that connects the base to the sugar. The chi (χ) torsion angle, which characterizes the relative base/sugar orientation, is defined by O4′-C1′-N1-C2 for pyrimidines (C, T and U), and O4′-C1′-N9-C4 for purines (A and G).

Normally (as in A- and B-form DNA/RNA duplexes), χ falls into the range of +90° to +180° or –90° to –180° (equivalently, 180° to 270°), corresponding to the anti conformation (Figure below, top). Occasionally, χ has values in the range of –90° to +90°, referring to the syn conformation (Figure below, bottom). Note that in left-handed Z-DNA with a CG repeating sequence, the purine G is in the syn conformation whilst the pyrimidine C is anti.
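The ranges above translate directly into a two-way classification. Here is a minimal Python sketch; the function name chi_class is made up, the treatment of the exact ±90° boundaries is my assumption, and the cutoffs actually used by 3DNA may differ.

```python
def chi_class(chi_deg):
    """Classify a chi torsion angle (degrees) as 'anti' or 'syn'."""
    chi = (chi_deg + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)
    # Assumption: the boundary values -90 and +90 are counted as syn.
    return "syn" if -90.0 <= chi <= 90.0 else "anti"

for chi in (-162.7, 200.0, 60.0, -30.0):
    print(chi, chi_class(chi))
```

The wrapping step means that a value reported on the 0° to 360° scale (e.g., 200°) is handled the same way as its equivalent on the –180° to +180° scale (–160°).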

Presumably, the χ-related anti / syn conformation is a simple geometric concept. Nevertheless, the N-glycosidic bond and the corresponding χ torsion angle illustrate that the base and the sugar are two separate entities, i.e. there is an internal degree of freedom between them. In this respect, it is worth noting that the Leontis-Westhof sugar edge for base-pair classification corresponds to the anti form (as applied to RNA) only. When a base is flipped over into the syn conformation, the “sugar edge”, defined in connection with the minor (shallow) groove side of a nitrogenous base, simply does not exist.

Base-flipping (anti / syn conformation switch) is one of the factors associated with the two possible relative orientations of the two bases in a pair, characterized explicitly in 3DNA as of type M+N or M–N since the 2003 NAR paper (Figure 2, linked below). I re-emphasized this distinction in our 2010 GpU dinucleotide platform paper (in particular, see supplementary Figure S2). Unfortunately, this subtle (but crucial, in my opinion) point has never been taken seriously (or at all) by the RNA community, even with 3DNA’s wide adoption. However, as people get to know 3DNA better and treat RNA base-pair classification more rigorously, I have no doubt that the simplicity of this explicit distinction and the resultant full quantification of each and every possible base pair using standard geometric parameters will gradually be appreciated.

As of 3DNA v2.1, the output of the χ torsion angle is also associated with its classification in anti / syn conformation, among other new features (see for example the output for 6tna).



Sugar pucker correlates with phosphorus-base distance

The sugar puckers in DNA/RNA structures are predominately in either C3′-endo (A-DNA or RNA) or C2′-endo (B-DNA; see Figure below, left), corresponding to the A- or B-form conformation in a duplex. In these two sugar conformations, the distance between neighboring phosphorus (P) atoms and the orientation of P relative to the sugar/bases are also dramatically different (figure below, right).


Recently, I carefully re-read some articles on RNA backbone conformation by Richardson et al., including the MolProbity paper.

I became intrigued by one of their observations: i.e., the correlation between the sugar pucker and a simple distance parameter:

C3′-endo and C2′-endo sugar puckers are highly correlated to the perpendicular distance between the C1′–N1/9 glycosidic bond vector and the following phosphate: > 2.9 Å for C3′-endo and < 2.9 Å for C2′-endo. (p.16 from the MolProbity paper).

Out of curiosity and for a better understanding of this correlation, I played around with some sample cases both visually and numerically. Overall, this involves a simple geometric calculation, i.e., the shortest distance from a point to a line in three-dimensional space. Given below is the Octave/Matlab script for calculating the distances for G175 and U176 of PDB entry 1jj2 (the large ribosomal subunit of Haloarcula marismortui):

function d = get_p3_nc_dist(P3, C1, N)
    C1_N = N - C1;               # vector from C1′ to N
    nv_C1_N = C1_N / norm(C1_N); # normalized vector
    C1_P3 = P3 - C1;             # vector from C1′ to P3
    proj = dot(C1_P3, nv_C1_N);  # length of C1_P3 along the glycosidic bond
    d = norm(C1_P3 - proj * nv_C1_N);  # length of the perpendicular component
end

## G175
P3 = [70.104 112.366  44.586];
C1 = [73.017 109.666  45.304];
N9 = [74.445 109.380  45.288];
d1 = get_p3_nc_dist(P3, C1, N9)  # 2.2 Å -- C2′-endo

## U176
P3 = [66.871 116.402  46.804];
C1 = [68.213 112.454  49.279];
N1 = [69.678 112.480  49.438];
d2 = get_p3_nc_dist(P3, C1, N1)  # 4.6 Å -- C3′-endo

The GpU dinucleotide used in the above example forms a platform (see figure below), where the sugar of G175 adopts a C2′-endo conformation, and that of U176 C3′-endo. Indeed, the distance for G175 is 2.2 Å (< 2.9 Å); whilst the value for U176 is 4.6 Å (> 2.9 Å).
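As a cross-check, the same distance can be obtained from a cross product instead of a projection, and the 2.9 Å cutoff quoted from the MolProbity paper can be applied directly. The Python sketch below is only an illustration; point_line_distance and pucker_from_dp are hypothetical helper names, and real structures will include borderline cases that a single cutoff does not settle.

```python
import math

def point_line_distance(p, a, b):
    """Shortest distance from point p to the line through points a and b."""
    ab = [b[k] - a[k] for k in range(3)]
    ap = [p[k] - a[k] for k in range(3)]
    cx = [ab[1] * ap[2] - ab[2] * ap[1],
          ab[2] * ap[0] - ab[0] * ap[2],
          ab[0] * ap[1] - ab[1] * ap[0]]
    # |ab x ap| is the area of the parallelogram; dividing by |ab| gives its height
    return math.sqrt(sum(c * c for c in cx)) / math.sqrt(sum(c * c for c in ab))

def pucker_from_dp(dp, cutoff=2.9):
    """Assign the sugar pucker from the P-to-glycosidic-bond distance (in angstroms)."""
    return "C3'-endo" if dp > cutoff else "C2'-endo"

# G175 and U176 of 1jj2: 3' P, C1' and N9/N1 coordinates as listed above
d1 = point_line_distance((70.104, 112.366, 44.586),
                         (73.017, 109.666, 45.304),
                         (74.445, 109.380, 45.288))
d2 = point_line_distance((66.871, 116.402, 46.804),
                         (68.213, 112.454, 49.279),
                         (69.678, 112.480, 49.438))
print(round(d1, 1), pucker_from_dp(d1))  # 2.2 C2'-endo
print(round(d2, 1), pucker_from_dp(d2))  # 4.6 C3'-endo
```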

Note that the Richardson et al. articles focus on the RNA backbone, without paying attention to the base (pair) geometry. The 3DNA Zp parameter, which is the mean z-coordinate of the two P atoms in the mean reference frame of a dinucleotide step (see figure below), has been readily adapted to single-stranded RNA structures. For example, the vertical distances of the 3′ P atoms to the G175 and U176 base planes are 1.9 Å and 4.4 Å, respectively. Since base planes and the P atoms are the two most accurately located entities in a given nucleic acid structure, the nucleotide-based Zp variant is presumably more robust and discriminative than the distance from P to the glycosidic bond.

definition of the Zp parameter

This new single-strand-based “Zp” parameter is available as of 3DNA v2.1.



GpU dinucleotide platform, the smallest unit with key RNA structural features

RNA has three salient structural features (compared to DNA): it contains the ribose (not deoxyribose) sugar, it has the uracil (not thymine) base, and it is normally single (not double)-stranded. The O2′(G)…O2P(U) H-bond stabilized GpU dinucleotide platform may turn out to be the smallest unit with all those RNA hallmarks.

First, it must have the guanosine ribose to have the 2′-hydroxyl group form the O2′(G)…O2P(U) H-bond.

Second, the methyl group in position 5 of thymine would cause steric clash with guanosine, thus disrupting the N2(G)…O4(U) base-base H-bond to form the GpU dinucleotide platform.

Third, a dinucleotide, by definition, is single-stranded. The two H-bonds, plus the covalent linkage, make the GpU platform extremely rigid (see Figure 1 of our 2010 NAR paper).

Moreover, the GpU platform is directional: swapping the two bases while keeping the sugar-phosphate backbone fixed does not allow for a base-base H-bond, thus no UpG dinucleotide platform.

It is worth noting that state-of-the-art quantum chemistry calculations have verified the importance of the O2′(G)…O2P(U) H-bond in stabilizing the GpU dinucleotide platform.



Least-squares fitting procedures with illustrated examples

The least-squares (LS) fitting procedures presented below make use of well known mathematics. Indeed, the methods are so well known and widely used that it is somewhat difficult to locate the original references. In our previous effort to resolve the discrepancies among nucleic acid conformational analysis programs, we came across a variety of LS fitting procedures. Here we provide a detailed description, with step-by-step examples, of our implementation in 3DNA of two LS fitting algorithms based on a covariance matrix and its eigen-system. This post is the revised version of a note first made available in the “Technical Details” section of earlier 3DNA websites.

LS fitting between standard and experimental bases

Three analysis schemes — CompDNA, Curves/Curves+, and RNA — use LS procedures to fit a standard base with an embedded reference frame to an observed base structure. CompDNA and Curves/Curves+ take advantage of the conventional approach of McLachlan [“Least Squares Fitting of Two Structures.” J. Mol. Biol., 128, 74-79 (1979)], while the RNA program implements a closed-form solution of absolute orientation using unit quaternions first introduced by Horn. The two algorithms are mathematically equivalent for the most general cases, since the unit quaternion can be transformed to the rotation matrix given by McLachlan. The Horn method, however, is more straightforward and generally applicable; it can be applied even when one or both of the structures are perfectly planar, whereas the McLachlan approach fails.

Here we use the ideal adenine geometry derived from the high resolution crystal structures of model nucleosides, nucleotides, and bases. The x-, y-, and z-coordinates of the standard base, taken from the NDB, are listed below in the columns labeled sx, sy, and sz, respectively. s_(average) is the geometric center of the base.

              sx      sy      sz   
  1  N9      0.213   0.660   1.287 
  2  C4      0.250   2.016   1.509 
  3  N3      0.016   2.995   0.619 
  4  C2      0.142   4.189   1.194 
  5  N1      0.451   4.493   2.459 
  6  C6      0.681   3.485   3.329 
  7  N6      0.990   3.787   4.592 
  8  C5      0.579   2.170   2.844 
  9  N7      0.747   0.934   3.454 
 10  C8      0.520   0.074   2.491 
s_(average): 0.4589  2.4803  2.3778 

We similarly describe the coordinates of one of the adenine bases (the fifth nucleotide in the sequence strand) from the high resolution (1.4 Å) self-complementary d(CGCGAATTCGCG) dodecamer duplex determined by Williams and co-workers (PDB id: 355d). The experimental xyz coordinates are listed below in the columns labeled ex, ey, and ez. The geometric center is e_(average). Note that the atomic serial numbers from the PDB (first column) have been rearranged so that the atoms are in the same order as those of the ideal base listed above.

              ex      ey      ez  
 91  N9     16.461  17.015  14.676 
100  C4     15.775  18.188  14.459
 99  N3     14.489  18.449  14.756
 98  C2     14.171  19.699  14.406
 97  N1     14.933  20.644  13.839
 95  C6     16.223  20.352  13.555
 96  N6     16.984  21.297  12.994
 94  C5     16.683  19.056  13.875
 93  N7     17.918  18.439  13.718
 92  C8     17.734  17.239  14.207
e_(average):16.1371 19.0378 14.0485

We collect the two sets of xyz coordinates in the 10 × 3 matrices S and E corresponding respectively to the standard and experimental bases. We then construct the 3 × 3 covariance matrix C between S and E using the following formula:

        1             1
 C = ------- [S' E - --- S' i i' E]
      n - 1           n
      0.2782    0.2139   -0.1601
     -1.4028    1.9619   -0.2744
      1.0443    0.9712   -0.6610

Here n, the number of atoms in each base, is 10, and i is an n x 1 column vector consisting of only ones. S' and i' are the transpose of matrix S and column vector i, respectively.

From the nine elements of the C matrix, we subsequently generate the 4 × 4 real symmetric matrix M using the expression:

     | c11+c22+c33     c23-c32       c31-c13        c12-c21    | 
 M = |   c23-c32     c11-c22-c33     c12+c21        c31+c13    | 
     |   c31-c13       c12+c21     -c11+c22-c33     c23+c32    | 
     |   c12-c21       c31+c13       c23+c32      -c11-c22+c33 | 
      1.5792   -1.2456    1.2044    1.6167
     -1.2456   -1.0228   -1.1890    0.8842
      1.2044   -1.1890    2.3447    0.6968
      1.6167    0.8842    0.6968   -2.9011

The largest eigenvalue of matrix M is 4.0335, and its corresponding unit eigenvector is:

 [ q0   q1    q2    q3 ] = [ 0.6135   -0.2878    0.7135    0.1780 ]

The rotation matrix R is deduced from the above eigenvector as below:

     | q0q0+q1q1-q2q2-q3q3    2(q1q2-q0q3)        2(q1q3+q0q2)     | 
 R = |    2(q2q1+q0q3)     q0q0-q1q1+q2q2-q3q3    2(q2q3-q0q1)     | 
     |    2(q3q1-q0q2)        2(q3q2+q0q1)     q0q0-q1q1-q2q2+q3q3 | 
     -0.0817   -0.6291    0.7730
     -0.1923    0.7710    0.6072
     -0.9779   -0.0990   -0.1839

Following coordinate transformation with matrix R, the origin of the standard base is found to be displaced from the experimental structure by:

 o = e_(average) - s_(average) R' = [15.8969 15.7701 15.1802]

The least-squares fitted coordinates (F) of the standard base atoms on the experimental structure are then given by:

 F = S R' + i o
     16.4592   17.0194   14.6699
     15.7747   18.1925   14.4586
     14.4899   18.4519   14.7542
     14.1729   19.6974   14.4070
     14.9343   20.6404   13.8420
     16.2222   20.3472   13.5569
     16.9832   21.2875   12.9925
     16.6829   19.0585   13.8760
     17.9183   18.4437   13.7219
     17.7335   17.2396   14.2062

Here S is the (n x 3) matrix of original coordinates of the standard base, and as noted above, i is an n x 1 column vector consisting of only ones.

The difference matrix (D) between F and E, the (n x 3) matrix of original coordinates of the experimental base, and the root-mean-square (RMS) deviation between the two structures are found as:

 D = E - F
      0.0018   -0.0044    0.0061
      0.0003   -0.0045    0.0004
     -0.0009   -0.0029    0.0018
     -0.0019    0.0016   -0.0010
     -0.0013    0.0036   -0.0030
      0.0008    0.0048   -0.0019
      0.0008    0.0095    0.0015
      0.0001   -0.0025   -0.0010
     -0.0003   -0.0047   -0.0039
      0.0005   -0.0006    0.0008

 RMS deviation = 0.0054

It should be noted that if the standard base is already defined in terms of its reference frame, as in 3DNA (e.g., $X3DNA/config/Atomic_A.pdb), the vector o and the matrix R represent the best-fitted coordinate frame of the experimental base. Moreover, the three axes of the frame given by R are guaranteed to be orthonormal. If you want to gain insight into the LS fitting algorithm and a better understanding of how 3DNA derives its base reference frame, it would be a valuable exercise to repeat the above procedure with $X3DNA/config/Atomic_A.pdb.
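For readers who would like to reproduce the numbers, the whole procedure fits in a short script. The following is a minimal Python sketch (standard library only), not the 3DNA source code. The function name ls_fit is hypothetical, and the dominant eigenvector of M is found with a simple shifted power iteration, a stand-in I chose to avoid external libraries; any symmetric eigen-solver would do.

```python
import math

# Standard (S) and experimental (E) adenine coordinates from the tables above
S = [(0.213, 0.660, 1.287), (0.250, 2.016, 1.509), (0.016, 2.995, 0.619),
     (0.142, 4.189, 1.194), (0.451, 4.493, 2.459), (0.681, 3.485, 3.329),
     (0.990, 3.787, 4.592), (0.579, 2.170, 2.844), (0.747, 0.934, 3.454),
     (0.520, 0.074, 2.491)]
E = [(16.461, 17.015, 14.676), (15.775, 18.188, 14.459), (14.489, 18.449, 14.756),
     (14.171, 19.699, 14.406), (14.933, 20.644, 13.839), (16.223, 20.352, 13.555),
     (16.984, 21.297, 12.994), (16.683, 19.056, 13.875), (17.918, 18.439, 13.718),
     (17.734, 17.239, 14.207)]

def ls_fit(S, E, n_iter=500):
    n = len(S)
    s0 = [sum(p[k] for p in S) / n for k in range(3)]
    e0 = [sum(p[k] for p in E) / n for k in range(3)]
    # C = [S'E - S' i i' E / n] / (n - 1), written as a sum over centered coordinates
    c = [[sum((s[a] - s0[a]) * (e[b] - e0[b]) for s, e in zip(S, E)) / (n - 1)
          for b in range(3)] for a in range(3)]
    M = [[c[0][0] + c[1][1] + c[2][2], c[1][2] - c[2][1], c[2][0] - c[0][2], c[0][1] - c[1][0]],
         [c[1][2] - c[2][1], c[0][0] - c[1][1] - c[2][2], c[0][1] + c[1][0], c[2][0] + c[0][2]],
         [c[2][0] - c[0][2], c[0][1] + c[1][0], -c[0][0] + c[1][1] - c[2][2], c[1][2] + c[2][1]],
         [c[0][1] - c[1][0], c[2][0] + c[0][2], c[1][2] + c[2][1], -c[0][0] - c[1][1] + c[2][2]]]
    # Power iteration on M + shift*I; the shift makes all eigenvalues non-negative,
    # so the iteration converges to the eigenvector of the largest eigenvalue of M.
    shift = max(sum(abs(v) for v in row) for row in M)
    q = [1.0, 1.0, 1.0, 1.0]
    for _ in range(n_iter):
        q = [sum(M[i][j] * q[j] for j in range(4)) + shift * q[i] for i in range(4)]
        length = math.sqrt(sum(v * v for v in q))
        q = [v / length for v in q]
    q0, q1, q2, q3 = q
    R = [[q0*q0 + q1*q1 - q2*q2 - q3*q3, 2*(q1*q2 - q0*q3), 2*(q1*q3 + q0*q2)],
         [2*(q2*q1 + q0*q3), q0*q0 - q1*q1 + q2*q2 - q3*q3, 2*(q2*q3 - q0*q1)],
         [2*(q3*q1 - q0*q2), 2*(q3*q2 + q0*q1), q0*q0 - q1*q1 - q2*q2 + q3*q3]]
    # o = e_avg - s_avg R'   and   F = S R' + i o
    o = [e0[a] - sum(R[a][b] * s0[b] for b in range(3)) for a in range(3)]
    F = [[sum(R[a][b] * s[b] for b in range(3)) + o[a] for a in range(3)] for s in S]
    rms = math.sqrt(sum((e[a] - f[a]) ** 2 for e, f in zip(E, F) for a in range(3)) / n)
    return R, o, rms

R, o, rms = ls_fit(S, E)
print([round(v, 4) for v in o], round(rms, 4))
```

The result is the same for q and -q, so the sign of the eigenvector does not matter. The power iteration assumes the starting vector is not orthogonal to the sought eigenvector, which holds for this example.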

Note: the algorithm does not apply to a molecule vs its inversion (an improper rotation) — thanks to Boris Averkiev for reporting this subtle point (see comments below). One possible remedy is to treat this edge case separately.

Base normal

Rather than fit a standard base to experimental coordinates, the CEHS, FREEHELIX, and NUPARM analyses perform a fitting of a LS plane to a set of atoms in order to define the base and base-pair normals. The covariance matrix based on the n x 3 matrix of experimental Cartesian coordinates E is diagonalized to find the vector normal to the best plane. Specifically, C is obtained using the above formula with S substituted by E. The normal vector then lies along the eigenvector that corresponds to the smallest eigenvalue. Note that the coefficient 1/(n-1) in the formula for calculating C has no effect on the direction of the eigenvectors but scales the magnitudes of the eigenvalues.

Using the above adenine base from the high resolution dodecamer duplex as an example, the covariance matrix C is:

 C =
     1.6680   -0.5015   -0.3253
    -0.5015    2.0670   -0.5840
    -0.3253   -0.5840    0.3061

The smallest eigenvalue of C, 8.26e-5, indicates that the base is almost perfectly planar. The corresponding unit eigenvector, i.e., the base normal, is:

 Base normal: 0.2737    0.3224    0.9062
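The base-normal calculation can be sketched in the same spirit. This minimal Python example (standard library only; base_normal is a hypothetical name) builds the covariance matrix of the experimental adenine coordinates and extracts the eigenvector of its smallest eigenvalue. To stay dependency-free, it runs a power iteration on trace(C)·I − C, which shares the eigenvectors of C but turns its smallest eigenvalue into the largest one; this is my substitution for a full diagonalization.

```python
import math

# Experimental adenine coordinates (355d), as listed earlier
E = [(16.461, 17.015, 14.676), (15.775, 18.188, 14.459), (14.489, 18.449, 14.756),
     (14.171, 19.699, 14.406), (14.933, 20.644, 13.839), (16.223, 20.352, 13.555),
     (16.984, 21.297, 12.994), (16.683, 19.056, 13.875), (17.918, 18.439, 13.718),
     (17.734, 17.239, 14.207)]

def base_normal(coords, n_iter=200):
    """Unit normal of the least-squares plane through a set of atoms."""
    n = len(coords)
    c0 = [sum(p[k] for p in coords) / n for k in range(3)]
    C = [[sum((p[a] - c0[a]) * (p[b] - c0[b]) for p in coords) / (n - 1)
          for b in range(3)] for a in range(3)]
    trace = C[0][0] + C[1][1] + C[2][2]
    # B = trace*I - C: same eigenvectors as C, eigenvalue order reversed
    B = [[(trace if a == b else 0.0) - C[a][b] for b in range(3)] for a in range(3)]
    v = [1.0, 1.0, 1.0]  # assumes the start vector is not perpendicular to the normal
    for _ in range(n_iter):
        v = [sum(B[a][b] * v[b] for b in range(3)) for a in range(3)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
    return v

normal = base_normal(E)
print([round(x, 4) for x in normal])
```

The direction of the normal is determined only up to sign; which of the two opposite vectors is reported depends on the starting vector.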



Seeing is understanding as well as believing

As the old saying goes, a picture is worth a thousand words. To help you have a better idea of what 3DNA/DSSR is about, we’ve collected the following pictures; they serve to demonstrate selected features from 3DNA/DSSR’s versatile functionality.

Cartoon-block schematic representations generated with DSSR and PyMOL

  • yeast phenylalanine tRNA (1ehz) with base blocks
  • yeast phenylalanine tRNA (1ehz) with WC base-pair blocks
  • 27-nt rRNA fragment with GUAA tetraloop (1msy), with the minor groove edge (black) of the C-G pair that closes the GUAA tetraloop facing the viewer
  • 27-nt rRNA fragment with GUAA tetraloop (1msy), with base blocks in outline

Schematic diagram of base-pair parameters

Schematic diagram of rigid body parameters

Influence of Slide and Roll on DNA helical conformation

Roll-introduced DNA bending

Global bending of DNA associated with selective B → A conformational transformation

Canonical fiber models of A-, B-, C- and Z-DNA

3DNA-generated view of a four-way DNA–RNA junction (1egk)

3DNA-detected pentaplets in the large ribosomal subunit (1jj2)

3DNA enabled the discovery of the O2′(G)−O2P(U) H-bond which stabilizes the GpU dinucleotide platform

Nucleic-acid-containing structures generated with w3DNA

Analysis of DNA with a B-Z junction (2acj, left) and detection of hydration patterns (right)

Schematic images auto-generated via blocview

  • 2f4u: complex of the bacterial ribosomal aminoacyl-tRNA site (A-site) with a designer antibiotic
  • 408d: drug recognition of A-T and T-A base pairs in the B-DNA minor groove
  • 9ant: complex of DNA with the Antennapedia homeodomain



Generating idealized A-form RNA structures of generic sequence

Over the years, the fiber utility program has become a handy way to generate standard B-DNA and A-DNA structures, as evident from citations to 3DNA. Nevertheless, the currently collected 55 experimental fiber models, comprehensive as they are, do not include one for canonical double-stranded (ds) RNA or single-stranded (ss) RNA structures of generic A/C/G/U sequence.

This situation is best illustrated by a recent article by Charles Brooks and Hashim Al-Hashimi and their co-workers, titled Unraveling the structural complexity in a single-stranded RNA tail: implications for efficient ligand binding in the prequeuosine riboswitch [Nucleic Acids Research, 40(3) 1345–1355 (2012)], where they wrote:

Idealized A-form structures were constructed using Insight II (Molecular Simulations, Inc.) correcting the propeller twist angles from +15° to –15° using an in-house program, as previously described (47). The complementary strand was removed and the resulting ssRNA used in NMR data analysis. B-form helices were constructed using W3DNA (48).

As of 3DNA v2.1, however, that’s no longer the case: now the fiber utility provides direct support for generating idealized dsRNA or ssRNA structures of arbitrary A/C/G/U sequence. As always, the new functionality can be best illustrated with examples. Let’s build ssRNAs of the wild-type (5’-AUAAAAAACUAA-3’) and A29C mutated form (5’-AUAACAAACUAA-3’) used in the work cited above:

fiber -r -s -seq=AUAAAAAACUAA wt-12nt.pdb
fiber -r -s -seq=AUAACAAACUAA mt-12nt.pdb

Here the -r option is for RNA, -s for a ss structure, and -seq for the specific base sequence. The generated ssRNA structure for the wild-type sequence is named wt-12nt.pdb, and that for the mutated sequence named mt-12nt.pdb.

Note that the new RNA model is based on Struther Arnott’s work of fiber A-DNA from calf thymus (#1 in the list). The dsRNA, as its dsDNA counterpart, has a helical twist of 32.7° and a helical rise of 2.548 Å. Relevant to the above citation, here the propeller twist angle of each base pair is –10.5°, a negative value similar to that observed in high-resolution x-ray crystal structures. Furthermore, you can easily verify the three numbers with the following commands:

fiber -r -seq=AUAAAAAACUAA wt-12nt.pdb
find_pair wt-12nt.pdb stdout | analyze stdin

In summary, it is very easy to generate canonical RNA structures with the revised fiber command. Through its integrated analysis routine, 3DNA can also be used to check structural features of the resultant RNA models. Moreover, as mentioned in the opening post What can 3DNA do for RNA structures? on the forum, 3DNA has much to offer in the field of RNA structural bioinformatics.



Does 3DNA work for RNA?

At the C2B2 party this afternoon, I was asked the question: “Does 3DNA work for RNA?” Well, a good question, indeed. The short answer is definitely, YES. However, a detailed explanation is needed to address the underlying intuitive assumption: 3DNA is only for DNA.

  • The name 3DNA was due to Dr. Olson, after we struggled quite a while. Initially, we played with NuStar (which was actually cited once by Richard Dickerson et al.), and Carnival etc. I still remember the day when Dr. Olson asked me “How about 3DNA?” We immediately reached an agreement: that’s it — what a cute name! Another advantage (as it becomes clear later): since 3DNA starts with ‘3’, it (mostly) shows up right at the top of many on-line lists of bioinformatics tools.
  • Interpreted literally, 3DNA could mean 3-DNA, i.e., the three most common types of DNA: A-, B- and Z-form. That may be one of the reasons behind the misconception that 3DNA is only for DNA. Another reason could be that structural work on DNA is what the Olson lab is best known for.
  • The number ‘3’ in 3DNA should also be associated with its three key components: analysis, rebuilding and visualization. In a sense, this is my favorite.
  • Of course, 3DNA stands for 3D-NA, 3-Dimensional Nucleic Acids, as expressed explicitly in the titles of our two 3DNA papers (2003 NAR and 2008 NP).

The applications of 3DNA to RNA structures can be broadly categorized as follows:

  • Automatically detect all existing base-pairs, Watson-Crick (A-U, G-C, wobble G-U) or non-canonical, using a set of simple geometric criteria. Furthermore, it has a unique base-pair classification system based on the six numerical structural parameters, suitable for database storage and search.
  • Automatically detect all triplets or higher-order base-associations.
  • Automatically detect double helical regions, regardless of backbone connection, thus ideal for finding pseudo-continuous coaxial stacking.
  • The above three features are seamlessly integrated with the visualization component to allow for easy generation of publication quality images. See the 3DNA 2008 NP paper for detailed examples.

As further examples, the following two RNA publications take advantage of find_pair from 3DNA:

It is well worth noting that the base-pair detecting algorithm in RNAView is based on an earlier version of find_pair, a basic fact ignored in the RNAView publication.

In summary, 3DNA works for RNA as well as for DNA, and more.



About 3DNA

3DNA is a versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid-containing structures, including their complexes with proteins and other ligands. At its core, the software uses a simple matrix-based scheme for calculating a complete set of rigid-body parameters that characterize the spatial relationship of the base pairs in DNA and RNA structures. In 3DNA, structural analysis and model rebuilding are the two sides of the same coin: the description of structure is geometrically straightforward and the computation of parameters is mathematically rigorous, allowing for exact rebuilding of a molecular structure based on the derived parameters. Other unique features of 3DNA include: (1) the automatic detection of all possible base pairs, higher-order base associations and helical fragments, which makes analyzing nucleic-acid structures straightforward; (2) a comprehensive collection of 55 fiber models of regular DNA and RNA helices in various polymorphic forms and stoichiometries; (3) generation of base rectangular block schematic presentations highly effective in revealing the key features of small- to medium-sized nucleic-acid structures; (4) in silico base mutation with the sugar-phosphate backbone untouched and the original base reference frame conserved, handy for theoretical studies of DNA-protein interactions; and (5) DSSR for a comprehensive characterization of RNA (secondary) structures from three-dimensional coordinates.

3DNA is written in the strict ANSI C computer language, with connecting Ruby scripts. The software package consists of over two dozen executable programs that can be run directly from the command line in a Unix/Linux-like environment (including Mac OS X, and MinGW/MSYS and Cygwin on Windows). The software has been designed following the Unix philosophy: “Write programs that do one thing and do it well. Write programs to work together.” Each 3DNA component program aims to solve a specific set of closely related problems practically: robust and efficient, getting its job done and then out of the way. Once the user is familiar with the package, it is straightforward to combine the various pieces and other command-line-driven (or capable) tools in a script to automate commonly repeated tasks. Moreover, w3DNA provides a convenient web-interface to commonly used functionality of 3DNA, making the tool accessible to bench scientists, novices, and non-Linux/Unix users, as well as useful for educational purposes.

3DNA originated in the SCHNAaP and SCHNArP complementary pair of programs to compute DNA rigid-body parameters and to reconstruct structures from those parameters. Our efforts to resolve the discrepancies among nucleic-acid conformational analysis programs and the consequent definition of the standard base reference frame by the structural biology community prompted us to take advantage of various features in the earlier programs and to adopt the standard frame in 3DNA. The 3DNA software was first made available on the Internet in late 1999, v1.5 was released in late 2002, and then v2.0 in 2008 to accompany the Nature Protocols paper. 3DNA is checked against all NDB entries before each major release to ensure that it works in real world applications. Over the years, we have taken each and every 3DNA-related question from a wide spectrum of user community as an opportunity to improve the functionality of the software. We strive to respond to users as quickly and concretely as possible, often with a step-by-step recipe, until an issue is resolved. This process has helped us to refine and improve the software and has prompted us to add new functionality.

Today, 3DNA has become a prominent structural bioinformatics tool, and is widely used in the scientific community. Overall, 3DNA has received nearly 900 citations in articles published in ~100 peer-reviewed biology, chemistry and materials science journals. Broadly speaking, 3DNA’s applications fall into the following three categories: (1) as a standard method for nucleic acid structural analysis; (2) integrated into various structural bioinformatics resources, including the NDB and PDB; (3) enabling new scientific discoveries: 3DNA’s rigorous and reversible engine for the analysis/rebuilding of nucleic acid structures and its efficient and robust implementation in a single software package allow for new hypotheses to be formulated and validated against a large number of structures.

3DNA was initially created by Dr. Xiang-Jun Lu from 1999 to 2002 while working as a post-doctoral research associate in Dr. Wilma Olson’s laboratory at Rutgers, the State University of New Jersey. Thereafter, Dr. Lu, in collaboration with Dr. Olson, continued to maintain and support 3DNA in his spare time as a sideline project. Since September 2011, 3DNA has been supported by an NIH R01 grant (GM096889), titled “Continued Development and Maintenance of the 3DNA Suite of Programs.” As PI of the funded project, Dr. Lu is now dedicated to pushing 3DNA to the next level to better serve the ever-changing needs of an ever-growing user community. We are working on a new release (currently in beta test) of 3DNA v2.1, which contains refinements/bug-fixes of v2.0, and significant new features.



Outside links

The following links point to tools that are relevant to 3DNA.

  • Curves+ — an updated version of the well-known Curves program, and it conforms to the standard base reference frame.
  • 3D-DART — 3DNA-Driven DNA Analysis and Rebuilding Tool. Another web-interface to commonly used 3DNA functionality.
  • do_x3dna — “do_x3dna has been developed for analysis of the DNA/RNA dynamics during the molecular dynamics simulations. It uses the 3DNA package to calculate several structural descriptors of DNA/RNA from the GROMACS MD trajectory. It executes 3DNA tools to calculate these descriptors and subsequently, extracts these output and saves into external output files as a function of time.”
  • SwS — a Solvation web Service for Nucleic Acids where 3DNA plays a role.
  • Raster3D — a set of tools for generating high-quality raster images of proteins or other molecules.
  • MolScript — a program for displaying molecular 3D structures, such as proteins, in both schematic and detailed representations.
  • Jmol — an open-source Java viewer for chemical structures in 3D with features for chemicals, crystals, materials, and biomolecules.
  • PyMOL — a user-sponsored molecular visualization system on an open-source foundation.
  • ImageMagick — a software suite to create, edit, compose, or convert bitmap images.
  • NDB — Nucleic acids database.
  • SBGrid — Excellent services for structural biology laboratories as well as software developers.




Over the years, 3DNA has benefited greatly from interactions with the NDB project led by Professor Helen Berman. Zukang Feng at the PDB/NDB, A. R. Srinivasan, Andrew Colasanti, Guofei Zheng, Mauricio Esguerra and other members of the Olson laboratory, Surjit Dixit (Wesleyan University), Pascal Auffinger (IBMC/CNRS, France), Ben Eisenbraun (SBGrid Consortium, Harvard Medical School), and numerous users have helped in making 3DNA a better tool to serve the scientific community.

The 3DNA software, its homepage and forum have been created and maintained by Xiang-Jun Lu. Unless specified explicitly otherwise, all posts at are written by Xiang-Jun Lu. The favicon and logo of the homepage and the forum were designed by Jessalyn Lu. The three 3DNA-related services — w3DNA, 3DNALandscapes and BPS — are supported by the Olson laboratory at Rutgers University.

The 3DNA project is supported by the NIH grant R01GM096889.



New features in 3DNA v2.1

The v2.1 release of 3DNA, currently in beta, contains many refinements of existing C programs, a complete migration from Perl scripts to Ruby, and additions of several significant new programs. All known bugs in v2.0 have been fixed. Highlights include:

  • Added mutate_bases to perform in silico base mutations in nucleic-acid-containing structures (DNA, RNA, and their complexes with ligands and proteins). The program has two key and unique features: (1) the sugar-phosphate backbone conformation is untouched; (2) the base reference frame (position and orientation) is preserved, i.e., the mutated structure shares the same base-pair/step parameters as those of the native structure.
  • Added x3dna_ensemble, a Ruby script to automate the processing of an NMR structure ensemble or MD trajectories in MODEL/ENDMDL delineated PDB format. It has sub-commands analyze, extract, reorient, and block_image. To be added: a convert sub-command to transform Amber, Gromacs or CHARMM trajectories.
  • Enhanced find_pair with -c+ option for generating input to Curves+.
  • Expanded fiber with the -s option for generating single stranded structures; the -seq option for specifying base sequence directly on the command line; and the -r option for generating RNA structures (single or double stranded) of arbitrary ACGU sequences.
  • Updated the ‘baselist.dat’ file to incorporate all types of NDB/PDB nucleotides as of February 15, 2015; refined find_pair/analyze/mutate_bases etc. to automatically detect and assign modified bases.
  • Renamed Atomic_a.pdb and Atomic.a.pdb etc for modified bases to account for Mac OS X filesystem case sensitivity issue; copied all Perl scripts to a new directory perl_scripts/.
  • 3DNA now generates PDB files that are compliant with PDB format v3.x, and also has an option to allow for three-letter nucleotide names, thus directly compatible with PdbViewer and HADDock. An option is provided to convert 3DNA-generated base rectangular blocks in Alchemy format to the more widely accepted MDL molfile format (e.g. by PyMOL).
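
The MODEL/ENDMDL layout handled by x3dna_ensemble is simple enough to illustrate. The Python sketch below splits the lines of such a file into per-model chunks; it only illustrates the file layout, is not the logic of the Ruby script, and split_models is a hypothetical name.

```python
def split_models(pdb_lines):
    """Group the lines of a multi-model PDB file by MODEL/ENDMDL record.

    Returns a list of (model_label, lines) tuples. Lines outside any
    MODEL/ENDMDL pair (headers, END, etc.) are ignored in this sketch.
    """
    models = []
    label, current = None, None
    for line in pdb_lines:
        record = line[:6].strip()
        if record == "MODEL":
            label = line[6:].strip()
            current = []
        elif record == "ENDMDL":
            if current is not None:
                models.append((label, current))
            label, current = None, None
        elif current is not None:
            current.append(line.rstrip("\n"))
    return models


if __name__ == "__main__":
    demo = [
        "HEADER    DEMO",
        "MODEL        1",
        "ATOM      1  P     A A   1       0.000   0.000   0.000",
        "ENDMDL",
        "MODEL        2",
        "ATOM      1  P     A A   1       1.000   0.000   0.000",
        "ENDMDL",
        "END",
    ]
    for label, lines in split_models(demo):
        print(label, len(lines))
```

Each chunk could then be written to its own file and passed to find_pair and analyze in turn.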



What's special about the GpU dinucleotide platform?

Recently, I (together with Drs. Wilma Olson and Harmen Bussemaker – a team with a unique combination of complementary expertise) published a new article in Nucleic Acids Research (NAR): The RNA backbone plays a crucial role in mediating the intrinsic stability of the GpU dinucleotide platform and the GpUpA/GpA mini duplex. The key findings of this work are summarized in the abstract:

The side-by-side interactions of nucleobases contribute to the organization of RNA, forming the planar building blocks of helices and mediating chain folding. Dinucleotide platforms, formed by side-by-side pairing of adjacent bases, frequently anchor helices against loops. Surprisingly, GpU steps account for over half of the dinucleotide platforms observed in RNA-containing structures. Why GpU should stand out from other dinucleotides in this respect is not clear from the single well-characterized H-bond found between the guanine N2 and the uracil O4 groups. Here, we describe how an RNA-specific H-bond between O2’(G) and O2P(U) adds to the stability of the GpU platform. Moreover, we show how this pair of oxygen atoms forms an out-of-plane backbone ‘edge’ that is specifically recognized by a non-adjacent guanine in over 90% of the cases, leading to the formation of an asymmetric miniduplex consisting of ‘complementary’ GpUpA and GpA subunits. Together, these five nucleotides constitute the conserved core of the well-known loop-E motif. The backbone-mediated intrinsic stabilities of the GpU dinucleotide platform and the GpUpA/GpA miniduplex plausibly underlie observed evolutionary constraints on base identity. We propose that they may also provide a reason for the extreme conservation of GpU observed at most 5’-splice sites.

As a nice surprise, this publication was selected by NAR as a featured article! According to the NAR website:

Featured Articles highlight the best papers published in NAR. These articles are chosen by the Executive Editors on the recommendation of Editorial Board Members and Referees. They represent the top 5% of papers in terms of originality, significance and scientific excellence.

I feel very gratified with the “extra” recognition. From my own perspective, I can easily rank this paper as the top one in my publication list: from the very beginning, I have been struck by the simplicity and elegance of the GpU story. Hopefully, time will verify the validity of this scientific contribution.

Under the hood, though, there is a long, complex (sometimes perplexing), yet interesting story associated with this work. Here is how it got started. While writing the 3DNA 2008 Nature Protocols (NP) paper, I selected the (previously undocumented) ‘-p’ option of find_pair to showcase its capability to identify higher-order base associations, using the large ribosomal subunit (1jj2) as an example. I noticed the unexpected O2’(G)⋅⋅⋅O2P H-bond within the GpU dinucleotide platform in a pentaplet (Figure A below). I was/am well aware of Leontis-Westhof’s pioneering work on Geometric nomenclature and classification of RNA base pairs which involves three distinct edges – the Watson-Crick edge, the Hoogsteen edge, and the Sugar edge, yet without taking into consideration possible sugar-phosphate backbone interactions (Figure B below). So I decided to double-check, just to be sure that the H-bond was not spurious due to defects in the H-bond detecting scheme of find_pair, and the finding was very surprising.

Backbone-edge in GpU platform

The following section was re-added into the 3DNA NP paper in the very last revision:

It is also worth noting that the G1971–U1972 platform is stabilized not only by the well-characterized G(N2)⋅⋅⋅U(O4) H-bond interaction, but also by a little-noticed G(O2’)⋅⋅⋅U(O2P) sugar-phosphate backbone interaction (Fig. 6a). Examination of the 50S large ribosomal unit (1JJ2) alone reveals ten such double H-bonded G–U platforms, far more occurrences than those registered by any other dinucleotide platform (including A–A) in this structure. Apparently, the G–U platform is more stable than other platforms with only a single base–base H-bond interaction. We are currently investigating this overrepresented G–U dinucleotide platform in other RNA structures. (p.1226)

See also Is the O2’(G)…O2P H-bond in GpU platforms real?



What find_pair in 3DNA can do

Structural analysis of nucleic acids used to be a rather tedious process, especially for irregular, complicated RNA structures and nucleic-acid/protein complexes [e.g., the large ribosomal subunit of H. marismortui (1jj2)]. Without valid base-pairing information arranged properly in a duplex fragment as input, analysis programs such as Curves+ and analyze/cehs in 3DNA would produce meaningless results. The program find_pair in 3DNA was originally created to solve this specific problem, i.e., to generate an input file to 3DNA analysis routines directly from a nucleic-acid containing structure in PDB format. It is what makes nucleic acids structural analysis a routine process — running through thousands of structures from NDB/PDB can be fully automated.

Overall, find_pair has more than fulfilled the goal of its initial design (as stated above). Over the past few years, its functionality has been expanded and continuously refined (kaizen 改善), making find_pair itself a full-featured application. Now, it is efficient, robust, and its simple command line interface allows for easy integration with other bioinformatics tools. Properly acknowledged or otherwise, find_pair has served (at least) as one of the key components in many other applications (RNAView, BPS, SwS, ARTS, to name just a few). Indeed, find_pair is by far the single program in 3DNA that has received the most questions (as evident from the 3DNA forum).

While I still have to write a method paper to describe the underlying algorithms of find_pair in detail (i.e., for identifying nucleotides, H-bonds, base pairs, higher-order base associations, and double helical regions), the basic idea is intuitive and very easy to understand: as summarized in our recent GpU paper, find_pair is purely geometry-based (with user-adjustable parameters) and allows for the identification of canonical Watson–Crick as well as non-canonical base pairs, made up of normal or modified bases, regardless of tautomeric or protonation state. For example, in the GpU paper, we chose the following set of stringent parameters to ensure that the geometry of each identified base pair is nearly planar and supports at least one inter-base H-bond: (i) a vertical distance (stagger) between base planes ≤ 1.5 Å; (ii) an angle between base normal vectors ≤ 30°; and (iii) a pair of nitrogen and/or oxygen base atoms at a distance ≤ 3.3 Å. Other criteria (documented or otherwise), such as the distance between the origins of the two standard base reference frames, are just filters to speed up the calculations.
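
The three criteria listed above can be sketched in a few lines of Python. This is a simplified illustration and not find_pair's actual implementation: the input layout (a dict with 'origin', 'normal' and 'no_atoms' per base), the function name is_candidate_pair, and the use of the mean base normal to measure the vertical separation are all assumptions made for the example.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _unit(v):
    norm = math.sqrt(_dot(v, v))
    return tuple(x / norm for x in v)


def is_candidate_pair(base1, base2, max_stagger=1.5, max_angle=30.0, max_hbond=3.3):
    """Apply the three geometric criteria quoted above to two bases.

    Each base is a dict with 'origin' (x, y, z), 'normal' (base-plane
    normal), and 'no_atoms' (coordinates of its N and O base atoms).
    This data layout is an assumption made for the sketch.
    """
    n1, n2 = _unit(base1["normal"]), _unit(base2["normal"])
    cos_angle = _dot(n1, n2)
    # Ignore the sense of the normals: use the acute angle between planes.
    angle = math.degrees(math.acos(min(1.0, abs(cos_angle))))
    if angle > max_angle:
        return False

    # Mean normal, with n2 flipped if needed so the two point the same way.
    sign = 1.0 if cos_angle >= 0 else -1.0
    mean_normal = _unit(tuple(a + sign * b for a, b in zip(n1, n2)))
    offset = tuple(b - a for a, b in zip(base1["origin"], base2["origin"]))
    stagger = abs(_dot(offset, mean_normal))
    if stagger > max_stagger:
        return False

    # At least one N/O ... N/O contact within the H-bond distance cut-off.
    return any(
        math.dist(p, q) <= max_hbond
        for p in base1["no_atoms"]
        for q in base2["no_atoms"]
    )


if __name__ == "__main__":
    a = {"origin": (0.0, 0.0, 0.0), "normal": (0, 0, 1), "no_atoms": [(1.0, 0.0, 0.0)]}
    b = {"origin": (0.0, 5.0, 0.3), "normal": (0, 0, -1), "no_atoms": [(1.0, 2.9, 0.3)]}
    print(is_candidate_pair(a, b))
    print(is_candidate_pair(a, b, max_hbond=2.5))
```

Keeping the cut-offs as keyword arguments mirrors the user-adjustable parameters mentioned above: tightening or relaxing one value changes which pairs are reported without touching the rest of the logic.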

In a nutshell, find_pair has the following two core functionalities:

  • The default is to generate input to the analysis routines in 3DNA (analyze/cehs) for double helices. However, there are many more jobs to perform under the hood than just identifying base pairs: the base pairs must be in proper sequential order, and each strand must be in 5’ to 3’ direction, for the calculated step parameters (twist, roll etc) to make sense. Moreover, with the “-c” option, one gets an input file to Curves (but not Curves+, yet); with the “-s” or “-1” option, find_pair treats the whole structure as one single strand, and is useful for getting all backbone torsion angles.
  • Detect all base pairs (regardless of double helical regions) and higher-order (3+) base associations with the “-p” option. This feature (in its preliminary form) was there starting from at least v1.5, which was released at the end of 2002 (just before I left Rutgers), but it was intentionally undocumented. The source code of find_pair (as part of 3DNA) was tested and shared within Rutgers (NDB and Dr. Olson’s laboratory) before any 3DNA paper was published, and served as the basis for several other projects. We also offered 3DNA (with source code) to a few RNA experts for comments; but we received either no responses or politely-worded negative ones. Things did not work out as (what I thought) they should have, but that’s life and I have learned my lessons. The “-p” option was first explicitly mentioned in the 3DNA 2008 Nature Protocols paper, to illustrate how to identify the two pentaplets in the large ribosomal subunit of H. marismortui (1jj2).

It is interesting to mention the two papers I’ve recently come across: the first is on DNA-protein interactions and the second on RNA base-pairing, where new algorithms were developed to detect base pairs and their performances were compared with find_pair. In each of the two cases, it was claimed that find_pair missed certain pairs where the new methods succeeded. As it turned out, however, in the first case, simply relaxing find_pair’s default H-bond distance cut-off from 4.0 Å to 4.5 Å, the value used by the authors, recovered virtually all the missing pairs. In the second case, the “-p” option, which should have been used, was simply not specified.

After nearly a decade of extensive real-world applications and refinements, it is safe to say that find_pair is now a versatile and practical tool for nucleic acids structure analysis. Of course, I will continue to support and further refine find_pair as I see fit. Once in a while, I cannot help but think that find_pair is to nucleic acids what DSSP is to proteins: simple and elegant. As more people become aware of its existence, I would expect find_pair to gain even more widespread usage, especially in RNA-structure related research areas.



Curves+ vs 3DNA

While browsing Nucleic Acids Research recently, I noticed the paper titled Conformational analysis of nucleic acids revisited: Curves+ by Dr. Lavery et al. I read it through carefully during the weekend and played around with the software. Overall, I was fairly impressed, and also happy to see that “It [Curves+] adopts the generally accepted reference frame for nucleic acid bases and no longer shows any significant difference with analysis programs such as 3DNA for intra- or inter-base pair parameters.”

Anyone who has ever worked on nucleic acid structures (especially DNA) should be familiar with Curves, an analysis program that has been widely used over the past twenty years. Only in recent years has 3DNA become popular. By and large, though, it is my opinion that 3DNA and Curves are constructive competitors in nucleic acid structure analysis with complementary functionality. As I put it six years ago, before the 13th Conversation at Albany: “Curves has special features that 3DNA does not want to repeat/compete (e.g. global parameters, groove dimension parameters). Nevertheless, we provide an option in a 3DNA utility program (find_pair) to generate input to Curves directly from a PDB data file” on June 6, 2003, and emphasized again on June 09, 2003: “We also see Curves unique in defining global parameters, bending analysis and groove dimensions.” 3DNA’s real strength, as demonstrated in our 2008 Nature Protocols paper, lies in its integrated approach that combines nucleic acid structure analysis, rebuilding, and visualization into a single software package (see image below).

3DNA v2 composite image

Now the nucleic acid structure community is blessed with the new Curves+, which “is algorithmically simpler and computationally much faster than the earlier Curves approach”, yet still provides its ‘hallmark’ curvilinear axis and “a full analysis of groove widths and depths”. When I read the text, I especially liked the INTRODUCTION section, which provides a nice summary of relevant background information on nucleic acid conformational analysis. An important feature of Curves+ is its integration of the analysis of molecular dynamics trajectories. In contrast, 3DNA lacks direct support in this area (even though I know of such applications from questions posted on the 3DNA forum), mostly due to the fact that I am not an ‘energetic’ person. Of special note is a policy-related advantage Curves+ has over 3DNA: Curves+ is distributed freely, and with source code available. On the other hand, due to Rutgers’ license constraints and various other (undocumented) reasons, 3DNA users are still having difficulty in accessing 3DNA v2.0 I compiled several months ago!

It is worth noting that the major differences in slide (+0.47 Å) and x-displacement (+0.77 Å) in Curves+ vs the old Curves (~0.5 Å and ~0.8 Å, respectively) are nearly exactly those uncovered a decade ago in Resolving the discrepancies among nucleic acid conformational analyses [Lu and Olson (1999), J. Mol. Biol., 285(4), 1563-75]:

Except for Curves, which defines the local frame in terms of the canonical B-DNA fiber structure (Leslie et al., 1980), the base origins are roughly coincident in the different schemes, but are significantly displaced (~0.8 Å along the positive x-axis) from the Curves reference. As illustrated below, this offset gives rise to systematic discrepancies of ~0.5 Å in slide and ~0.8 Å in global x-displacement in Curves compared with other programs, and also contributes to differences in rise at kinked steps. (p. 1566)

Please note that Curves+ has introduced new name list variables (most notably, lib=) and other subtle format changes, thus rendering the find_pair generated input files (with option ‘-c’) no longer valid. However, it would be easy to manually edit the input file to make it work for Curves+, since the most significant part (i.e., specifying paired nucleotides) does not change. Given time and upon user request, however, I would consider writing a new script to automate the process.

Overall, it is to the user community’s advantage to have both 3DNA and Curves+ or a choice between the two programs, and I am more than willing to build a bridge between them to make users’ lives easier.


