DSSR-derived secondary structure in .ct format

From early on, DSSR-derived nucleic acid secondary structures have been written in the compact dot-bracket notation (.dbn) with pseudo-knot information. To better connect DSSR to the 2D world, I recently looked into the connect (.ct) format, which was first introduced by Zuker’s mfold program. Over time, the .ct format has become one of the most commonly used RNA secondary structure formats, and it is more expressive than the .dbn format (see below).

As of v1.0, for each analyzed structure, DSSR produces two secondary structure files with default names dssr-2ndstrs.dbn and dssr-2ndstrs.ct, in .dbn and .ct formats, respectively. Using the 27-nucleotide (nt) RNA fragment 1msy as an example, the DSSR-derived secondary structures in .dbn and .ct formats are shown below:

1msy [GUAA tetra loop] in 3d and 2d representations

In dot-bracket notation (.dbn) [dssr-2ndstrs.dbn]
------------------------------------------------------
>1msy nts=27 DSSR-derived secondary structure
UGCUCCUAGUACGUAAGGACCGGAGUG
.(((((.....(....)....))))).
------------------------------------------------------

In connect format (.ct) [dssr-2ndstrs.ct]
------------------------------------------------------
   27 DSSR-derived secondary structure in '1msy'
    1 U     0     2     0  2647 # name=A.U2647
    2 G     1     3    26  2648 # name=A.G2648, pairedNt=A.U2672
    3 C     2     4    25  2649 # name=A.C2649, pairedNt=A.G2671
    4 U     3     5    24  2650 # name=A.U2650, pairedNt=A.A2670
    5 C     4     6    23  2651 # name=A.C2651, pairedNt=A.G2669
    6 C     5     7    22  2652 # name=A.C2652, pairedNt=A.G2668
    7 U     6     8     0  2653 # name=A.U2653
    8 A     7     9     0  2654 # name=A.A2654
    9 G     8    10     0  2655 # name=A.G2655
   10 U     9    11     0  2656 # name=A.U2656
   11 A    10    12     0  2657 # name=A.A2657
   12 C    11    13    17  2658 # name=A.C2658, pairedNt=A.G2663
   13 G    12    14     0  2659 # name=A.G2659
   14 U    13    15     0  2660 # name=A.U2660
   15 A    14    16     0  2661 # name=A.A2661
   16 A    15    17     0  2662 # name=A.A2662
   17 G    16    18    12  2663 # name=A.G2663, pairedNt=A.C2658
   18 G    17    19     0  2664 # name=A.G2664
   19 A    18    20     0  2665 # name=A.A2665
   20 C    19    21     0  2666 # name=A.C2666
   21 C    20    22     0  2667 # name=A.C2667
   22 G    21    23     6  2668 # name=A.G2668, pairedNt=A.C2652
   23 G    22    24     5  2669 # name=A.G2669, pairedNt=A.C2651
   24 A    23    25     4  2670 # name=A.A2670, pairedNt=A.U2650
   25 G    24    26     3  2671 # name=A.G2671, pairedNt=A.C2649
   26 U    25    27     2  2672 # name=A.U2672, pairedNt=A.G2648
   27 G    26     0     0  2673 # name=A.G2673
------------------------------------------------------
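
The two listings above encode the same pairing information, which can be checked with a few lines of Python. The following is a minimal sketch, not part of DSSR: the function name ct_to_dbn is hypothetical, and it assumes a pseudoknot-free structure (so that plain parentheses suffice) and that everything after '#' on a record can be ignored.

```python
CT_1MSY = """\
   27 DSSR-derived secondary structure in '1msy'
    1 U     0     2     0  2647 # name=A.U2647
    2 G     1     3    26  2648 # name=A.G2648, pairedNt=A.U2672
    3 C     2     4    25  2649 # name=A.C2649, pairedNt=A.G2671
    4 U     3     5    24  2650 # name=A.U2650, pairedNt=A.A2670
    5 C     4     6    23  2651 # name=A.C2651, pairedNt=A.G2669
    6 C     5     7    22  2652 # name=A.C2652, pairedNt=A.G2668
    7 U     6     8     0  2653 # name=A.U2653
    8 A     7     9     0  2654 # name=A.A2654
    9 G     8    10     0  2655 # name=A.G2655
   10 U     9    11     0  2656 # name=A.U2656
   11 A    10    12     0  2657 # name=A.A2657
   12 C    11    13    17  2658 # name=A.C2658, pairedNt=A.G2663
   13 G    12    14     0  2659 # name=A.G2659
   14 U    13    15     0  2660 # name=A.U2660
   15 A    14    16     0  2661 # name=A.A2661
   16 A    15    17     0  2662 # name=A.A2662
   17 G    16    18    12  2663 # name=A.G2663, pairedNt=A.C2658
   18 G    17    19     0  2664 # name=A.G2664
   19 A    18    20     0  2665 # name=A.A2665
   20 C    19    21     0  2666 # name=A.C2666
   21 C    20    22     0  2667 # name=A.C2667
   22 G    21    23     6  2668 # name=A.G2668, pairedNt=A.C2652
   23 G    22    24     5  2669 # name=A.G2669, pairedNt=A.C2651
   24 A    23    25     4  2670 # name=A.A2670, pairedNt=A.U2650
   25 G    24    26     3  2671 # name=A.G2671, pairedNt=A.C2649
   26 U    25    27     2  2672 # name=A.U2672, pairedNt=A.G2648
   27 G    26     0     0  2673 # name=A.G2673
"""


def ct_to_dbn(ct_text):
    """Return (sequence, dot-bracket) for .ct text; assumes no pseudoknots."""
    lines = [ln for ln in ct_text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])          # first field of the header: nt count
    seq, dbn = [], []
    for ln in lines[1:n + 1]:
        fields = ln.split("#", 1)[0].split()   # drop the trailing comment
        idx, base, pair = int(fields[0]), fields[1], int(fields[4])
        seq.append(base)
        if pair == 0:
            dbn.append(".")               # unpaired
        elif pair > idx:
            dbn.append("(")               # partner lies downstream
        else:
            dbn.append(")")               # partner lies upstream
    return "".join(seq), "".join(dbn)


sequence, brackets = ct_to_dbn(CT_1MSY)
print(sequence)
print(brackets)
```

For the 1msy records, the two printed lines reproduce the sequence and dot-bracket lines of the .dbn listing.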

On the surface, the .ct format is very simple, and examining a sample file such as the one above gives a fairly good sense of what each column means. While there are many oversimplified descriptions of the .ct format on the web, the most detailed and accurate explanation comes from the mfold manual:

The “ct” file (connect table) contains the sequence and base pair information, and is meant to be an input file for a structure drawing program. In addition to containing base pair information, it also lists the 5′ and 3′ neighbor of each base, allowing for the representation of circular RNA or multiple molecules. The ct file also lists the historical base numbering in the original sequence, as bases and base pairs are numbered according from 1 to the size of the folded segment. A portion of a ct file is displayed in Figure 12.

Figure 12: The ct file for the second and final folding of S. cerevisiae Phe-tRNA at 37°, with default parameters. The first record displays the fragment size (76), ΔG and sequence name. The i-th subsequent record contains, in order, i, r_i, the index of the 5′-connecting base, the index of the 3′-connecting base, the index of the paired base and the historical numbering of the i-th base in the original sequence. The 5′, 3′ and base pair indices are 0 when there is no connection or base pair.

In particular, the 3rd, 4th, and 6th columns in the .ct format each carry their own information; by design, they are not redundant with the information contained in the 1st column. Note that in the ‘1msy’ example above, the 6th column gives the nt sequence numbers (as in the PDB data file) instead of the serial numbers (as in the 1st column). The DSSR-produced .ct files also contain extra information after ‘#’, in comma-separated key=value format.
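
To make the column layout concrete, here is a minimal Python sketch that splits one record into its six columns plus the key=value items after '#'. The names CtRecord and parse_ct_record are hypothetical and not part of DSSR or mfold. The sketch assumes whitespace-separated columns and an integer in the 6th column, as in the examples shown here.

```python
from typing import NamedTuple


class CtRecord(NamedTuple):
    index: int        # 1st column: serial number, 1..N
    base: str         # 2nd column: base letter
    prev_index: int   # 3rd column: 5' neighbor, 0 if none
    next_index: int   # 4th column: 3' neighbor, 0 if none
    pair_index: int   # 5th column: paired base, 0 if unpaired
    seq_number: int   # 6th column: numbering in the original (PDB) file
    extras: dict      # key=value items found after '#'


def parse_ct_record(line):
    """Parse a single .ct record line (not the header line)."""
    body, _, comment = line.partition("#")
    f = body.split()
    extras = {}
    for item in comment.split(","):
        key, sep, value = item.strip().partition("=")
        if sep:                       # skip empty or malformed items
            extras[key] = value
    return CtRecord(int(f[0]), f[1], int(f[2]), int(f[3]),
                    int(f[4]), int(f[5]), extras)


rec = parse_ct_record(
    "    2 G     1     3    26  2648 # name=A.G2648, pairedNt=A.U2672")
print(rec.index, rec.seq_number, rec.extras["pairedNt"])
```

Keeping the serial number (index) and the original numbering (seq_number) as separate fields mirrors the point made above: the 1st column always runs from 1 to N, while the 6th follows the source file.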

As an example of the usefulness of the 3rd and 4th columns, have a look at the DSSR-derived .ct file for the Dickerson DNA dodecamer duplex with sequence CGCGAATTCGCG:

   24 DSSR-derived secondary structure in '355d'
    1 C     0     2    24     1 # name=A.DC1, pairedNt=B.DG24
    2 G     1     3    23     2 # name=A.DG2, pairedNt=B.DC23
    3 C     2     4    22     3 # name=A.DC3, pairedNt=B.DG22
    4 G     3     5    21     4 # name=A.DG4, pairedNt=B.DC21
    5 A     4     6    20     5 # name=A.DA5, pairedNt=B.DT20
    6 A     5     7    19     6 # name=A.DA6, pairedNt=B.DT19
    7 T     6     8    18     7 # name=A.DT7, pairedNt=B.DA18
    8 T     7     9    17     8 # name=A.DT8, pairedNt=B.DA17
    9 C     8    10    16     9 # name=A.DC9, pairedNt=B.DG16
   10 G     9    11    15    10 # name=A.DG10, pairedNt=B.DC15
   11 C    10    12    14    11 # name=A.DC11, pairedNt=B.DG14
   12 G    11     0    13    12 # name=A.DG12, pairedNt=B.DC13
   13 C     0    14    12    13 # name=B.DC13, pairedNt=A.DG12
   14 G    13    15    11    14 # name=B.DG14, pairedNt=A.DC11
   15 C    14    16    10    15 # name=B.DC15, pairedNt=A.DG10
   16 G    15    17     9    16 # name=B.DG16, pairedNt=A.DC9
   17 A    16    18     8    17 # name=B.DA17, pairedNt=A.DT8
   18 A    17    19     7    18 # name=B.DA18, pairedNt=A.DT7
   19 T    18    20     6    19 # name=B.DT19, pairedNt=A.DA6
   20 T    19    21     5    20 # name=B.DT20, pairedNt=A.DA5
   21 C    20    22     4    21 # name=B.DC21, pairedNt=A.DG4
   22 G    21    23     3    22 # name=B.DG22, pairedNt=A.DC3
   23 C    22    24     2    23 # name=B.DC23, pairedNt=A.DG2
   24 G    23     0     1    24 # name=B.DG24, pairedNt=A.DC1

Note the 0 in the 4th column for A.DG12, which is at the 3′ end of chain A, and the 0 in the 3rd column for B.DC13, which is at the 5′ end of chain B.
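
The zeros in the 3rd and 4th columns are enough to recover strand boundaries without looking at chain names. The following Python sketch illustrates the idea; the function name split_chains is hypothetical, and it assumes linear strands listed in 5′ to 3′ order (a circular molecule has no zeros and would come back as a single segment).

```python
def split_chains(records):
    """Group serial numbers into strands.

    records: iterable of (index, prev_index, next_index) tuples, i.e. the
    1st, 3rd, and 4th columns of each .ct record, in file order.
    """
    chains, current = [], []
    for index, prev_index, next_index in records:
        if prev_index == 0 and current:
            chains.append(current)    # a new 5' end closes any open strand
            current = []
        current.append(index)
        if next_index == 0:
            chains.append(current)    # a 3' end closes the current strand
            current = []
    if current:
        chains.append(current)        # strand with no explicit 3' end
    return chains


# Excerpt: rows 11-14 of the 355d file above (1st, 3rd, and 4th columns only).
rows = [(11, 10, 12), (12, 11, 0), (13, 0, 14), (14, 13, 15)]
print(split_chains(rows))
```

In this excerpt, the break falls between rows 12 and 13, matching the chain A / chain B boundary noted above.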

Thank you for printing this article from http://home.x3dna.org/. Please do not forget to visit back for more 3DNA-related information. — Xiang-Jun Lu