Systematic Pre-calculated Protein Structure Alignments



Proteins can have various degrees of similarity. If two proteins show high similarity in their amino acid sequences, it is generally assumed that they are closely related evolutionarily. With increasing evolutionary distance, the degree of sequence similarity usually drops. Even if the sequence similarity is low, proteins can still have similar functions and a similar overall 3D structure. Detecting such remote similarities is important for inferring functional and evolutionary relationships between protein families and is a core technique in structural bioinformatics. The goal is to establish regions of structural similarity between two or more molecules.

While protein sequence comparisons can be computed quickly, calculating protein structure alignments is much more time consuming. The RCSB PDB offers tools that allow users to quickly identify protein sequence neighbors and run pairwise protein structure comparisons. To help identify more distant 3D relationships, a pre-calculated set of 3D protein structure alignments is available through the 3D similarity tab.


Screenshot of a pairwise protein structure alignment. It has been calculated using the jCE algorithm, available through the Protein Comparison Tool at the RCSB web site.

Show Example (4HHB.A vs 4HHB.B jCE)


Domain-split Representatives


Sequence representatives

Representative protein domains are used because a true all vs. all comparison would require too much CPU time. The procedure for selecting the domain-split representatives is an extension of our protein-chain sequence clustering approach. To remove redundancy, we start with a 40% sequence identity clustering procedure. All sequences in a cluster are sorted, and the cluster is represented by the protein chain at rank #1. This is usually the chain with the highest resolution among those determined by X-ray crystallography.

Multiple domains

If the representative chain consists of multiple domains, each of those domains is included in database searches. If available, the domain assignment provided by SCOP 1.75 is used. Otherwise, algorithmic domain assignments are computed using the ProteinDomainParser software.

Example: Try the Cyclodextrin glycosyl transferase 3BMV

Chains that are grouped together in a 40% sequence identity cluster are ranked, and the cluster is represented by the protein chain at rank #1. If a PDB chain that is not the representative is accessed, the results for the representative chain are loaded automatically.

At present, the systematic comparisons contain about 1 billion pairwise alignments. The bulk of these have been calculated on the Open Science Grid. A technical report describing how these calculations were run is available from renci.org. At present, weekly updates for new structures are calculated on RCSB servers.




The 3D similarity tab shows the results of the systematic comparisons of the representative domains. The results can be sorted and filtered based on various scores.

Show Example (link to 4HHB - 3D similarity tab)

The screenshot above shows the summary results that are available for a database search. In order to obtain a detailed view of the results, click on the PDB ID in a row. Each column in this table can be sorted. The results can be filtered based on various criteria.

Meaning of the column labels:
  • Rank: current row position. Changes with different sorting orders and filter rules
  • Domain 2: Domain name of the 2nd domain. Can be either a SCOP ID (d<PDB ID><Chain ID><Domain ID>) or a ProteinDomainParser ID (PDP:<PDB ID><Chain ID><Domain ID>)
  • Title: Protein chain description
  • P-value: P-value of this alignment (FATCAT). The table is sorted by this column by default
  • Score: Raw alignment score (FATCAT)
  • RMSD: RMSD value of the alignment
  • Len1: Domain 1 length
  • Len2: Domain 2 length
  • %ID: % sequence identity in the alignment. A 40% sequence identity filter is applied before the structure alignments are calculated, so most results show low similarity. If the sequences are of vastly different lengths, the clustering procedure will group them in different clusters, even if they share a region of high sequence similarity.
  • %Cov1: Coverage, i.e. the percentage of residues in chain 1 that are aligned
  • %Cov2: Coverage, i.e. the percentage of residues in chain 2 that are aligned

The table is sorted by P-value by default. Clicking on a column header changes the sort order. Select the Filter Results icon (a lens) to apply other filtering criteria.
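The two identifier formats listed under Domain 2 can be split into their parts programmatically. The following Python sketch is only an illustration: it assumes a four-character PDB ID, a one-character chain ID, and that all remaining characters form the domain ID. The names DomainId and parse_domain_id are hypothetical and are not part of any RCSB PDB software.

```python
import re
from typing import NamedTuple, Optional


class DomainId(NamedTuple):
    source: str     # "SCOP" or "PDP"
    pdb_id: str     # four-character PDB ID, upper-cased
    chain_id: str   # chain identifier exactly as written in the ID
    domain_id: str  # domain label within the chain


# Assumption: 4-character PDB ID, 1-character chain ID, rest is the domain ID.
_SCOP = re.compile(r"^d([0-9][0-9a-z]{3})(.)(.+)$")
_PDP = re.compile(r"^PDP:([0-9][0-9A-Za-z]{3})(.)(.+)$")


def parse_domain_id(text: str) -> Optional[DomainId]:
    """Split a SCOP or PDP domain identifier; return None if neither matches."""
    for source, pattern in (("SCOP", _SCOP), ("PDP", _PDP)):
        match = pattern.match(text)
        if match:
            pdb_id, chain_id, domain_id = match.groups()
            return DomainId(source, pdb_id.upper(), chain_id, domain_id)
    return None


if __name__ == "__main__":
    print(parse_domain_id("d3bmva1"))
```

Returning None for unrecognized strings, rather than raising an error, makes it easy to skip rows whose identifiers follow a different convention.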


Interactive Structure Alignment Display


The pairwise view of a structure alignment links the sequence representation of the alignment to the 3D display in Jmol and can be used to investigate relationships between protein sequence and structure. Selecting a region in the sequence display at the bottom shows where it is located in the 3D Jmol display.


Full Download


A tab-separated file containing the results for all structural representatives is available for download via FTP. Note: The file has a compressed size of several hundred MB.
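Because the file is large, it is usually more practical to stream it than to load it into memory. The Python sketch below shows one way to do that. It assumes that the download is gzip-compressed, that the first line is a header row, and that one column is named "P-value"; none of these details are specified here, so adjust them to the actual file. The function name iter_significant is hypothetical.

```python
import csv
import gzip
from typing import Dict, Iterator


def iter_significant(path: str, p_column: str = "P-value",
                     max_p: float = 1e-3) -> Iterator[Dict[str, str]]:
    """Yield rows whose P-value is at or below max_p, one row at a time.

    Assumptions: gzip-compressed, tab-separated, header row present,
    and a column named p_column that holds the P-value.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            try:
                p_value = float(row[p_column])
            except (KeyError, TypeError, ValueError):
                continue  # skip rows without a parseable P-value
            if p_value <= max_p:
                yield row
```

Since the function is a generator, only one row is held in memory at a time, regardless of the size of the file.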



XML Download


The all vs. all structural similarity results table for a representative chain can be downloaded as XML. For example, the following URL returns Rank, PDB.Chain, Description, P-value, Score, RMSD, Len1, Len2, %Sim1, and %Sim2 for 3BMV.A:

/pdb/explorer/structCompXMLData.jsp?method=pw_fatcat&chain=d3bmva1&page=1&rows=10&prettyXML

Note: A maximum of 2000 rows can be returned through this URL. To fetch all of the approximately 17,000 results for a chain, you need to page through the results using the page parameter.
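Paging through the results amounts to requesting consecutive values of the page parameter until all rows are covered. The Python sketch below only builds the list of URLs and does not send any requests. It assumes that page numbering starts at 1, as in the example URL above. The host is left to the caller because only a relative path is given here; the function name page_urls and the placeholder host in the usage example are hypothetical.

```python
from math import ceil
from typing import List
from urllib.parse import urlencode

# Path and query parameter names are taken from the example URL above.
PATH = "/pdb/explorer/structCompXMLData.jsp"
MAX_ROWS = 2000  # documented maximum number of rows per request


def page_urls(base_url: str, chain: str, total_results: int,
              rows: int = MAX_ROWS) -> List[str]:
    """Build one URL per page needed to cover total_results rows.

    Assumption: pages are numbered starting at 1.
    """
    if not 1 <= rows <= MAX_ROWS:
        raise ValueError(f"rows must be between 1 and {MAX_ROWS}")
    pages = ceil(total_results / rows)
    urls = []
    for page in range(1, pages + 1):
        query = urlencode({"method": "pw_fatcat", "chain": chain,
                           "page": page, "rows": rows})
        urls.append(f"{base_url}{PATH}?{query}&prettyXML")
    return urls


if __name__ == "__main__":
    # Placeholder host; 17,000 results at 2000 rows per page need 9 requests.
    for url in page_urls("https://example.org", "d3bmva1", 17000):
        print(url)
```

If the total number of results is not known in advance, an alternative is to keep requesting the next page until a response contains fewer rows than requested.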




The all vs. all comparisons are based on jFATCAT, a Java port of the original FATCAT algorithm. See:

Yuzhen Ye & Adam Godzik (2003)
Flexible structure alignment by chaining aligned fragment pairs allowing twists.
Bioinformatics 19 (Suppl. 2): ii246-ii255

The RCSB PDB Protein Comparison Tool also provides jCE, a Java port of the CE algorithm. See:

I.N. Shindyalov, P.E. Bourne (1998)
Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. 
Protein Eng 11: 739-747

Both the jCE and jFATCAT Java versions of the algorithms have been implemented by Andreas Prlić. The CE variation with support for Circular Permutations is based on this Java version and has been implemented by Spencer Bliven.

jCE and jFATCAT are based on CE, FATCAT, BioJava, and Jmol.

We also provide the jCE and jFATCAT source code that is used for the RCSB PDB Protein Comparison Tool for local execution free of charge. If you are interested in obtaining a copy, please contact us.


Expert Mode for RCSB PDB Protein Comparison Tool


Want more control when using structure alignment algorithms? Would you like to better understand how the algorithms work by trying different parameter sets? The new jCE/jFATCAT user interface supports manipulation of low-level alignment parameters. This option is for experts who have a basic understanding of the alignment algorithms.

An example is the maximum gap size parameter G, which is used during the extension of Aligned Fragment Pairs in the CE algorithm. By default the parameter is set to 30, a trade-off between performance and result accuracy. For the protein pair 1CDG.A and 1TIM.A, the default parameters fail to identify the whole TIM barrel that the two chains have in common. Removing the restriction on G (by setting it to 0) increases the calculation time but gives an alignment that is 25 residues longer.

Tip: To change parameters, launch the Align custom files menu and click the Parameters button.


Improvements in Release 1005


Structure Alignment Service

In the previous version of the RCSB PDB web site, users had to launch a standalone Java Web Start application in order to calculate custom protein alignments.

In this release we introduce a new Structure Alignment Service. Any alignment of protein chains can now be calculated and displayed as part of the RCSB PDB web site. Results are cached, so other users who want to view the same alignment can access the pre-calculated result quickly.


Support for Circular Permutations

Circular Permutations (CP) in proteins are difficult for alignment algorithms to detect, because the algorithms depend on the sequence order. These permutations are rearrangements of the sequence that result in a different amino acid connectivity while conserving the overall 3D shape of the protein. The Java version of CE has introduced support for the detection of such permutations. This is achieved by working with an internal representation of the proteins that allows alignments across the N and C termini to be identified. Since this makes the calculation about three times slower, the detection of CP is optional.
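The idea of matching across the N and C termini can be illustrated on plain sequences. The Python sketch below is a toy analogy and not the jCE implementation: it works on strings, requires exact matches and equal lengths, and assumes that duplicating one sequence is an acceptable way to make a permuted copy appear as one contiguous stretch. Whether the internal representation mentioned above works this way is not stated here. The function name find_circular_shift is hypothetical.

```python
def find_circular_shift(seq_a: str, seq_b: str) -> int:
    """Return k such that seq_b == seq_a[k:] + seq_a[:k], or -1 if none exists.

    Toy model: exact string matching only, no structural information.
    """
    if len(seq_a) != len(seq_b):
        return -1
    # Every rotation of seq_a is a contiguous substring of seq_a + seq_a.
    return (seq_a + seq_a).find(seq_b)


if __name__ == "__main__":
    # The second sequence starts at position 3 of the first and wraps around.
    print(find_circular_shift("ABCDEFG", "DEFGABC"))  # prints 3
```

A real structure alignment has to tolerate gaps, substitutions, and unequal lengths, which is part of why detecting permutations adds to the calculation time.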

Example: Circular permutation of two nucleotide binding proteins:

View Example
[further reading]

Source Code download


The source code of the RCSB Protein Comparison Tool is available for download and local installation from http://source.rcsb.org/jfatcatserver/download.jsp. It is released as open source under the LGPL license via the BioJava project and is hosted on GitHub.




RCSB PDB Comparison Tool Reference
Andreas Prlic; Spencer Bliven; Peter W. Rose; Wolfgang F. Bluhm; Chris Bizon; Adam Godzik; Philip E. Bourne (2010)
Pre-calculated protein structure alignments at the RCSB PDB website
Bioinformatics 26: 2983-2985