Systematic Pre-calculated Protein Structure Alignments

 
Documentation
 
Overview
 

Introduction

Proteins can show varying degrees of similarity. If two proteins have highly similar amino acid sequences, it is generally assumed that they are closely related evolutionarily. With increasing evolutionary distance, the degree of sequence similarity usually drops. Even when sequence similarity is low, proteins can still have similar functions and a similar overall 3D structure. Detecting such remote similarities is important for inferring functional and evolutionary relationships between protein families, and it is a core technique in structural bioinformatics. The goal is to establish regions of structural similarity between two or more molecules.

While protein sequence comparisons can be computed quickly, calculating protein structure alignments is much more time-consuming. The RCSB PDB offers tools that allow users to quickly identify protein sequence neighbors and run pairwise protein structure comparisons. To help identify more distant 3D relationships, a pre-calculated set of 3D protein structure alignments is available through the 3D similarity tab.

 

Screenshot of a pairwise protein structure alignment. It has been calculated using the jCE algorithm, available through the Protein Comparison Tool at the RCSB web site.

Show Example (4HHB.A vs 4HHB.B jCE)

 

Domain-split Representatives

 

Sequence representatives

Representative protein domains are used because a true all vs. all comparison would require too much CPU time. The procedure for selecting the domain-split representatives extends our protein-chain sequence clustering approach. To remove redundancy, we start by clustering at 40% sequence identity. The sequences in each cluster are ranked, and the cluster is represented by the protein chain at rank #1. This is usually the chain with the highest resolution, determined by X-ray crystallography.

Multiple domains

If the representative chain consists of multiple domains, each of those domains is included in the database searches. If available, the domain assignment provided by SCOP 1.75 is used. Otherwise, domain assignments are computed algorithmically using the ProteinDomainParser software.

Example: Try the Cyclodextrin glycosyl transferase 3BMV

Chains grouped into a cluster at 40% sequence identity are ranked, and the cluster is represented by the chain at rank #1 (see Sequence representatives above). If a PDB chain that is not the representative is accessed, the results for its representative chain are loaded automatically.
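The lookup from any chain to its cluster representative can be sketched in a few lines of Python. This is only an illustration: the exact ranking criteria are not spelled out here, so the sort key below (X-ray structures first, then best resolution) is an assumption, and the chain IDs in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chain:
    chain_id: str                 # e.g. "4HHB.A"
    method: str                   # experimental method, e.g. "X-RAY" or "NMR"
    resolution: Optional[float]   # in angstroms; None when not applicable


def rank_key(chain):
    """Assumed ranking: X-ray first, then smaller resolution value, then ID."""
    not_xray = chain.method != "X-RAY"
    res = chain.resolution if chain.resolution is not None else float("inf")
    return (not_xray, res, chain.chain_id)


def pick_representative(cluster):
    """Return the rank #1 chain of one sequence cluster."""
    return min(cluster, key=rank_key)


def representative_lookup(clusters):
    """Map every chain ID to the ID of its cluster representative."""
    lookup = {}
    for cluster in clusters:
        rep = pick_representative(cluster)
        for chain in cluster:
            lookup[chain.chain_id] = rep.chain_id
    return lookup
```

With a hypothetical cluster of `Chain("1ABC.A", "X-RAY", 2.5)`, `Chain("2XYZ.A", "X-RAY", 1.8)` and `Chain("3NMR.A", "NMR", None)`, every chain maps to `2XYZ.A`, so a request for any of the three would load the results stored for that representative.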

At present, the systematic comparisons comprise about 1 billion pairwise alignments. The bulk of these were calculated on the Open Science Grid. A technical report describing how these calculations were run is available from renci.org. Weekly updates for new structures are currently calculated on RCSB servers.



 

Results

 

The 3D similarity tab shows the results of the systematic comparisons of the representative domains. The results can be sorted and filtered based on various scores.

Show Example (link to 4HHB - 3D similarity tab)

The screenshot above shows the summary results that are available for a database search. In order to obtain a detailed view of the results, click on the PDB ID in a row. Each column in this table can be sorted. The results can be filtered based on various criteria.

Meaning of the column labels:

The table is sorted by P-value by default. Clicking a column header changes the sort order. Select the Filter Results icon (a magnifying lens) to apply other filtering criteria.


 

Interactive Structure Alignment Display

 

The pairwise view of a structure alignment links the sequence representation of the alignment to the 3D display in Jmol, so that sequence and structure relationships can be investigated together. Regions can be selected in the sequence display at the bottom to see where they lie in the 3D Jmol display.



 

Source Code download

 

The source code of the RCSB Protein Comparison Tool is available for download and local installation from http://source.rcsb.org/jfatcatserver/download.jsp. It is released as open source under the LGPL license via the BioJava project and is hosted on GitHub.

 
 
 

Full Download

 

A tab-separated file containing the results for all structural representatives is available for download via FTP. Note: The file has a compressed size of several hundred MB.

ftp://resources.rcsb.org/fatcat_rigid_pdb_all/fatcat_rigid_pdb_all.txt.gz
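Because the file is large, it is usually more practical to stream it than to decompress it fully first. The following Python sketch reads a gzip-compressed, tab-separated stream row by row. It makes no assumption about the column layout, which should be checked against the downloaded file itself; the function name `iter_rows` is hypothetical.

```python
import csv
import gzip
import io


def iter_rows(binary_stream):
    """Yield each non-empty line of a gzip-compressed TSV stream as a list of fields."""
    with gzip.open(binary_stream, mode="rt", encoding="utf-8", newline="") as text:
        # QUOTE_NONE: treat quote characters as ordinary data, split on tabs only.
        reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if fields:
                yield fields
```

After downloading the file, it could be used as `with open("fatcat_rigid_pdb_all.txt.gz", "rb") as fh: for fields in iter_rows(fh): ...`, which keeps only one row in memory at a time.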

 

XML Download

 

The all vs. all structural similarity results table for a representative chain can be downloaded as XML. For example, the following URL returns Rank, PDB.Chain, Description, P-value, Score, RMSD, Len1, Len2, %Sim1, and %Sim2 for 3BMV.A:

/pdb/explorer/structCompXMLData.jsp?method=pw_fatcat&chain=d3bmva1&page=1&rows=10&prettyXML

Note: A maximum of 2000 rows can be returned per request. To fetch all of the approximately 17,000 results for a chain, you need to page through the results using the page parameter.
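Paging through the results amounts to generating one URL per page. The Python sketch below builds those URLs from the parameters shown in the example above. It assumes that page numbering starts at 1 (as in the example) and that the total number of results is known or estimated in advance; the `base` argument stands for the site's host, which is not given here and is left empty by default.

```python
from math import ceil
from urllib.parse import urlencode

PATH = "/pdb/explorer/structCompXMLData.jsp"
MAX_ROWS = 2000  # documented upper limit of rows per request


def page_urls(chain, total_results, rows=MAX_ROWS, base=""):
    """Return one URL per page needed to cover total_results rows."""
    if not 1 <= rows <= MAX_ROWS:
        raise ValueError("rows must be between 1 and %d" % MAX_ROWS)
    n_pages = ceil(total_results / rows)
    urls = []
    for page in range(1, n_pages + 1):  # assumption: pages are 1-indexed
        query = urlencode(
            {"method": "pw_fatcat", "chain": chain, "page": page, "rows": rows}
        )
        urls.append(f"{base}{PATH}?{query}&prettyXML")
    return urls
```

For about 17,000 results at 2000 rows per request, `page_urls("d3bmva1", 17000)` yields nine URLs, with the last page only partly filled.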



 

Algorithms

 

A number of algorithms are provided for structural comparison. The precalculated results are based on FATCAT-rigid. The downloadable Protein Comparison Tool can use CE, CE-CP, FATCAT-rigid, and FATCAT-flexible for structural comparisons, as well as the Smith-Waterman algorithm for sequence alignment. The website offers additional services for pairwise alignments, including TM-align, TopMatch, and Dali through external servers.

FATCAT

The all vs. all comparisons are based on jFATCAT, a Java port of the original FATCAT algorithm; see:

Yuzhen Ye & Adam Godzik (2003)
Flexible structure alignment by chaining aligned fragment pairs allowing twists.  
Bioinformatics vol.19 suppl. 2. ii246-ii255.

Two flavors of jFATCAT are available. FATCAT-rigid uses a rigid-body superposition to align the two structures. FATCAT-flexible introduces 'twists' between different parts of the proteins which are superimposed independently. This is ideal for proteins which undergo large conformational shifts, where a global superposition cannot capture the underlying similarity between domains. For instance, the structures of calmodulin with and without calcium bound can be much better aligned with FATCAT-flexible than with one of the rigid alignment algorithms. The downside of this is that it can lead to additional false positives in unrelated structures.

 

Combinatorial Extension (CE)

The RCSB PDB Protein Comparison Tool also provides jCE, a Java port of the CE algorithm; see:

I.N. Shindyalov, P.E. Bourne (1998)
Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. 
Protein Eng 11: 739-747

CE performs a rigid-body superposition of the proteins, similar to FATCAT-rigid.

The tool also provides CE with Circular Permutations (CE-CP). CE and FATCAT both assume that aligned residues occur in the same order in both proteins (i.e., both are sequence-order-dependent algorithms). In proteins related by a circular permutation, the N-terminal part of one protein is related to the C-terminal part of the other, and vice versa. CE-CP allows circularly permuted proteins to be compared. For more information on circular permutations, see the Wikipedia or Molecule of the Month articles.
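The idea of a circular permutation can be illustrated with a toy example on plain strings. The Python sketch below is not the CE-CP algorithm, which works on 3D structures and tolerates mismatches and gaps; it only shows that an order-preserving comparison misses the relationship, while searching one sequence inside a doubled copy of the other finds it. Function names and sequences are hypothetical.

```python
def permutation_offset(a, b):
    """Return the index in `a` at which `b` starts when `a` is read circularly.

    Returns None if `b` is not an exact circular permutation of `a`.
    """
    if len(a) != len(b):
        return None
    # Doubling `a` makes every rotation of `a` appear as a contiguous substring.
    index = (a + a).find(b)
    return index if index >= 0 else None


def is_circular_permutation(a, b):
    """True if `b` equals `a` after moving a prefix of `a` to its end."""
    return permutation_offset(a, b) is not None
```

For `a = "ABCDEFGH"` and `b = "EFGHABCD"`, the N-terminal half of `a` matches the C-terminal half of `b` and vice versa, so a comparison that requires the same residue order end to end would align at most one half. `permutation_offset(a, b)` returns 4, the position in `a` where `b` begins.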

Example: Circular permutation of concanavalin A and peanut lectin:

View Example
 
 
 

The jCE and jFATCAT source code used by the RCSB PDB Protein Comparison Tool is also provided free of charge for local execution. It is available from http://source.rcsb.org.



 

Expert Mode for RCSB PDB Protein Comparison Tool

 

Want more control in using structure alignment algorithms? Would you like to better understand how the algorithms work by trying different parameter sets? The new jCE/jFATCAT user interface supports manipulation of low-level alignment parameters. This option is for experts that have a basic understanding of the alignment algorithms.

One example is the maximum gap size parameter G, which is used during the extension of Aligned Fragment Pairs in the CE algorithm. By default it is set to 30, a trade-off between performance and accuracy. For the protein pair 1CDG.A and 1TIM.A, the default parameters cannot identify the whole TIM barrel that the two chains have in common. Removing the restriction on G (by setting it to 0) increases the calculation time but gives an alignment that is 25 residues longer.

Tip: To change parameters, launch the Align custom files menu and click the Parameters button.



 

PDB wide structure alignments

 

The Structure Alignment Tool also provides functionality for PDB-wide structural searches. This systematically compares a query structure against all representative structures in the PDB. Comparisons can be made to either representative chains or domains, as described above.

To do a PDB-wide structure alignment, use the 'Database Search' panel of the Structure Alignment Tool. The selected output directory will be used to store results. These consist of individual alignments in compressed XML format, as well as a tab-delimited file of similarity scores and statistics. The statistics are displayed in an interactive results table, which allows the alignments to be sorted. The 'Align' column allows individual alignments to be visualized with the alignment GUI.



 

Reference

 

RCSB PDB Comparison Tool Reference
Andreas Prlic; Spencer Bliven; Peter W. Rose; Wolfgang F. Bluhm; Chris Bizon; Adam Godzik; Philip E. Bourne (2010)
Pre-calculated protein structure alignments at the RCSB PDB website
Bioinformatics 26: 2983-2985

 