A super position
A very common task for any protein crystallographer
is the superpositioning of macromolecules which are
more or less identical (NCS-related molecules, identical
structures solved in different labs, in different spacegroups
or with different methods, mutants or complexes of
a certain protein, etc.) or display structural similarities.
If two molecules have ~90 % or more sequence identity,
an explicit superpositioning of the two structures
can be carried out by specifying which atoms are to
be matched in the two molecules, and calculating the
rotation/translation operator which minimises the sum
of the (squares of the) distances between corresponding
atoms. If the sequences are more distantly related,
or even unrelated, the problem becomes less trivial.
In particular, the definition of structural similarity becomes
ambiguous, and many papers have been written about methods for
aligning such structures.
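As a concrete illustration of the explicit superpositioning described above, the least-squares operator for two pre-matched atom lists can be computed in a few lines. The sketch below is not LSQMAN's code. To stay free of external libraries it uses Horn's quaternion formulation with a plain power iteration, which is a swapped-in technique and not necessarily what the program does. The names fit_operator, apply_operator and rms_distance, and the five test "atoms", are hypothetical.

```python
import math


def centroid(points):
    """Mean position of a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def fit_operator(moving, fixed, n_iter=500):
    """Return (rot, trans) minimising sum |rot * moving_i + trans - fixed_i|^2.

    Assumes both lists are already paired atom by atom.
    """
    cm, cf = centroid(moving), centroid(fixed)
    # s[a][b] = sum over atoms of moving[a] * fixed[b], after centring.
    s = [[0.0] * 3 for _ in range(3)]
    for p, q in zip(moving, fixed):
        for a in range(3):
            for b in range(3):
                s[a][b] += (p[a] - cm[a]) * (q[b] - cf[b])
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    # Horn's symmetric 4x4 matrix; the eigenvector of its largest
    # eigenvalue is the unit quaternion (w, x, y, z) of the best rotation.
    horn = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    # Shifting by a Gershgorin bound makes every eigenvalue non-negative,
    # so power iteration converges towards the largest one. This is a
    # simple stand-in for a proper eigen-solver.
    shift = max(sum(abs(v) for v in row) for row in horn) or 1.0
    quat = [1.0, 0.5, 0.25, 0.125]  # arbitrary non-zero start vector
    for _ in range(n_iter):
        new = [sum(horn[r][c] * quat[c] for c in range(4)) + shift * quat[r]
               for r in range(4)]
        norm = math.sqrt(sum(v * v for v in new))
        quat = [v / norm for v in new]
    w, x, y, z = quat
    rot = [
        [w * w + x * x - y * y - z * z, 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), w * w - x * x + y * y - z * z, 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), w * w - x * x - y * y + z * z],
    ]
    trans = tuple(cf[r] - sum(rot[r][c] * cm[c] for c in range(3))
                  for r in range(3))
    return rot, trans


def apply_operator(rot, trans, point):
    """Rotate, then translate, a single (x, y, z) point."""
    return tuple(sum(rot[r][c] * point[c] for c in range(3)) + trans[r]
                 for r in range(3))


def rms_distance(a, b):
    """Root-mean-square distance between two paired coordinate lists."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


# Hypothetical data: five "atoms" and a copy rotated by 90 degrees
# about z and then shifted.
fixed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0),
         (0.0, 0.0, 3.0), (1.0, 1.0, 1.0)]
moving = [(-y + 5.0, x - 2.0, z + 1.0) for x, y, z in fixed]

rot, trans = fit_operator(moving, fixed)
fitted = [apply_operator(rot, trans, p) for p in moving]
print(f"RMS distance before fit: {rms_distance(moving, fixed):.3f} A")
print(f"RMS distance after fit : {rms_distance(fitted, fixed):.6f} A")
```

Restricting the fit to particular atom types then amounts to filtering the two coordinate lists before they are passed to fit_operator.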
We have written a program called LSQMAN which can be
used to obtain "optimal" structural alignments
of structures with any level of sequence homology,
where the definition of "optimal" can be
largely controlled by the user. The program was originally
written in order to quickly sort out "good"
from "bad" hits found by DEJAVU [1], our
program to detect folding similarities, before analysing
them in detail on the display using O [2, 3]. Nevertheless,
the program can be used independently from both DEJAVU
and O, and it can be used just as easily with proteins,
nucleic acids and other molecules.
The simplest task, superimposing molecules given two
sets of atoms which should be matched, is easily accomplished
[4, 5]. The implementation is very similar to that
used in O, except that the atom types which should
be used are freely definable. This means that one
may use, for instance, only Cα atoms, backbone atoms,
all (non-hydrogen) atoms, or a set of user-defined
atom types (e.g., in the case of nucleic acids or small
molecules). An example:
LSQMAN > ex m1 a1-999 m1 b1
WARNING - mol1 == mol2 !
Explicit fit of M1 A1-999
And M1 B1
Atom types |NONH|
Nr of atoms to match : ( 3499)
The 3499 atoms have an RMS distance of 2.311 A
RMS delta B = 7.802 A2
Corr. coeff. = 0.9031
Rotation : 0.382393 -0.058393 0.922153
-0.033219 -0.998225 -0.049435
0.923402 -0.011729 -0.383654
Translation : 5.715 16.617 -8.061
Note that, apart from the RMS distance of the atoms after
superpositioning, the RMS ΔB and the linear correlation coefficient of
the temperature factors of the matched atoms are calculated
as well. In the case of NCS-related molecules, and
that of very similar molecules, one would expect RMS
ΔB to be of the order of ~3-5 Å², and the correlation
coefficient to be greater than ~0.95.
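The two temperature-factor statistics are easy to reproduce. The sketch below assumes that "RMS delta B" is the root-mean-square of the pairwise B-factor differences and that the correlation coefficient is the ordinary Pearson coefficient; the source does not spell out either definition. The function names and the four B-factor pairs are made up for illustration.

```python
import math


def rms_delta_b(b1, b2):
    """Root-mean-square difference of paired B-factors (assumed definition)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(b1, b2)) / len(b1))


def b_correlation(b1, b2):
    """Pearson linear correlation coefficient of paired B-factors."""
    n = len(b1)
    mean1, mean2 = sum(b1) / n, sum(b2) / n
    cov = sum((x - mean1) * (y - mean2) for x, y in zip(b1, b2))
    var1 = sum((x - mean1) ** 2 for x in b1)
    var2 = sum((y - mean2) ** 2 for y in b2)
    return cov / math.sqrt(var1 * var2)


# Hypothetical B-factors (in A^2) of four matched atoms in two molecules.
b_mol1 = [10.0, 20.0, 30.0, 40.0]
b_mol2 = [12.0, 18.0, 33.0, 41.0]

rms_db = rms_delta_b(b_mol1, b_mol2)
corr = b_correlation(b_mol1, b_mol2)
print(f"RMS delta B  = {rms_db:.3f} A2")
print(f"Corr. coeff. = {corr:.4f}")
```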
Note that LSQMAN cannot automatically detect the optimal
alignment of two molecules as some other programs do
[6]. Usually, sets of matching atoms are either trivial
to define (e.g., NCS-related molecules, mutants, complexes),
or non-trivial. In the latter case, we use DEJAVU
[1] first to carry out a rough alignment of the secondary-structure
elements of the protein of interest and all other proteins
in the PDB that appear to show structural similarities.
The rough alignments are then improved with LSQMAN.
* IMPROVING OPERATORS
Optimal alignment of structures with low sequence homology
is somewhat arbitrary, since "optimal" involves
both the number of structurally equivalent residues,
and their RMS distance after alignment. LSQMAN uses
an operator-improvement algorithm similar to that employed
by O [2, 3], i.e.: using an initial operator, consecutive
fragments of residues (using their Cα atoms, for example)
are located whose length exceeds a certain minimum
number of residues, and whose distance to the corresponding
atoms is less than a certain cut-off. These fragments
are used to calculate a new, explicit operator, and
the process is iterated until it converges. Note that
this algorithm is insensitive to sequence gaps so that
it can be used both to find the best-conserved fragments
in similar molecules, and to find the common core of
two completely different molecules. The implementation
in LSQMAN contains some extra "embellishments":
* a sequentiality constraint (optional). If two proteins
have a common motif with the same topology, this is
a useful constraint; on the other hand, if two structures
contain similar arrangements of helices and strands,
but in a different order in their sequences, this constraint
should be switched off.
* the two cut-offs (minimum number of consecutive residues
in matched fragments, and maximum distance between
equivalenced atoms) can either be kept fixed, or allowed
to "decay". For example, one could start
with a distance cut-off of 4 Å to get the overall
operator relating the two molecules, and then multiply
this cut-off by a factor of 0.95 in every iteration
to "zoom in" on the structurally most similar
core fragments of the two.
* the optimisation criterion can be selected by the
user. At present, LSQMAN can optimise: (1) the number
of matched residues (maximise); (2) the RMS distance
of the matched residues (minimise); (3) the Similarity
Index (SI; minimise); or (4) the Match Index (MI; maximise).
The Similarity Index is defined as:
SI = RMSD * min(N1, N2) / Nm
where N1 and N2 = numbers of residues in molecules 1 and 2,
Nm = number of matched residues, and RMSD = their RMS
distance. SI assumes values >= 0.0 Å; the
lower the value of SI, the better the fit and the more
similar the two molecules are. The Match Index is
defined as:
MI = (1 + Nm) / [(1 + W * RMSD) * (1 + min(N1, N2))]
where W is a positive weight (the higher the weight, the
bigger the influence of the RMSD on the value of MI;
suggested values for W are between 0.1 and 1). MI
assumes values between 0 and 1, where "0"
indicates a "perfect mis-match" and "1"
a perfect match.
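The fragment-selection step and the decaying cut-off described above can be sketched as follows. This is a simplified illustration, not LSQMAN's implementation: it assumes that residue i of one molecule is compared with residue i of the other and that the distances under the current operator are already known, whereas the real algorithm also copes with sequence gaps. The names find_fragments and decayed_cutoffs and the distance values are hypothetical.

```python
def find_fragments(distances, cutoff, min_length):
    """Return (start, end) index ranges, end exclusive, of consecutive
    residues whose distance is below `cutoff` and whose run length is
    at least `min_length`."""
    fragments = []
    start = None
    # A sentinel distance closes a run that reaches the last residue.
    for i, d in enumerate(list(distances) + [float("inf")]):
        if d < cutoff:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                fragments.append((start, i))
            start = None
    return fragments


def decayed_cutoffs(initial, factor, n_cycles):
    """Cut-off used in each cycle when it is multiplied by `factor`."""
    return [initial * factor ** cycle for cycle in range(n_cycles)]


# Hypothetical per-residue distances (in A) under the current operator.
distances = [0.5, 0.8, 5.0, 1.0, 1.2, 0.9, 6.0, 0.4]
fragments = find_fragments(distances, cutoff=3.8, min_length=2)
print(fragments)  # [(0, 2), (3, 6)]

cutoffs = decayed_cutoffs(4.0, 0.95, 3)
print([round(c, 3) for c in cutoffs])
```

In a full improvement cycle the atoms in the selected fragments would be used to compute a new explicit operator, the distances would be recalculated, and the loop would repeat until the chosen criterion stops improving.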
After the operator improvement has converged (or a maximum
number of cycles has been carried out), the structure-based
sequence alignment is printed. The matched residues
are shown, along with the distance of the atoms that
were used (usually, Cα atoms). If two residues are
of the same type, an asterisk is printed as well.
Also, some statistics pertaining to the number and percentage
of matched and conserved residues are printed. An
example:
Found fragment of length : ( 53)
Found fragment of length : ( 260)
Found fragment of length : ( 57)
Found fragment of length : ( 59)
Cycle : ( 10)
Distance cut-off (A) : ( 3.800)
Min fragment length (res) : ( 5)
The 428 atoms have an RMS distance of 0.946 A
SI = RMS * Nmin / Nmatch = 1.01260
MI = (1+Nmatch)/(1+W*RMS)*(1+Nmin) = 0.48022
RMS delta B for matched atoms = 7.610 A2
Corr. coefficient matched atom Bs = 0.908
Rotation : 0.38169697 -0.06605943 0.92192382
-0.04122496 -0.99766684 -0.05441866
0.92336768 -0.01723484 -0.38352972
Translation : 5.7764 17.2442 -8.0352
Fragment SER-A 4 <===> SER-B 4 @ 2.43 A *
SER-A 5 <===> SER-B 5 @ 1.11 A *
ARG-A 6 <===> ARG-B 6 @ 1.19 A *
TYR-A 7 <===> TYR-B 7 @ 0.40 A *
VAL-A 8 <===> VAL-B 8 @ 0.49 A *
ASN-A 9 <===> ASN-B 9 @ 0.21 A *
LEU-A 10 <===> LEU-B 10 @ 0.90 A *
[...]
GLY-A 456 <===> GLY-B 456 @ 3.40 A *
VAL-A 457 <===> VAL-B 457 @ 3.68 A *
Nr of residues in mol1 : ( 459)
Nr of residues in mol2 : ( 458)
Nr of matched residues : ( 428)
Nr of identical residues : ( 428)
% identical of matched : ( 100.000)
% matched of mol1 : ( 93.246)
% identical of mol1 : ( 93.246)
% matched of mol2 : ( 93.450)
% identical of mol2 : ( 93.450)
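As a check, the SI and MI values in this listing follow from the definitions given earlier. The snippet below plugs in the printed numbers. W = 1 is an assumption: the weight is not shown in the listing, but that value reproduces the printed MI. The small remaining discrepancies arise because the RMS distance is printed to only three decimals. The function names are illustrative.

```python
def similarity_index(rmsd, n_matched, n1, n2):
    """SI = RMSD * min(N1, N2) / Nm (lower is better)."""
    return rmsd * min(n1, n2) / n_matched


def match_index(rmsd, n_matched, n1, n2, weight):
    """MI = (1 + Nm) / [(1 + W * RMSD) * (1 + min(N1, N2))] (higher is better)."""
    return (1 + n_matched) / ((1 + weight * rmsd) * (1 + min(n1, n2)))


# Numbers taken from the listing; the weight of 1.0 is an assumption.
rmsd, n_matched, n1, n2 = 0.946, 428, 459, 458

si = similarity_index(rmsd, n_matched, n1, n2)
mi = match_index(rmsd, n_matched, n1, n2, weight=1.0)
print(f"SI = {si:.5f}")
print(f"MI = {mi:.5f}")
```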
Statistics can be obtained with the SHow_operator command:
The 428 atoms have an RMS distance of 0.946 A
SI = RMS * Nmin / Nmatch = 1.01260
MI = (1+Nmatch)/(1+W*RMS)*(1+Nmin) = 0.48022
RMS delta B for matched atoms = 7.610 A2
Corr. coefficient matched atom Bs = 0.908
[...]
NCSOP 1 = 0.3816970 -0.0412250 0.9233677
5.776
-0.0660594 -0.9976668 -0.0172348
17.244
0.9219238 -0.0544187 -0.3835297
-8.035
Determinant of rotation matrix = 1.000000
Crowther Alpha Beta Gamma 178.93069 -112.55250
3.37809
Spherical polars Omega Phi Chi 123.71790 177.77631
178.71825
Direction cosines of rotation axis -0.83114
0.03227 -0.55510
Dave Smith -2.57299 -22.79103
-173.83571
Rotation angle = 178.718246
*POOR* - NCS not restrained
*POOR* - NCS Bs not restrained
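The determinant and the rotation angle reported above can be verified directly from the rotation matrix. The sketch below uses the matrix as printed in the operator-improvement listing and the standard relation cos(angle) = (trace - 1) / 2 for a proper rotation. The function names are illustrative only.

```python
import math

# Rotation matrix copied from the operator-improvement listing.
ROTATION = [
    [0.38169697, -0.06605943, 0.92192382],
    [-0.04122496, -0.99766684, -0.05441866],
    [0.92336768, -0.01723484, -0.38352972],
]


def determinant(m):
    """Determinant of a 3x3 matrix; +1 for a proper rotation."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def rotation_angle(m):
    """Rotation angle in degrees from the trace of a rotation matrix."""
    trace = m[0][0] + m[1][1] + m[2][2]
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


det = determinant(ROTATION)
angle = rotation_angle(ROTATION)
print(f"Determinant    = {det:.6f}")
print(f"Rotation angle = {angle:.3f}")
```

An angle this close to 180 degrees is what one would expect for a near-perfect twofold NCS axis.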
* OTHER FEATURES
After operator improvement, for example using Cα atoms,
the RMS distance of any set of atoms can be calculated
with the RMsd_calc command. Operators can be stored
as, or read from, O datablock files; they can be edited,
and they can be applied to a molecule, for example
for display purposes. In addition, an O macro can
be generated automatically which will read the appropriate
PDB files, apply the current operator(s), and display
the Cα traces of the superimposed molecules.
LSQMAN was originally written as a fast filter between
DEJAVU and O. DEJAVU looks for proteins in the PDB
which appear to display structural similarities to
another protein [1]. However, often many false hits
are found, especially if only weak similarities are
present (in which case one has to use very relaxed
search criteria). DEJAVU produces an O macro to carry
out the structural alignment and to display the hits,
but this takes quite some time to execute. Using LSQMAN
in between, one has a very quick means of separating
"the men from the boys". Therefore, DEJAVU
can be instructed to produce an input file for LSQMAN
to carry out the operator improvement stage for all
hits. LSQMAN, in turn, will produce an O macro to
display the hits superimposed on the search structure
using the automatically improved operators. This macro
can be edited to remove all hits which were false (recognised
by few matched residues and/or a very large
RMSD).
Finally, an interesting way of analysing differences
between similar molecules is provided by the option
to produce a DPHI/DPSI plot (essentially, a "difference
Balasubramanian plot"), as suggested by Korn and
Rose [7]. Plots of RMSD as a function of residue number
usually fail to discriminate between "random"
differences and localised differences. For instance,
if a structure has undergone a domain movement around
one or two hinge residues, and one superimposes the
structures using only one domain, the other domain
will have high RMSDs, even though the secondary and
tertiary structure of the second domain as a whole
may be conserved. In such a case, one would expect
the DPHI/DPSI plot to be fairly flat, with some spikes
at the hinge residues. Also, DPHI/DPSI plots show
peptide flips between two structures as isolated spikes
(the PSI angle of residue i, and the PHI angle of residue
i+1 will differ by more than ~150°). On the other
hand, if two NCS-related molecules have been heavily
over-modelled (quite common at low resolution when
no NCS constraints or restraints were used in the refinement),
this will show up as a very noisy DPHI/DPSI plot.
This plot facility, together with the calculation of
RMSD and RMS ΔB values, makes LSQMAN a useful tool
for analysing NCS-related molecules as well, and for
assessing the quality of their refinement.
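The angle bookkeeping behind such a plot is simple once the torsion angles are known. The sketch below assumes that (PHI, PSI) pairs in degrees have already been extracted for equivalent residues of the two molecules. The helper names and the four-residue data set are hypothetical, and the 150-degree threshold is the one mentioned above for peptide flips.

```python
def angle_difference(a, b):
    """Signed difference a - b in degrees, wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def dphi_dpsi(torsions1, torsions2):
    """Per-residue (DPHI, DPSI) for two lists of (PHI, PSI) pairs."""
    return [
        (angle_difference(phi1, phi2), angle_difference(psi1, psi2))
        for (phi1, psi1), (phi2, psi2) in zip(torsions1, torsions2)
    ]


def peptide_flips(deltas, threshold=150.0):
    """Indices i where |DPSI(i)| and |DPHI(i+1)| both exceed the threshold."""
    return [
        i for i in range(len(deltas) - 1)
        if abs(deltas[i][1]) > threshold and abs(deltas[i + 1][0]) > threshold
    ]


# Hypothetical (PHI, PSI) values for four equivalent residues.
mol1 = [(-60.0, -45.0), (-65.0, -40.0), (-120.0, 130.0), (-70.0, 150.0)]
mol2 = [(-62.0, -43.0), (-63.0, 140.0), (60.0, 128.0), (-72.0, 148.0)]

deltas = dphi_dpsi(mol1, mol2)
flips = peptide_flips(deltas)
for i, (dphi, dpsi) in enumerate(deltas):
    print(f"residue {i}: DPHI = {dphi:7.1f}  DPSI = {dpsi:7.1f}")
print("peptide flip after residue index:", flips)
```

Wrapping the differences matters: without it, angles of 170 and -170 degrees would appear to differ by 340 degrees instead of 20.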
* AVAILABILITY
LSQMAN is one in a series of "O-dalisques",
i.e. programs that work in conjunction with O. LSQMAN
runs on SGI, ESV and DEC ALPHA/OSF1 workstations.
For more information, contact GJK (E-mail: "gerard@xray.bmc.uu.se").
* REFERENCES
[1] G.J. Kleywegt & T.A. Jones, in "From First
Map to Final Model" (S. Bailey, R. Hubbard &
D. Waller, Eds.), SERC Daresbury Laboratory (1994),
pp. 59-66.
[2] T.A. Jones, J.Y. Zou, S.W. Cowan & M. Kjeldgaard,
Acta Cryst. A47 (1991), 110-119.
[3] T.A. Jones & M. Kjeldgaard, "O - the manual",
Uppsala (1994).
[4] W. Kabsch, Acta Cryst. A32 (1976), 922-923.
[5] W. Kabsch, Acta Cryst. A34 (1978), 827-828.
[6] K. Diederichs, J. Appl. Cryst. 27 (1994), 436.
[7] A.P. Korn & D.R. Rose, Prot. Engin. 7 (1994),
961-967.
[8] I. Sinning, G.J. Kleywegt, S.W. Cowan, P. Reinemer,
H.W. Dirr, R. Huber, G.L. Gilliland, R.N. Armstrong,
X. Ji, P.G. Board, B. Olin, B. Mannervik & T.A.
Jones, J. Mol. Biol. 232 (1993), 192-212.
Latest update at 12 February, 1998.