Clustal
V  Multiple Sequence Alignments.
 
            Documentation (Installation and Usage).
 
            Des Higgins
            European Molecular Biology Laboratory
            Postfach 10.2209
            D-6900 Heidelberg
            Germany.
 
            higgins@EMBL-Heidelberg.DE
 
 
*******************************************************************
 
 
            Contents.
 
 
            1           Overview
 
            2           Installation
 
            3           Interactive usage
 
            4           Command-line interface
 
            5           Algorithms and references
 
 
*******************************************************************
 
            1. 
Overview
 
This document describes how to install and use
ClustalV on various 
machines. 
ClustalV is a complete upgrade and rewrite of the Clustal 
package of multiple alignment programs (Higgins and
Sharp, 1988 and 
1989).   The
original programs were written in Fortran for 
microcomputers running MSDOS.   You carried out a complete alignment 
by running 3 programs in succession.   Later, these were merged into 
a single menu driven program with on-line help, for
VAX/VMS.  
ClustalV was written in C and has all of the features
of the old 
programs plus many new ones.  It has been compiled and tested using 
VAX/VMS C, Decstation ULTRIX C, Gnu C for Sun
workstations, Turbo C 
for IBM PC's and Think C for Apple Mac's.   The original Clustal was 
written by Des Higgins while he was a Post-Doc in the
lab of Paul 
Sharp in the Genetics Department, Trinity College,
Dublin 2, 
Ireland. 
 
The main feature of the old package was the ability to
carry out 
reliable multiple alignments of many sequences.  The sensitivity of 
the program is as good as from any other program we
have tried, with 
the exception of the programs of Vingron and Argos
(1991), while it 
works in reasonable time on a microcomputer.  The programs of 
Vingron and Argos are specialised for finding distant
similarities 
between proteins but require mainframes or
workstations and are more 
difficult to use.
 
The main new features are: profile alignments
(alignments of old 
alignments); phylogenetic trees (Neighbor Joining
trees calculated 
after multiple alignment with a bootstrapping option);
better 
sequence input (automatically recognise and read
NBRF/PIR, Pearson 
(Fasta) or EMBL/SwissProt formats); flexible alignment
output 
(choose one of: old Clustal format, NBRF/PIR, GCG msf
format or 
Phylip format); full command line interface
(everything that you can 
do interactively can be specified on the command
line).
 
In version 7 of the GCG package, there is a program
called PILEUP 
which uses a very similar algorithm to the one in
ClustalV.  There 
are 2 main differences between the programs: 1) the
metric used to 
compare the sequences for the initial "guide
tree" uses a full 
global, optimal alignment in PILEUP instead of the
fast, approximate 
ones in ClustalV. 
This makes PILEUP much slower for the comparison 
of long sequences. 
In principle, the distances calculated from 
PILEUP will be more sensitive than ours, but in
practice it will not 
make much difference, except in difficult cases.  2) 
During the 
multiple alignment, terminal gaps are penalised in
ClustalV but not 
in PILEUP. 
This will make the PILEUP alignments better when the 
sequences are of very different lengths (has no effect
if there are 
no large terminal gaps).   
 
 
This software may be distributed and used freely,
provided that you 
do not modify it or this documentation in any way
without the 
permission of the authors.  
 
If you wish to refer to ClustalV, please cite: 
Higgins,D.G. Bleasby,A.J. and Fuchs,R. (1991) CLUSTAL
V: improved software
for multiple sequence alignment.  ms. submitted to CABIOS.  
 
The overall multiple alignment algorithm was described
in:
Higgins,D.G. and Sharp,P.M. (1989).  Fast and sensitive multiple 
sequence alignments on a microcomputer.  CABIOS, vol. 5, 151-153.
 
 
ACKNOWLEDGEMENTS.
 
D.H. would particularly like to thank Paul Sharp, in
whose lab. this 
work originated. 
We also thank Manolo Gouy, Gene Myers, Peter Rice 
and Martin Vingron for suggestions, bug-fixes and
help.    
 
Des Higgins and Rainer Fuchs, 
EMBL Data Library, Heidelberg, Germany.
 
Alan Bleasby,  
Daresbury, UK.
 
JUNE 1991
*******************************************************************
 
            2. 
Installation.
 
 
 
As far as possible, we have tried to make ClustalV
portable to any 
machine with a standard C compiler (proposed ANSI C
standard).  The 
source code, as supplied by us, has been compiled and
tested using 
the following compilers:
 
VAX/VMS C
Ultrix C (on a Decstation 2100)
Gnu C on a Sun 4 workstation
Think C on an Apple Macintosh SE
Turbo C on an IBM AT.
 
In each case, one must make 1 change to 1 line of code
in 1 header 
file.  This is
described below.  The exact capacity of
the program 
(how many sequences of what length can be aligned)
will depend of 
course on available memory but can also be set in this
header file.
 
The package comes as 9 C source files; 3 header files;
1 file of on-
line help; this documentation file; 3 make files:
 
Source code:      clustalv.c,
amenu.c, gcgcheck.c, myers.c, sequence.c, 
                  showpair.c, trees.c, upgma.c, util.c
 
Header files:      clustalv.h,
general.h, matrices.h
 
On-Line help:      clustalv.hlp  (must be renamed or defined as        
                  clustalv_help except on PC's)
 
Documentation:      clustalv.doc
(this file).
 
Makefiles:      makefile.sun
(gnu c on Sun), vmslink.com (vax/vms), 
                  makefile.ult (ultrix).
 
 
 
 
 
 
 
Before compiling ClustalV you must look at and
possibly change 
clustalV.h, shown below..  
 
/*******************CLUSTALV.H********************************/
 
/*
Main header file for ClustalV. Uncomment ONE of the
following lines
depending on which compiler you wish to use.
*/
 
#define VMS 1             /* VAX VMS */
 
/*#define MAC 1           Think_C for MacIntosh */
 
/*#define MSDOS 1         Turbo C for PC's */
 
/*#define
UNIX 1          Ultrix for Decstations
or Gnu C for Sun */
 
/*************************************************************/
 
#include
"general.h"
 
#define
MAXNAMES          10
#define
MAXTITLES         60
#define
FILENAMELEN      256
 
#define
UNKNOWN   0
#define
EMBLSWISS 1
#define
PIR       2
#define
PEARSON   3
 
#define
PAGE_LEN       22
 
#if
VMS
#define
DIRDELIM ']'
#define
MAXLEN          3000
#define
MAXN             150
#define
FSIZE          15000
#define
LINELENGTH        60
#define
GCG_LINELENGTH    50
 
#elif
MAC
#define
DIRDELIM ':'
#define
MAXLEN          2600
#define
MAXN              30
#define
FSIZE          10000
#define
LINELENGTH        50
#define
GCG_LINELENGTH    50
 
#elif
MSDOS
#define
DIRDELIM '\\'
#define
MAXLEN          1300
#define
MAXN              30
#define
FSIZE           5000
#define
LINELENGTH        50
#define
GCG_LINELENGTH    50
 
#elif
UNIX
#define
DIRDELIM '/'
#define
MAXLEN         3000
#define
MAXN             50
#define
FSIZE         15000
#define
LINELENGTH       60
#define
GCG_LINELENGTH   50
#endif
/*****************end*of*CLUSTALV.H***************************/
 
 
 
First,
you must remove the comments from one of the first 10 lines.  
There
are 4 'define' compiler directives here (e.g. #define VMS 1), 
and
you should use one of these, depending on which system you wish 
to
work. So choose one of these, remove its comments (if it is 
already
commented out) and put comments around any of the others 
that
are still active. If you wish to use a different system, you 
will
need to insert a new line with a new keyword (which you must 
invent)
to identify your system.  Most of the
rest of this header 
file
is taken up with a block of 'define' statements for each system 
type;
e.g. the VAX/VMS block is:
 
#if
VMS
#define
DIRDELIM ']'
#define
MAXLEN          3000
#define
MAXN             150
#define
FSIZE          15000
#define
LINELENGTH        60
#define
GCG_LINELENGTH    50
 
In
this block, you can specify the maximum number of sequences to be 
allowed
(MAXN); the maximum sequence length, including gaps 
(MAXLEN);  FSIZE declares the size of some workspace,
used by the 
fast
2 sequence comparison routines and should be APPROXIMATELY 4 
times
MAXLEN; LINELENGTH is the length of the blocks of alignment 
output
in the output files; GCG_LINELENGTH is the same but for the 
GCG
compatible output only.  Finally,
DIRDELIM is the character used 
to
specify directories and subdirectories in file names.  It should 
be
the character used to seperate the file name itself from the 
directory
name (e.g. in VMS, file names are like: 
$drive:[dir1.dir2.dir3]filename.ext;2  so ']' is used as DIRDELIM).   
 
So,
if you want to use a system, not covered in Clustalv.h, you will 
have
to insert a new block, like the above one. 
To compile and link 
the
program, we supply 3 makefiles: one each for VAX/VMS, Ultrix 
and
GNU C for Sun workstations. 
 
 
 
VAX/VMS
 
Compile
and link the program with the 
supplied
makefile for vms: vmslink.com .
 
$
@vmslink
 
This
will produce clustalv.exe (and a lot of .obj files which you can delete).  
 
The
on-line help file (clustalv.hlp) should be 'defined' as 
clustalv_help
as follows:
 
$
def clustalv_help $drive:[dir1.dir2]clustalv.hlp 
 
where
$drive is the drive designation and [dir1.dir2] is the 
directory
where clustalv.hlp is kept.  
 
To
make use of the command-line interface, you must make clustalv a 
'foreign'
command with:
 
$
clustalv :== $$drive:[dir1.dir2]clustalv
 
where
$drive is the drive designation and [dir1.dir2] is the 
directory
where clustalv.exe is kept.  
 
 
 
IBM
PC/MSDOS/TURBO C
 
Create
a makefile (something.prj) with the names of the source files 
(clustalv.c,
amenu.c etc.) and 'make' this using the HUGE memory 
model.  You will get half a dozen warnings from the
compiler about 
pieces
of code than look suspicious to it but ignore these.  The 
help
file should remain as clustalv.hlp .  
To run the program using 
the
default settings in Clustalv.h, you need approximately 500k of 
memory.  To reduce this, the main influence on memory
usage is the 
parameter
MAXLEN; reduce MAXLEN to reduce memory usage.
 
 
 
Apple
Mac/THINK_C version 4.0.2
 
This
version of the program is not at all Mac like. 
It runs in a 
window,
the inside of which looks just like a normal character based 
terminal.  In the future we might put a proper Mac
interface on it 
but
do not have the time right now.  With
the default settings in 
the
header file ClustalV.h, you need just over 800k of memory to run 
the
program.  To reduce this, reduce MAXLEN;
this is easily the 
biggest
influence on memory usage.  To compile
the program and save 
it
as an application you need to 'set the application type'; here 
you
specify how much memory (in kilobytes (k)) the application will 
need.  You should set this to 900k to run the
application as it is 
OR
reduce MAXLEN in the header.  To compile
the program you have to 
create
a 'project'; you 'add' the names of the 9 source files to the 
project
AND the name of the ANSI library.  The
source code is too 
large
to compile in one compilation unit.  You
will get a 'link 
error:
code segment too big' if you try to compile and link as is.  
You
should compile amenu.c (the biggest source file) as a seperate 
unit
..... you will have to read the manual/ask someone/mail me to 
find
out what this is.
 
 
*******************************************************************
 
            3.  Interactive usage.
 
 
 
Interactive
usage of Clustal V is completely menu driven. 
On-line 
help
is provided, defaults are offered for all parameters and file 
names.  With a little effort it should be completely
self 
explanatory.   The main menu, which appears when you run
the 
programs
is shown below.  Each item brings you to
a sub menu.
 
 
 
Main
menu for Clustal V:
 
 
     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile Alignments
     4. Phylogenetic trees
 
     S. Execute a system command
     H. HELP
     X. EXIT (leave program)
 
 
Your
choice: 
 
 
 
The
options S and H appear on all the main menus. 
H will provide 
help
and if you type S you will be asked to enter a command, such as 
DIR
or LS, which will be sent to the system (does not work on 
Mac's).  Before carrying out an alignment, you must
use option 1 
(sequence
input); the format for sequences is explained below.  
Under
menu item 2 you will be able to automatically align your 
sequences
to each other.  Menu item 3 allows you
to do profile 
alignments.  These are alignments of old alignments.  This allows 
you
to build up a multiple alignment in stages or add a new sequence 
to
an old alignment.   You can calculate phylogenetic
trees from 
alignments
using menu item 4.
 
 
 
 
      ******************************
      *      
SEQUENCE INPUT.      *
      ******************************
 
 
All
sequences should be in 1 file.  Three
formats are automatically 
recognised
and used: NBRF/PIR, EMBL/SwissProt and FASTA (Pearson and 
Lipman
(1988) format).   
 
***
Users
of the Wisconsin GCG package should use the command TONBRF 
(recently
changed to TOPIR) to reformat their sequences before use. 
***
 
Sequences
can be in upper or lower case.  For
proteins, the only 
symbols
recognised are: 
A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y and 
for
DNA/RNA use: A,C,G and T (or U).  Any
other letters of the 
alphabet
will be treated as X (proteins) or N (DNA/RNA) for unknown.  
All
other symbols (blanks, digits etc.) will be ignored EXCEPT for 
the
hyphen "-" which can be used to specify a gap.  This last point 
is
especially useful for 2 reasons: 1) you can fix the positions of 
some
gaps in advance; 2) the alignment output from this program can 
be
written out in NBRF format using "-"'s to specify gaps; these 
alignments
can be used again as input, either for profile alignments 
or
for phylogenetic trees.
 
If
you are using an editor to create sequence files, use the FASTA 
format
as it is by far the simplest (see below). 
If you have access 
to
utility programs for generating/converting the NBRF/PIR format 
then
use it in preference.
 
 
 
FASTA
(PEARSON AND LIPMAN, 1988) FORMAT:    
The sequences are 
delimited
by an angle bracket ">" in column 1.  The text immediately 
after
the ">" is used as a title. 
Everything on the following line 
until
the next ">" or the end of the file is one sequence.
 
e.g.
 
>
RABSTOUT   rabbit Guinness receptor
   LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS
  
ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC
>
MUSNOSE   mouse nose drying factor
   
mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt
   
fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfdv
>
HSHEAVEN    human Guinness receptor
repeat
 mhkmmhkgmkhmhgmhmhg   lhmkmhlkmgkhmgkmk 
ytytytryrwtqtqwtwyt
 fdgfdsgafdagfdgfsag   dfavdfdvgavfsvfgv 
dfsvdgvagvfdv
 mhkmmhkgmkhmhgmhmhg   lhmkmhlkmgkhmgkmk 
ytytytryrwtqtqwtwyt
 fdgfdsgafdagfdgfsag   dfavdfdvgavfsvfgv 
dfsvdgvagvfdv
 
 
 
NBRF/PIR
FORMAT         is similar to FASTA
format but immediately 
after
the ">", you find the characters "P1;" if the sequences
are 
protein
or "DL;" if they are nucleic acid. 
Clustalv looks for the 
";"
character as the third character after the ">".  If it finds one 
it
assumes that the format is NBRF if not, FASTA format is assumed.  
The
text after the ";" is treated as a sequence name while the 
entire
next line is treated as a title.  The
sequence is terminated 
by
a star "*" and the next sequence can then begin (with a >P1; etc 
).  This is just the basic format description
(there are other 
variations
and rules).
 
ANY
files/sequences in GCG format can be converted to this format 
using
the TONBRF command (now TOPIR) of the Wisconsin GCG package.
 
 
e.g.
 
>P1;RABSTOUT
rabbit
Guinness receptor
LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS
ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC*
>P1;MUSNOSE   
mouse
nose drying factor
mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfd
*
>P1;HSHEAVEN    
human
Guinness receptor repeat protein.
mhkmmhkgmkhmhgmhmhg   lhmkmhlkmgkhmgkmk  ytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsag   dfavdfdvgavfsvfgv  dfsvdgvagvfdv
mhkmmhkgmkhmhgmhmhg   lhmkmhlkmgkhmgkmk  ytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsag   dfavdfdvgavfsvfgv  dfsvdgvagvfdv*
 
 
  
 
EMBL/SWISSPROT
FORMAT:       Do not try to create files
with this 
format
unless you have utilities to help.  If
you are just using an 
editor,
use one of the above formats.  If you do
use this format, 
the
program will ignore everything between the ID line (line 
beginning
with the characters "ID") and the SQ line.  The sequence 
is
then read from between the SQ line and the "//" characters.
 
 
 
It
is critically important for the program to know whether or not it 
is
aligning DNA or protein sequences.  The
input routines attempt to 
guess
which type of sequence is being used by counting the number of 
A,C,G,T
or U's in the sequences.  If the total
is more than 85% of 
the
sequence length then DNA is assumed.  If
you use very bizarre 
sequences
(proteins with really strange aa compositions or DNA 
sequences
with loads of strange ambiguity codes) you might confuse 
the
program.  It is difficult to do but be
careful.
 
 
 
 
 
      ******************************
      * 
MULTIPLE ALIGNMENT MENU.  *
      ******************************
 
The
multiple alignment menu is shown below. 
Before explaining how 
to
use it, you must be introduced briefly to the alignment strategy. 
If
you do not follow this, try using option 1 anyway; the entire 
process
will be carried out automatically.
 
To
do a complete multiple alignment, we need to know the approximate 
relationships
of the sequences to each other (which ones are most 
similar
to each other).  We do this by
calculating a crude 
phylogenetic
tree which we call a dendrogram (to distinguish it from 
the
more sensitive trees available under the phylogenetic tree 
menu).   This dendrogram is used as a guide to align
bigger and 
bigger
groups of sequences during the multiple alignment.  The 
dendrogram
is calculated in 2 stages: 1) all pairs of sequence are 
compared
using the fast/approximate method of Wilbur and Lipman 
(1983);
the result of each comparison is a similarity score. 2) the 
similarity
scores are used to construct the dendrogram using the 
UPGMA
cluster analysis method of Sneath and Sokal (1973).  
 
The
construction of the dendrogram can be very time consuming if you 
wish
to align many sequences (e.g. for 100 sequences you need to 
carry
out 100x99/2 sequence comparisons = 4950). During every 
multiple
alignment, a dendrogram is constructed and saved to a file 
(something.dnd).  These can be reused later.
 
 
 
 
 
 
 
 
******Multiple*Alignment*Menu******
 
 
    1. 
Do complete multiple alignment now
    2. 
Produce dendrogram file only
    3. 
Use old dendrogram file
    4. 
Pairwise alignment parameters
    5. 
Multiple alignment parameters
    6. 
Output format options
 
    S. 
Execute a system command
    H. 
HELP
    or press [RETURN] to go back to main menu
 
 
Your
choice: 
 
 
So,
if in doubt, and you have already loaded some sequences from the 
main
menu, just try option 1 and press the <Return> key in response 
to
any questions.  You will be prompted for
2 file names e.g. if the 
sequence
input file was called DRINK.PEP, you will be offered 
DRINK.ALN
as the file to contain the alignment and DRINK.DND for the 
dendrogram.  
 
If
you wish to repeat a multiple alignment (e.g. to experiment with 
different
gap penalties) but do not wish to make a dendrogram all 
over
again use menu item 3(providing you areÊusing the same 
sequences).  Similarly, menu item 2 allows you to produce
the 
dendrogram
file only.
 
 
 
 
PAIRWISE
ALIGNMENT PARAMETERS:     
 
The
parameters that control the initial fast/approximate comparisons 
can
be set from menu item 4 which looks like:
 
 
 ********* WILBUR/LIPMAN PAIRWISE ALIGNMENT
PARAMETERS *********
 
 
     1. Toggle Scoring Method  :Percentage
     2. Gap Penalty            :3
     3. K-tuple                :1
     4. No. of top diagonals   :5
     5. Window size            :5
 
     H. HELP
 
 
Enter
number (or [RETURN] to exit): 
 
 
 
The
similarity scores are calculated from fast alignments generated 
by
the method of Wilbur and Lipman (1983). 
These are 'hash' or 
'word'
or 'k-tuple' alignments carried out in 3 stages.  
 
First
you mark the positions of every fragment of sequence, K-tuple 
long
(for proteins, the default length is 1 residue, for DNA it is 2 
bases)
in both sequences.  Then you locate all
k-tuple matches 
between
the 2 sequences.   At this stage you
have to imagine a dot-
matrix
plot between the 2 sequences with each k-tuple match as a 
dot.   You find those diagonals in the plot with
most matches (you 
take
the "No. of top diagonals" best ones) and mark all diagonals 
within
"Window size" of each top diagonal. 
This process will define 
diagonal
bands in the plot where you hope the most likely regions of 
similarity
will lie.  
 
The
final alignment stage is to find that head to tail arrangement 
of
k-tuple matches from these diagonal regions that will give the 
highest
score.  The score is calculated as the
number of exactly 
matching
residues in this alignment minus a "gap penalty" for every 
gap
that was introduced.  When you toggle
"Scoring method" you 
choose
between expressing these similarity scores as raw scores or 
expressed
as a percentage of the shorter sequence length.  
 
K-TUPLE
SIZE:   Can be 1 or 2 for proteins; 1 to
4 for DNA.  
Increase
this to increase speed; decrease to improve sensitivity.
 
GAP
PENALTY:    The number of matching
residues that must be found 
in
order to introduce a gap.  This should
be larger than K-Tuple 
Size.  This has little effect on speed or
sensitivity.
 
NO.
OF TOP DIAGONALS:    The number of best
diagonals in the 
imaginary
dot-matrix plot that are considered.  Decrease
(must be 
greater
than zero) to increase speed; increase to improve 
sensitivity.
 
WINDOW
SIZE:    The number of diagonals around
each "top" diagonal 
that
are considered.   Decrease for speed;
increase for greater 
sensitivity.
 
SCORING
METHOD: The similarity scores may be expressed as raw scores 
(number
of identical residues minus a "gap penalty" for each gap) or 
as
percentage scores.  If the sequences are
of very different 
lengths,
percentage scores make more sense.
 
 
 
CHANGING
THE PAIRWISE ALIGNMENT PARAMETERS
 
The
main reason for wanting to change the above parameters is SPEED 
(especially
on microcomputers), NOT SENSITIVITY.  
The dendrograms 
that
are produced can only show the relationships between the 
sequences
APPROXIMATELY because the similarity scores are calculated 
from
seperate pairwise alignments; not from a multiple alignment 
(that
is what we eventually hope to produce). 
If the groupings of 
the
sequences are "obvious", the above method should work well; if 
the
relationships are obscure or weakly represented by the data, it 
will
not make much difference playing with the parameters.  The main 
factor
influencing speed is the K-TUPLE SIZE followed by the WINDOW 
SIZE.  
 
The
alignments are carried out in a small amount of memory.  
Occasionally
(it is hard to predict), you will run out of memory 
while
doing these alignments; when this happens, it will say on the 
screen:
"Sequences (a,b) partially aligned" (instead of "Sequences 
(a,b)
aligned").  This means that the
alignment score for these 
sequences
will be approximate;  it is not a
problem unless many of 
the
alignments do this.  It can be fixed by
using less sensitive 
parameters
or increasing parameter FSIZE in clustalv.h .
 
 
THE
DENDROGRAM ITSELF
 
The
similarity scores generated by the fast comparison of all the 
sequences
are used to construct a dendrogram by the UPGMA method of 
Sneath
and Sokal (1973).  This is a form of
cluster analysis and the 
end
result produces something that looks like a tree.  It represents 
the
similarity of the sequences as a hierarchy. 
The dendrogram is 
written
to a file in a machine readable format and is ahown below 
for
an example with 6 sequences.
 
 
    91.0  
0   0   2   012000         ! seq 2 joins seq 3 at 91% ID.
    72.0  
1   0   3   011200         ! seq 4 joins seqs 2,3 at 72%
    71.1  
0   0   2   000012         ! seq 5 joins seq 6 at 71%
    35.5  
0   2   4   122200         ! seq 1 joins seqs 2,3,4
    21.7  
4   3   6   111122         ! seqs 1,2,3,4 join seqs 5,6
 
This
LOOKS complicated but you do not normally need to care what is 
in
here.  Anyway, each row represents the
joining together of 2 or 
more
sequences.  You progress from the top
down, joining more and 
more
sequences until all are joined together; for N sequences you 
have
N-1 groupings hence there are 5 rows in the above file (there 
were
6 sequences).  In each row, the first
number is the similarity 
score
of this grouping; ignore the next three columns for the 
moment;
the last 6 digits in the line show which sequences are 
grouped;
there is one digit for each sequence (the first digit is 
for
the first sequence).  The rule is:  in each row, all of the "1"s 
join
all of the "2"s; the zero's do nothing.   
 
Hence,
in the first row, sequence 2 joins sequence 3 at a similarity 
level
of 91% identity; next, sequence 4 joins the previous grouping 
of
2 plus 3 at a level of 72% etc.   This
is shown diagrammatically 
below.  Before leaving the dendrogram format, the
other 3 columns of 
numbers
are: a pointer to the row from which the "1" sequences were 
last
joined (or zero if only one of them); a pointer to the row in 
which
the "2"s were last joined; the total number of sequences 
joined
in this line.
 
 
 
 
                      I------ 2
               I------I
               I      I------ 3  Diagram of
the sequence similarity 
          I----I
          I   
I------------- 4  relationships
shown in the above 
       I--I
       I 
I------------------ 1  dendrogram
file (branch lengths are
   ----I
       I      
I------------- 5  not to scale).
       I-------I
               I------------- 6
 
 
 
 
 
 
 
 
 
MULTIPLE
ALIGNMENT PARAMETERS:
 
 
Having
calculated a dendrogram between a set of sequences, the final 
multiple
alignment is carried out by a series of alignments of 
larger
and larger groups of sequences.  The
order is determined by 
the
dendrogram so that the most similar sequences get aligned first.  
Any
gaps that are introduced in the early alignments are fixed.  
When
two groups of sequences are aligned against each other, a full 
protein
weight matrix (such as a Dayhoff PAM 250) is used.  Two gap 
penalties
are offered: a "FIXED" penalty for opening up a gap and a 
"FLOATING"
penalty for extending a gap.  
 
 
 ********* MULTIPLE ALIGNMENT PARAMETERS
*********
 
 
     1. Fixed Gap Penalty       :10
     2. Floating Gap Penalty    :10
     3. Toggle Transitions (DNA):Weighted
     4. Protein weight matrix   :PAM 250
 
     H. HELP
 
 
Enter
number (or [RETURN] to exit): 
 
 
FIXED
GAP PENALTY:   Reduce this to encourage
gaps of all sizes; 
increase
it to discourage them.   Terminal gaps
are penalised same 
as
all others.  BEWARE of making this too
small (approx 5 or so); if 
the
penalty is too small, the program may prefer to align each 
sequence
opposite one long gap.
 
FLOATING
GAP PENALTY:   Reduce this to encourage
longer gaps; 
increase
it to shorten them.   Terminal gaps are
penalised same as 
all
others.  BEWARE of making this too small
(approx 5 or so); if 
the
penalty is too small, the program may prefer to align each 
sequence
opposite one long gap.
 
 
DNA
TRANSITIONS = WEIGHTED or UNWEIGHTED:  
By default, transitions 
(A
versus G; C versus T) are weighted more strongly than 
transversions
(an A aligned with a G will be preferred to an A 
aligned
with a C or a T).  You can make all
pairs of nucleotide 
equally
weighted with this option.
 
PROTEIN
WEIGHT MATRIX:  For protein comparisons,
a weight matrix is 
used
to differentially weight different pairs of aligned amino 
acids.  The default is the well known Dayhoff PAM
250 matrix.  We 
also
offer a PAM 100 matrix, an identity matrix (all weights are the 
same
for exact matches) or allow you to give the name of a file with 
your
own matrix.  The weight matrices used by
Clustal V are shown in 
full
in the Algorithms and References section of this documentation.  
 
If
you input a matrix from a file, it must be in the following 
format.  Use a 20x20 matrix only (entries for the 20
"normal" amino 
acids
only; no ambiguity codes etc.).  Input
the lower left triangle 
of
the matrix, INCLUDING the diagonal.  The
order of the amino acids 
(rows
and columns) must be: CSTPAGNDEQHRKMILVFYW. 
The values can be 
in
free format seperated by spaces (not commas). 
The PAM 250 matrix 
is
shown below in this format.
 
  12 
   0  2
  -2 
1  3 
  -3 
1  0  6 
  -2 
1  1  1  2 
  -3 
1  0 -1  1  5 
  -4 
1  0 -1  0  0  2 
  -5 
0  0 -1  0 
1  2  4 
  -5 
0  0 -1  0  0  1 
3  4 
  -5 -1 -1 
0  0 -1  1  2  2  4 
  -3 -1 -1 
0 -1 -2  2  1  1  3  6 
  -4  0
-1  0 -2 -3  0 -1 -1  1  2  6 
  -5 
0  0 -1 -1 -2  1 
0  0  1  0  3  5 
  -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0 
0  6 
  -2 -1 
0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 
2  5 
  -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4 
2  6 
  -2 -1 
0 -1  0 -1 -2 -2 -2 -2 -2 -2
-2  2 
4  2  4 
  -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0 
1  2 -1  9 
   0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10 
  -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17 
 
Values
must be integers and can be all positive or positive and 
negative
as above.  These are SIMILARITY
values.  
 
 
 
 
ALIGNMENT
OUTPUT OPTIONS:
      
By
default, the alignment goes to a file in a self explanatory 
"blocked"
alignment format.  This format is fine
for displaying the 
results
but requires heavy editing if you wish to use the alignment 
with
other software.  To help, we provide 3
other formats which can 
be
turned on or off.  If you have a
sequence data set or alignment 
in
memory, you can also ask for output files in whatever formats are 
turned
on, NOW.  The menu you use to choose
format is shown below.
 
***
We
draw your attention to NBRF/PIR format in particular.  This 
format
is EXACTLY the same as one of the input formats.  Therefore, 
alignments
written in this format can be used again as input (to the 
profile
alignments or phylogenetic trees).
***
 
 
 ********* Format of Alignment Output *********
 
 
     1. Toggle CLUSTAL format output   = 
ON
     2. Toggle NBRF/PIR format output  = 
OFF
     3. Toggle GCG format output       = 
OFF
     4. Toggle PHYLIP format output    = 
OFF
 
     5. Create alignment output file(s) now?
     H. HELP
 
 
Enter
number (or [RETURN] to exit): 
 
 
 
CLUSTAL
FORMAT:     This is a self explanatory
alignment.  The 
alignment
is written out in blocks.  Identities
are highlighted and 
(if
you use a PAM 250 matrix) positions in the alignment where all 
of
the residues are "similar" to each other (PAM 250 score of 8 or 
more)
are indicated.
 
NBRF/PIR
FORMAT:    This is the usual NBRF/PIR
format with gaps 
indicated
by hyphens ("-"). AS we have stressed before, this format 
is
EXACTLY compatible with the sequence input format.  Therefore you 
can
read in these alignments again for profile alignments or for 
calculating
phylogenetic trees.  
 
GCG
FORMAT:         In version 7 of the
Wisconsin GCG package, a new 
multiple
sequence format was introduced.  This is
the MSF (Multiple 
Sequence
Format) format.  It can be used as input
to the GCG 
sequence
editor or any of the GCG programs that make use of multiple 
alignments.   THIS FORMAT IS ONLY SUPPORTED IN VERSION 7
OF THE GCG 
PACKAGE
OR LATER.  
 
PHYLIP
FORMAT:      This format can be used by
the Phylip package of 
Joe
Felsenstein (see the references/algorithms section for details 
of
how to get it).  Phylip allows you to do
a huge range of 
phylogenetic
analyses (we just offer one method in this program) and 
is
probably the most widely used set of programs for drawing trees.
It
also works on just about every computer you can think of, 
providing
you have a decent Pascal compiler.
 
 
 
 
 
      ******************************
      *   PROFILE ALIGNMENT
MENU.  *
      ******************************
 
 
 
This
menu is for taking two old alignments (or single sequences) and 
aligning
them with each other.  The result is one
bigger alignment.  
The
menu is very similar to the multiple alignment menu except that 
there
is no mention of dendrograms here (they are not needed) and 
you
need to input two sets of sequences. 
The menu looks like this:
 
 
 
******Profile*Alignment*Menu******
 
 
    1. 
Input 1st. profile/sequence
    2.  Input 2nd.
profile/sequence
    3. 
Do alignment now
    4. 
Alignment parameters
    5. 
Output format options
 
    S. 
Execute a system command
    H. 
HELP
    or press [RETURN] to go back to main menu
 
 
Your
choice: 
 
 
You
must input profile number 1 first.  
When both profiles are 
loaded,
use item 3 (Do alignment now) and the 2 profiles will be 
aligned.  Items 4 and 5 (parameters and output
options) are 
identical
to the equivalent options on the multiple alignment menu.  
 
The
same input routines that are used for general input are used 
here
i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA 
format,
with gaps indicated by hyphens ("-").  This is why we have 
continualy
drawn your attention to the NBRF/PIR format as a useful 
output
format.  
 
Either
profile can consist of just one sequence.  
Therefore, if you 
have
a favourite alignment of sequences that you are working on and 
wish
to add a new sequence, you can use this menu, provided the 
alignment
is in the correct format.  
 
The
total number of sequences in the two profiles must be less less 
than
or equal to the MAXN parameter set in the clustalv.h header 
file.  
 
 
 
 
 
 
 
 
 
 
 
      ******************************
      *  
PHYLOGENETIC TREE MENU.  *
      ******************************
 
 
This
menu allows you to input an alignment and calculate a 
phylogenetic
tree.  You can also calculate a tree if
you have just 
carried
out a multiple alignment and the alignment is still in 
memory.  THE SEQUENCES MUST BE ALIGNED
ALREADY!!!!!!   The tree will 
look
strange if the sequences are not already aligned.  You can also 
"BOOTSTRAP"
the tree to show confidence levels for groupings.  This 
is
SLOW on microcomputers but works fine on workstations or 
mainframes.
 
 
 
******Phylogenetic*tree*Menu******
 
 
    1. 
Input an alignment
    2. 
Exclude positions with gaps?       
= OFF
    3. 
Correct for multiple substitutions? = OFF
    4. 
Draw tree now
    5. 
Bootstrap tree
 
    S. 
Execute a system command
    H. 
HELP
    or press [RETURN] to go back to main menu
 
 
Your
choice: 
 
 
 
 
The
same input routine that is used for general input is used here 
i.e.
sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA format, 
with
gaps indicated by hyphens ("-"). 
This is why we have 
continualy
drawn your attention to the NBRF/PIR format as a useful 
output
format.  
 
If
you have input an alignment, then just use item 4 to draw a tree.  
The
method used is the Neighbor Joining method of Saitou and Nei 
(1987).  This is a "distance method".
First, percent divergence 
figures
are calculated between all pairs of sequence. 
These 
divergence
figures are then used by the NJ method to give the tree.  
Example
trees will be shown below.  
 
There
are two options which can be used to control the way the 
distances
are calculated.  These are set by
options 2 and 3 in the 
menu.  
 
EXCLUDE
POSITIONS WITH GAPS?   This option
allows you to ignore all 
alignment
positions (columns) where there is a gap in ANY sequence.  
This
guarantees that "like" is compared with "like" in all
distances 
i.e.
the same positions are used to calculate all distances.  It 
also
means that the distances will be "metric".  The disadvantage of 
using
this option is that you throw away much of the data if there 
are
many gaps.  If the total number of gaps
is small, it has little 
effect.  
 
CORRECT
FOR MULTIPLE SUBSTITUTIONS?    As
sequences diverge, 
substitutions
accumulate.  It becomes increasingly
likely that more 
than
one substitution (as a result of a mutation) will have happened 
at
a site where you observe just one difference now.  This option 
allows
you to use formulae developed by Motoo Kimura to correct for 
this
effect.  It has the effect of stretching
long branches in tres 
while
leaving short ones relatively untouched. 
The desired effect 
is
to try and make distances proportional to time since divergence.  
 
The
tree is sent to a file called BLAH.NJ, where BLAH.SEQ is the 
name
of the input, alignment file.  An
example is shown below for 6 
globin
sequences.  
 
 
 
 DIST  
= percentage divergence (/100)
 Length = number of sites used in comparison
 
   1 vs.  
2  DIST = 0.5683;  length =   
139
   1 vs.  
3  DIST = 0.5540;  length =   
139
   1 vs.  
4  DIST = 0.5315;  length =   
111
   1 vs.  
5  DIST = 0.7447;  length =   
141
   1 vs.  
6  DIST = 0.7571;  length =   
140
   2 vs.  
3  DIST = 0.0897;  length =   
145
   2 vs.  
4  DIST = 0.1391;  length = 
  115
   2 vs.  
5  DIST = 0.7517;  length =   
145
   2 vs.  
6  DIST = 0.7431;  length =   
144
   3 vs.  
4  DIST = 0.0957;  length =   
115
   3 vs.  
5  DIST = 0.7379;  length =   
145
   3 vs.  
6  DIST = 0.7361;  length =   
144
   4 vs.  
5  DIST = 0.7304;  length =   
115
   4 vs.  
6  DIST = 0.7368;  length =   
114
   5 vs.  
6  DIST = 0.2697;  length =   
152
 
 
                  Neighbor-joining
Method
 
 Saitou, N. and Nei, M. (1987) The
Neighbor-joining Method:
 A New Method for Reconstructing Phylogenetic
Trees.
 Mol. Biol. Evol., 4(4), 406-425
 
 
 This is an UNROOTED tree
 
 Numbers in parentheses are branch lengths
 
 
 Cycle  
1     =  SEQ:   5 (  0.13382) joins  SEQ:   6 (  0.13592)
 
 Cycle  
2     =  SEQ:   1 (  0.28142) joins Node:   5 ( 
0.33462)
 
 Cycle  
3     =  SEQ:   2 (  0.05879) joins  SEQ:   3 (  0.03086)
 
 Cycle  
4 (Last cycle, trichotomy):
 
             Node:   1 (  0.20798) joins
             Node:   2 (  0.02341) joins
             
SEQ:   4 (  0.04915) 
 
 
 
The
output file first shows the percent divergence (distance) 
figures
between each pair of sequence.  Then a
description of a NJ 
tree
is given.  This description shows which
sequences (SEQ:) or 
which
groups of sequences (NODE: , a node is numbered using the 
lowest
sequence that belongs to it) join at each level of the tree.  
 
This
is an unrooted tree!! This means that the direction of 
evolution
through the tree is not shown.  This can
only be inferred 
in
one of two ways:  
1)
assume a degree of constancy in the molecular clock and place the 
root
(bottom of the tree; the point where all the sequences radiate 
from)
half way along the longest branch.    
**OR**
2)
use an "outgroup", a sequence from an organism that you
"know" 
must
be outside of the rest of the sequences i.e. root the tree 
manually,
on biological grounds.
 
The
above tree can be represented diagramatically as follows:
 
 
                          SEQ 1       SEQ 4
                           I           I
          13.6             I 28.1      I 4.9          5.9
  SEQ 6 ----------I        I           I          I--------- SEQ 2
                  I        I           I          I
                 
I--------I-----------I----------I
          13.4    I  33.5      20.8         2.3   I   3.1
  SEQ 5 ----------I                               I--------- SEQ 3
 
 
The
figures along each branch are percent divergences along that 
branch.  If you root the tree by placing the root
along the longest 
branch
(33.5%) then you can draw it again as follows, this time 
rooted:
 
 
 
                        13.6
                I-------------------- SEQ 6
      I---------I       13.4
      I         I-------------------- SEQ 5
      I 33.5 
 -----I                 28.1
      I         I-------------------- SEQ 1
      I         I
      I---------I                4.9
                I  20.8  I----------- SEQ 4
                I--------I  
                         I       5.9
                         I 2.3 I----- SEQ 2
                         I-----I 3.1
                               I----- SEQ 3
 
 
 
The
longest branch (33.5% between 5,6 and 1,2,3,4) is split between 
the
2 bottom branches of the tree.  As it
happens in this particular 
case,
sequences 5 and 6 are myoglobins while sequences 1,2,3 and 4 
are
alpha and beta globins, so you could also justify the above 
rooting
on biological grounds.  If you do not
have any particular 
need
or evidence for the position of the root, then LEAVE THE TREE 
UNROOTED.  Unrooted trees do not look as pretty as
rooted ones but 
it
is uaual to leave them unrooted if you do not have any evidence 
for
the position of the root.
 
 
BOTSTRAPPING:    Different sets of sequences and different
tree 
drawing
methods may give different topologies (branching orders) for 
parts
of a tree that are weakly supported by the data.  It is useful 
to
have an indication of the degree of error in the tree.  There are 
several
ways of doing this, some of them rather technical.  We 
provide
one general purpose method in this program, which makes use 
of
a technique called bootstrapping (see Felsenstein, 1985).
 
In
the case of sequence alignments, bootstrapping involves taking 
random
samples of positions from the alignment. 
If the alignment 
has
N positions, each bootstrap sample consists of a random sample 
of
N positions, taken WITH REPLACEMENT i.e. in any given sample, 
some
sites may be sampled several times, others not at all.  Then, 
with
each sample of sites, you calculate a distance matrix as usual 
and
draw a tree.  If the data very strongly
support just one tree 
then
the sample trees will be very similar to each other and to the 
original
tree, drawn without bootstrapping. 
However, if parts of 
the
tree are not well supported, then the sample trees will vary 
considerably
in how they represent these parts.
 
In
practice, you should use a very large number of bootstrap 
replicates
(1000 is recommended, even if it means running the 
program
for an hour on a slow microcomputer; on a workstation it 
will
be MUCH faster).  For each grouping on
the tree, you record the 
number
of times this grouping occurs in the sample trees.  For a 
group
to be considered "significant" at the 95% level (or P <= 0.05 
in
statistical terms) you expect the grouping to show up in >= 95% 
of
the sample trees.  If this happens, then
you can say that the 
grouping
is significant, given the data set and the method used to 
draw
the tree.  
 
So,
when you use the bootstrap option, a NJ tree is drawn as before 
and
then you are asked to say how many bootstrap samples you want 
(1000
is the default) and you are asked to give a seed number for 
the
random number generator.  If you give
the same seed number in 
future,
you will get the same results (we hope). 
Remember to give 
different
seed numbers if you wish to carry out genuinely different 
bootstrap
sampling experiments.  Below is the
output file from using 
the
same data for the 6 globin sequences as used before.  The output 
file
has the same name as the input fike with the extension ".njb".
 
//
STUFF
DELETED  .... same as for the ordinary
NJ output
//
                  Bootstrap
Confidence Limits
 
 
 Random number generator seed =      99
 
 Number of bootstrap trials   =   
1000
 
 
 Diagrammatic representation of the above
tree: 
 
 Each row represents 1 tree cycle; defining 2
groups.
 
 Each column is 1 sequence; the stars in each
line show 1 group; 
 the dots show the other
 
 Numbers show occurences in bootstrap samples.
 
****..   1000              
.***..   1000                <- This is the answer!!
*..***    812 
122311
 
 
For
an unrooted tree with N sequences, there are actually only N-3 
genuinely
different groupings that we can test (this is the number 
of
"internal branches"; each internal branch splits the sequences 
into
2 groups).  In this example, we have 6
sequences with 3 
internal
branches in the reference tree.  In the
bootstrap 
resampling,
we count how often each of these internal branches 
occur.  Here, we find that the branch which splits
1,2,3 and 4 
versus
1 and 2 occurs in all 1000 samples; the branch which splits 
2,3
and 4 versus 1,5 and 6 occurs in 1000; the branch which splits 2 
and
3 versus 1,4,5 and 6 occurs in 812/1000 samples.  We can put 
these
figures on to the diagrammatic representation we made earlier 
of
our unrooted NJ tree as follows:
 
 
 
                          SEQ 1       SEQ 4
                           I           I
                           I           I            
  SEQ 6 ----------I        I           I          I--------- SEQ 2
                  I  1000  I   1000   
I   812    I
                 
I--------I-----------I----------I
                  I                               I   
  SEQ 5 ----------I                               I--------- SEQ 3
 
 
 
You
can equally put these confidence figures on the rooted tree (in 
fact
the interpretation is simpler with rooted trees).  With the 
unrooted
tree, the grouping of sequence 5 with 6 is significant (as 
is
the grouping of sequences 1,2,3 and 4). 
Equally the grouping of 
sequences
1,5 and 6 is significant (the same as saying that 2,3 and 
4
group significantly).  However, the
grouping of 2 and 3 is not 
significant,
although it is relatively strongly supported. 
 
Unfortunately,
there is a small complication in the interpretation 
of
these results.  In statistical
hypothesis testing, it is not 
valid
to make multiple simultaneous tests and to treat the result of 
each
test completely independantly.  In the
above case, if you have 
one
particular test (grouping) that you wish to make in advance, it 
is
valid to test IT ALONE and to simply show the other bootstrap 
figures
for reference.  If you do not have any
particular test in 
mind
before you do the bootstrapping, you can just show all of the 
figures
and use the 95% level as an ARBITRARY cut off to show those 
groups
that are very strongly supported; but not mention anything 
about
SIGNIFICANCE testing.  In the
literature, it is common 
practice
to simply show the figures with a tree; they frequently 
speak
for themselves.  
 
 
 
*******************************************************************
 
            4.  Command Line Interface.
 
 
 
You
can do almost everything that can be done from the menus, using 
a
command line interface. In this mode, the program will take all of 
its
instructions as "switches" when you activate it; no questions 
will
be asked; if there are no errors, the program just does an 
analysis
and stops.   It does not work so well on
the MAC but is 
still
possible.  To get you started we will
show you the 2 simplest 
uses
of the command line as it looks on VAX/VMS. 
On all other 
machines
(except the MAC) it works in the same way.
 
$
clustalv /help           **OR**   $ clustalv /check
 
Both
of the above switches give you a one page summary of the 
command
line on the screen and then the program stops. 
 
 
$
clustalv proteins.seq    **OR**   $ clustalv /infile=proteins.seq    
 
This
will read the sequences from the file 'proteins.seq' and do a 
complete
multiple alignment.  Default parameters
will be used, the 
program
will try to tell whether or not the sequences are DNA or 
protein
and the output will go to a file called 'proteins.aln' . A 
dendrogram
file called 'proteins.dnd' will also be created.  Thus 
the
default action for the program, when it successfully reads in an 
input
file is to do a full multiple alignment. 
Some further 
examples
of command line usage will be given leter.
 
Command
line switches can be abbreviated but MAKE SURE YOU DO NOT 
MAKE
THEM AMBIGUOUS.  No attempt will be made
to detect ambiguity.  
Use
enough characters to distinguish each switch uniquely.
 
 
 
 
 
 
 
The
full list of allowed switches is given below:
 
 
                DATA (sequences)
 
/INFILE=file.ext    :input sequences.  If you give an input file and 
                        nothing else as a switch, the default
action is 
                        to do a complete multiple alignment.  The input 
                        file can also be specified by giving it as
the 
                        first command line parameter with no
"/" in    
                        front of it e.g $ clustalv file.ext  .
 
/PROFILE1=file.ext      :You use these two switches to give the
names of  
/PROFILE2=file.ext      two profiles.  The default action is to align 
                        the two. You must give the names of both
profile 
                        files. 
 
 
 
                VERBS (do things)
 
/HELP             :list the command line parameters on the
screen.
/CHECK           
                
/ALIGN              :do
full multiple alignment.  This is the
default     
                  action
if no other switches except for input files 
                  are
given.
 
/TREE            :calculate
NJ tree.  If this is the only action       
                  specified
(e.g. $ clustalv proteins.seq/tree ) it IS 
                  ASSUMED
THAT THE SEQUENCES ARE ALREADY ALIGNED. 
If 
                  the
sequences are not already aligned, you should      
                  also
give the /ALIGN switch.  This will align
the   
                  sequences
first, output an alignment file and    
                  calculate
the tree in memory. 
 
/BOOTSTRAP(=n)      :bootstrap a NJ tree (n= number of
bootstraps;       
                  default
= 1000).  If this is the only action            
                  specified
(e.g. $ clustalv proteins.seq/bootstrap ) 
                  it
IS ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED.  
                  If
the sequences are not already aligned, you should 
                  also
give the /ALIGN switch.  This will align
the   
                  sequences
first, output an alignment file and    
                  calculate
the bootstraps in memory.  You can set
the 
                  number
of bootstrap trials here (e.g./bootstrap=500). 
                  You
can set the seed number for the random number      
                  generator
with /seed=n.
 
 
 
                PARAMETERS (set things)
 
***Pairwise
alignments:***
 
/KTUP=n            :word
size              
    
/TOPDIAGS=n        :number
of best diagonals
 
/WINDOW=n          :window
around best diagonals 
 
/PAIRGAP=n         :gap
penalty
 
 
 
***Multiple
alignments:***
 
/FIXEDGAP=n        :fixed
length gap pen.  
    
/FLOATGAP=n        :variable
length gap pen.
 
/MATRIX=           :PAM100
or ID or file name. The default weight matrix 
                  for
proteins is PAM 250.
 
/TYPE=p
or d       :type is protein or DNA.   This allows you to      
                  explicitely
overide the programs attempt at guessing 
                  the
type of the sequence.  It is only useful
if you 
                  are
using sequences with a VERY strange composition.
 
/OUTPUT=           :GCG
or PHYLIP or PIR.  The default output is
  
                  Clustal
format.
    
/TRANSIT           :transitions
not weighted.  The default is to weight 
                  transitions
as more favourable than other mismatches 
                  in
DNA alignments.  This switch makes all
nucleotide 
                  mismatches
equally weighted.
 
 
***Trees:***                             
 
/KIMURA            :use
Kimura's correction on distances.   
 
/TOSSGAPS          :ignore
positions with a gap in ANY sequence.
 
/SEED=n            :seed
number for bootstraps.
 
 
 
 
EXAMPLES:
 
These
examples use the VAX/VMS $ prompt; otherwise, command-line 
usage
is the same on all machines except the Macintosh.
 
 
$
clustalv proteins.seq      OR     $ clustalv /infile=proteins.seq
 
Read
whatever sequences are in the file "proteins.seq" and do a full 
multiple
alignment; output will go to the files: "proteins.dnd" 
(dendrogram)
and "proteins.aln" (alignment).
 
 
$
clustalv proteins.seq/ktup=2/matrix=pam100/output=pir
 
Same
as last example but use K-Tuple size of 2; use a PAM 100 
protein
weight matrix; write the alignment out in NBRF/PIR format 
(goes
to a file called "proteins.pir").
 
 
$
clustalv /profile1=proteins.seq/profile2=more.seq/type=p/fixed=11
 
Take
the alignment in "proteins.seq" and align it with
"more.seq" 
using
default values for everything except the fixed gap penalty 
which
is set to 11.  The sequence type is
explicitely set to 
PROTEIN.
 
 
$
clustalv proteins.pir/tree/kimura
 
Take
the sequences in proteins.pir (they MUST BE ALIGNED ALREADY) 
and
calculate a phylogenetic tree using Kimura's correction for 
distances.  
 
 
$
clustalv proteins.pir/align/tree/kimura
 
Same
as the previous example, EXCEPT THAT AN ALIGNMENT IS DONE 
FIRST.
 
 
$
clustalv proteins.seq/align/boot=500/seed=99/tossgaps/type=p
 
Take
the sequences in proteins.seq; they are explicitely set to be 
protein;
align them; bootstrap a tree using 500 samples and a seed 
number
of 99.
 
 
*******************************************************************
 
            5.  Algorithms and references.
 
 
 
In
this section, we will try to BRIEFLY describe the algorithms used 
in
ClustalV and give references.  The
topics covered are:
 
 
      -Multiple alignments
 
      -Profile alignments
 
      -Protein weight matrices
 
      -Phylogenetic trees
 
            -distances
 
            -NJ
method
 
            -Bootstrapping
 
            -Phylip
 
      -References
 
 
 
 
 
 
MULTIPLE
ALIGNMENTS.
 
The
approach used in ClustalV is a modified version of the method of 
Feng
and Doolittle (1987) who aligned the sequences in larger and 
larger
groups according to the branching order in an initial 
phylogenetic
tree.  This approach allows a very
useful combination 
of
computational tractability and sensitivity. 
 
The
positions of gaps that are generated in early alignments remain 
through
later stages.  This can be justified
because gaps that arise 
from
the comparison of closely related sequences should not be moved 
because
of later alignment with more distantly related sequences.  
At
each alignment stage, you align two groups of already aligned 
sequences.  This is done using a dynamic programming
algorithm where 
one
allows the residues that occur in every sequence at each 
alignment
position to contribute to the alignment score. 
A Dayhoff 
(1978)
PAM matrix is used in protein comparisons.
 
The
details of the algorithm used in ClustalV have been published in 
Higgins
and Sharp (1989).  This was an improved
version of an 
earlier
algorithm published in Higgins and Sharp (1988).  First, you 
calculate
a crude similarity measure between every pair of sequence.  
This
is done using the fast, approximate alignment algorithm of 
Wilbur
and Lipman (1983).  Then, these scores
are used to calculate 
a
"guide tree" or dendrogram, which will tell the multiple alignment 
stage
in which order to align the sequences for the final multiple 
alignment.  This "guide tree" is calculated
using the UPGMA method 
of
Sneath and Sokal (1973).  UPGMA is a
fancy name for one type of 
average
linkage cluster analysis, invented by Sokal and Michener 
(1958).  
 
Having
calculated the dendrogram, the sequences are aligned in 
larger
and larger groups.  At each alignment
stage, we use the 
algorithm
of Myers and Miller (1988) for the optimal alignments.  
This
algorithm is a very memory efficient variation of Gotoh's 
algorithm
(Gotoh, 1982).  It is because of this
algorithm that 
ClustalV
can work on microcomputers.   Each of
these alignments 
consists
of aligning 2 alignments, using what we call "profile 
alignments".
 
 
PROFILE
ALIGNMENTS.
 
We
use the term "profile alignment" to describe the alignment of 2 
alignments.  We use this term because the method is a
simple 
extension
of the profile method of Gribskov, et al. (1987) for 
aligning
1 sequence with an alignment.  Normally,
with a 2 sequence 
alignment,
you use a weight matrix (e.g. a PAM 250 matrix) to give a 
score
between the pairs of aligned residues. 
The alignment is 
considered
"optimal" if it gives the best total score for aligned 
residues
minus penalties for any gaps (insertions or deletions) that 
must
be introduced.  
 
Profile
alignments are a simple extension of 2 sequence alignments 
in
that you can treat each of the two input alignments as single 
sequences
but you calculate the score at aligned positions as the 
average
weight matrix score of all the residues in one alignment 
versus
all those in the other e.g. if you have 2 alignments with I 
and
J sequences respectively; the score at any position is the 
average
of all the I times J scores of the residues compared 
seperately.  Any gaps that are introduced are placed in
all of the 
sequences
of an alignment at the same position. 
The profile 
alignments
offered in the "profile alignment menu" are also 
calculated
in this way.
 
 
PROTEIN
WEIGHT MATRICES.
 
There
are 3 built-in weight matrices used by clustalV.  These are 
the
PAM 100 and PAM 250 matrices of Dayhoff (1978) and an identity 
matrix.  Each matrix is given as the bottom left
half, including the 
diagonal
of a 20 by 20 matrix.  The order of the
rows and columns is 
CSTPAGNDEQHRKMILVFYW.
 
 
PAM
250
 
C  12 
S   0  2
T  -2 
1  3 
P  -3 
1  0  6 
A  -2 
1  1  1  2 
G  -3 
1  0 -1  1  5 
N  -4 
1  0 -1  0  0  2 
D  -5 
0  0 -1  0  1  2  4 
E  -5 
0  0 -1  0  0  1 
3  4 
Q  -5 -1 -1  0  0 -1  1 
2  2  4 
H  -3 -1 -1 
0 -1 -2  2  1  1  3  6 
R  -4  0
-1  0 -2 -3  0 -1 -1  1  2  6 
K  -5 
0  0 -1 -1 -2  1 
0  0  1  0  3  5 
M  -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0 
0  6 
I  -2 -1 
0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 
2  5 
L  -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4 
2  6 
V  -2 -1 
0 -1  0 -1 -2 -2 -2 -2 -2 -2
-2  2 
4  2  4 
F  -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0 
1  2 -1  9 
Y   0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10 
W  -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17 
----------------------------------------------------------------
    C 
S  T  P  A  G 
N  D  E  Q  H  R  K 
M  I  L  V  F 
Y  W
 
 
IDENTITY
MATRIX
 
10
 0  10 
 0 
0  10 
 0 
0  0  10 
 0 
0  0  0  10 
 0 
0  0  0  1  10 
 0 
0  0  0  0  0  10
 0 
0  0  0  0  0 
0  10 
 0 
0  0  0  0  0 
0  0  10 
 0 
0  0  0  0  0 
0  0  0  10 
 0 
0  0  0  0  0 
0  0  0  0  10 
 0 
0  0  0  0  0 
0  0  0  0  0  10
 0 
0  0  0  0  0 
0  0  0  0  0  0  10 
 0 
0  0  0  0  0  0  0 
0  0  0  0  0  10
 0 
0  0  0  0  0 
0  0  0  0  0  0  0 
0  10 
 0 
0  0  0  0  0 
0  0  0  0  0  0  0 
0  0  10 
 0 
0  0  0  0  0 
0  0  0  0  0  0  0 
0  0  0  10 
 0 
0  0  0  0  0 
0  0  0  0  0  0  0 
0  0  0  0  10 
 0 
0  0  0  0  0 
0  0  0  0  0 
0  0  0  0  0 
0  0 10 
 0 
0  0  0  0  0 
0  0  0  0  0  0  0 
0  0  0  0  0  0
10
 
 
 
 
 
PAM
100
 
 14 
 -1  6 
 -5 
2   7 
 -6 
1  -1  10 
 -5 
2   2   1   6 
 -8 
1  -3  -3   1   8 
 -8 
2   0  -3  -1  -1  7
-11
-1  -2 
-4  -1  -1  4   8 
-11
-2  -3 
-3   0  -2  1   5  
8 
-11
-3  -3 
-1  -2  -5 -1   1   4  
9 
 -6 -4 
-5  -2  -5  -7  2 
-1  -2   4 11 
 -6 -1 
-4, -2  -5  -8 -3 
-6  -5   1  1 10 
-11
-2  -1 
-4  -4  -5  1  -2 
-2  -1 -3  3  8 
-11
-4  -2 
-6  -3  -8 -5  -8  -6 
-2 -7 -2  1 13 
 -5 -4 
-1  -6  -3  -7 -4  -6 
-5  -5 -7 -4  4 
2  9 
-12
-7  -5 
-5  -5  -8 -6  -9  -7 
-3 -5 -7 -6  4  2  9 
 -4 -4 
-1  -4   0  -4 -5  -6 
-5  -5 -6 -6 -6  1 
5  1  8 
-10
-5  -6 
-9  -7  -8 -6 -11 -11 -10 -4 -7-11 -2 
0  0 -5 12 
 -2 -6 
-6 -11  -6 -11 -3  -9 
-7  -9 -1-10-10 -8 -4 -5 -6  6 13 
-13
-4 -10 -11 -11 -13 -8 -13 -14 -11 -7  1
-9-11-12 -7-14 -2 -2 19 
 
 
 
 
PHYLOGENETIC
TREES.
 
There
are two COMMONLY used approaches for inferring phylogentic 
trees
from sequence data: parsimony and distance methods. There are 
other
approaches which are probably superior in theory but which are 
yet
to be used widely. This does not mean that they are no use; we 
(the
authors of this program at any rate) simply do not know enough 
about
them yet.  You should see the
documentation accompanying the 
Phylip
package and some of the references there for an explanation 
of
the different methods and what assumptions are implied when you 
use
them.   
 
There
is a constant debate in the literature as to the merits of 
different
methods but unfortunately, a lot of what is said is 
incomprehensible
or inaccurate.  It is also a field that
is prone to 
having
highly opinionated schools of thought. 
This is a pity 
because
it prevents rational discussion of the pro's and con's of 
the
different methods.  The approach adopted
in ClustalV is to 
supply
just one method and to produce alignments in a format that 
can
be used by Phylip.  In simple cases, the
trees produced will be 
as
"good" (reliable, robust) as those from ANY other method.  In 
more
complicated cases, there is no single magic recipe that we can 
supply
that will work well in even most situations.
 
The
method we provide is the Neighbor Joining method (NJ) of Saitou 
and
Nei (1987) which is a distance method. 
We use this for three 
reasons:  it is conceptually and computationally
simple; it is fast; 
it
gives "good" trees in simple cases. It is difficult to prove that 
one
tree is "better" than another if you do not know the true 
phylogeny;
the few systematic surveys of methods show it to work 
more
or less as well as any other method ON AVERAGE.  Another reason 
for
using the NJ method is that it is very commonly used; THIS IS A 
BAD
REASON SCIENTIFICALLY but at least you will not feel lonely if 
you
use it.
 
The
NJ method works on a matrix of distances (the distance matrix) 
between
all pairs of sequence to be analysed. 
These distances are 
related
to the degree of divergence between the sequences.  It is 
normal
to calculate the distances from the sequences after they are 
multiply
aligned.  If you calculate them from
seperate alignments 
(as
done for the dendrograms in another part of this program), you 
may
increase the error considerably.  
 
 
DISTANCES
 
The
simplest measure of distance between sequences is percent 
divergence
(100% minus percent identity).  For two
sequences, you 
count
how many positions differ between them (ignoring all positions 
with
a gap or an unknown residue) and divide by the number of 
positions
considered.  It is common practice to
also ignore all 
positions
in the alignment where there is a GAP in ANY of the 
sequences
(Tossgaps ? option in the menu). 
Usually, you express the 
percent
distance divided by 100 (gives distances between 0.0 and 
1.0).
 
This
measure of distance is perfectly adequate (with some further 
modification
described below) for rRNA sequences. However it treats 
all
residues identically e.g. all amino acid substitutions are 
equally
weighted. It also treats all positions identically e.g. it 
does
not take account of different rates of substitution in 
different
positions of different codons in protein coding DNA 
sequences;
see Li et al (1985) for a distance measure that does.  
Despite
these shortcomings, these percent identity distances do work 
well
in practice in a wide variety of situations. 
 
In
a simple world, you would like a distance to be proportional to 
the
time since the sequences diverged.  If
this were EXACTLY true, 
then
the calculation of the tree would be a simple matter of algebra 
(UPGMA
does this for you) and the branch lengths will be nice and 
meaningful
(times).  In practice this OBVIOUSLY
depends on the 
existence
and quality of the "molecular clock", a subject of on-
going
debate.  However, even if there is a
good clock, there is a 
further
problem with estimating divergences.  As
sequences diverge, 
they
become "saturated" with mutations. 
Sites can have 
substitutions
more than once.  Calculated distances
will 
underestimate
actual divergence times; the greater the divergence, 
the
greater the discrepancy.  There are
various methods for dealing 
with
this and we provide two commonly used ones, both due to Motoo 
Kimura;
one for proteins and one for DNA. 
 
 
For
distance K (percent divergence /100 ) ...
 
Correction
for Protein distances:  (Kimura, 1983).
 
       Corrected K = -ln(1.0 - K - (K *
k/5.0))
 
 
 
Correction
for nucleotide distances: Kimura's 2-parameter method 
(Kimura,
1980).
 
       Corrected K = 0.5*ln(a) + 0.25*ln(b)
 
       where     a = 1/(1 - 2*P - Q)
       and       b = 1/(1 - 2*Q)
 
       P and Q are the proportions of
transitions (A<-->G, C<-->T)
       and transversions occuring between the
sequences.  
 
 
One
paradoxical effect of these corrections, is that distances can 
be
corrected to have more than 100% divergence. 
That is because, 
for
very highly diverged sequences of length N, you can estimate 
that
more than N substitutions have occured by correcting the 
observed
distance in the above ways.  Don't
panic!
 
 
 
NEIGHBOR
JOINING TREES.
 
VERY
briefly, the NJ method works as follows. 
You start by placing 
the
sequences in a star topology (no internal branches).  You then 
find
that internal branch (take 2 sequences; join them; connect them 
to
the rest by the internal branch) which when added to the tree 
will
minimise the total branch length. The two joined sequences 
(neighbours)
are merged into a single sequence and the process is 
repeated.  For an unrooted tree with N sequences, there
are N-3 
internal
branches.  The above process is repeated
N-3 times to give 
the
final tree.  The full details are given
in Saitou and Nei 
(1987).
 
As
explained elsewhere in the documentation, you can only root the 
tree
by one of two methods:
 
1)
assume a degree of constancy in the molecular clock and place the 
root
along the longest branch (internal or external).  Methods that 
appear
to produce rooted trees automatically are often just doing 
this
without letting you know; this is true of UPGMA.
 
2)
root the tree on biological grounds. 
The usual method is to 
include
an "outgroup", a sequence that you are certain will branch 
to
the outside of the tree.  
 
 
 
BOOTSTRAPPING.
 
Bootstrapping
is a general purpose technique that can be used for 
placing
confidence limits on statistics that you estimate without 
any
knowledge of the underlying distribution (e.g. a normal or 
poisson
distribution).  In the case of
phylogenetic trees, there are 
several
analytical methods for placing confidence limits on 
groupings
(actually on the internal branches) but these are either 
restricted
to particular tree drawing methods or only work on small 
trees
of 4 or 5 sequences.  Felsenstein (1985)
showed how to use 
bootstrapping
to calculate confidence limits on trees. 
His approach 
is
completely general and can be applied to any tree drawing method.  
The
main assumption of the method in this context is that the sites 
in
the alignment are independant; this will be true of some sequence 
alignments
(e.g. pseudogenes) but not others (e.g. rRNA's).  What 
effect,
lack of independance will have on the results is not known.
 
The
method works by taking random samples of data from the complete 
data
set.  You compute the test statistic
(tree in this case) on 
each
sample.   Variation in the statistic
computed from the samples 
gives
a measure of variation in the statistic which can be used to 
calculate
confidence intervals.  Each random
sample is the same size 
as
the complete data set and is taken WITH REPLACEMENT i.e. a data 
point
can be selected more than once (or not at all) in any given 
sample.  
 
In
the case of an alignment N residues long, each random sample is a 
random
selection of N sites form the alignment. 
For each sample, we 
calculate
a distance matrix and tree in the usual way. 
Variation in 
the
sample trees compared to a tree calculated from the full data 
set
gives an indication of how well supported the tree is by the 
data.  If the sample trees are very similar to each
other and to the 
full
tree, then the tree is "strongly" supported; if the sample 
trees
show great variation, then the tree will be weakly supported.  
In
practice, you usually find some parts of a tree well supported, 
others
weakly.  This can be seen by counting
how often each 
monophyletic
group in the full tree occurs in the sample trees.  
 
For
a particular grouping, one considers it to be significant at the 
95%
level (P <= 0.05) if it occurs in 95% of the bootstrap samples. 
If
a grouping is significant, it is significant with respect to the 
particular
data set and method used for drawing the tree. 
Biological
"significance" is another matter.
 
 
PHYLIP.
 
The
Phylip package was written by Joe Felsenstein, University of 
Washington,
USA.  It provides Pascal source code for
a large number 
of
programs for doing most types of phylogenetic analyses.  The 
Phylip
format alignments produced by this program can be used by all 
of
the Phylip programs, version 3.4 or later (March 1991).  It is 
freely
available from him as follows.
 
 
 
=================
PHYLIP information sheet =====================
 
    PHYLIP - Phylogeny Inference Package
(version 3.3)
 
This
is a FREE package of programs for inferring phylogenies and 
carrying
out certain related tasks.  At present
it contains 28 
programs,
which carry out different algorithms on different kinds of 
data.  The programs in the package are:
 
      ---------- Programs for molecular
sequence data ----------
PROTPARS  Protein parsimony          
DNAPARS   Parsimony method for DNA
DNAMOVE   Interactive DNA parsimony  
DNAPENNY  Branch and bound for DNA
DNABOOT   Bootstraps DNA parsimony   
DNACOMP   Compatibility for DNA
DNAINVAR  Phylogenetic invariants    
DNAML     Maximum likelihood method
DNAMLK    DNAML with molecular clock 
DNADIST   Distances from sequences
RESTML    ML for restriction sites
 
    ----------- Programs for distance matrix
data ------------
FITCH     Fitch-Margoliash and least-squares
methods
KITSCH    Fitch-Margoliash and least squares methods
with
          evolutionary clock
 
    --- Programs for gene frequencies and
continuous characters --
CONTML    Maximum likelihood method  
GENDIST   Computes genetic distances
 
    ------------- Programs for discrete state
data -----------
MIX       Wagner, Camin-Sokal, and mixed
parsimony criteria
MOVE      Interactive Wagner, C-S, mixed parsimony
program
PENNY     Finds all most parsimonious trees by
branch-and-bound
BOOT      Bootstrap confidence interval on mixed
parsimony methods
DOLLOP,
DOLMOVE, DOLPENNY, DOLBOOT   same as
preceding four
          programs, but for the Dollo and
polymorphism parsimony 
          criteria
CLIQUE    Compatibility method       
FACTOR    recode multistate characters
 
    ---- Programs for plotting trees and
consensus trees ----
DRAWGRAM  Draws cladograms and phenograms on screens,
plotters and 
          printers
DRAWTREE  Draws unrooted phylogenies on screens,
plotters and 
          printers
CONSENSE  Majority-rule and strict consensus trees
 
The
package includes extensive documentation files that provide the 
information
necessary to use and modify the programs.
 
COMPATIBILITY:
The programs are written in a very standard subset of 
Pascal,
a language that is available on most computers (including 
microcomputers).
The programs require only trivial modifications to 
run
on most machines: for example they work with only minor 
modifications
with Turbo Pascal, and without modifications on VAX 
VMS
Pascal. Pascal source code is distributed in the regular version 
of
PHYLIP: compiled object code is not.  To
use that version, you 
must
have a Pascal compiler.
 
DISKETTE
DISTRIBUTION: The package is distributed in a variety of 
microcomputer
diskette formats.   You should send
FORMATTED 
diskettes,
which I will return with the package written on them. 
Unfortunately,
I cannot write any Apple formats.   See
below for how 
many
diskettes to send.  The programs on the
magnetic tape or 
electronic
network versions may of course also be moved to 
microcomputers
using a terminal program.
 
PRECOMPILED
VERSIONS: Precompiled executable programs for PCDOS 
systems
are available from me.  Specify the
"PCDOS executable 
version"
and send the number of extra diskettes indicated below.   
An
Apple Macintosh version with precompiled code is available from 
Willem
Ellis, Instituut voor Taxonomische Zoologie, Zoologisch 
Museum,
Universiteit van Amsterdam, Plantage Middenlaan 64, 1018DH 
Amsterdam,
Netherlands, who asks that you send 5 800K diskettes.
 
HOW
MANY DISKETTES TO SEND: The following table shows for different 
PCDOS
formats how many diskettes to send, and how many extra 
diskettes
to send for the PCDOS executable version: 
 
Diskette
size   Density   For source code    For
executables, send
                                               
in addition
  3.5 inch      1.44 Mb         
2                     1
  5.25 inch      1.2 Mb          2                     2
  3.5 inch       720 Kb         
4                     2
  5.25 inch      360 Kb          7                     4
 
Some
other formats are also available. You MUST tell me EXACTLY 
which
of these formats you need.  The
diskettes MUST be formatted by 
you
before being sent to me. Sending an extra diskette may be 
helpful.
 
NETWORK
DISTRIBUTION: The package is also available by distribution 
of
the files directly over electronic networks, and by anonymous ftp 
from
evolution.genetics.washington.edu. 
Contact me by electronic 
mail
for details.
 
TAPE
DISTRIBUTION: The programs are also distributed on a magnetic 
tape
provided by you (which should be a small tape and need only be 
able
to hold two megabytes) in the following format: 9-track, ASCII, 
odd
parity, unlabelled, 6250 bpi (unless otherwise indicated).  
Logical
record: 80 bytes, physical record: 3200 bytes (i.e. blocking 
factor
40). There are a total of 71 files. The first one describes 
the
contents of the package.
 
POLICIES:
The package is distributed free.  I do
not make it 
available
or support it in South Africa.  The
package will be 
written
on the diskettes or tape, which will be mailed back.  They 
can
be sent to:
 
                                           Joe Felsenstein
Electronic
mail addresses:                
Department of Genetics SK-50
 Internet: 
joe@genetics.washington.edu   
University of Washington
 Bitnet/EARN: 
felsenst@uwavm             
Seattle, Washington 98195
 UUCP: 
uw-beaver!evolution.genetics!joe  
U.S.A.
 
 
=====================
End of Phylip Info. Sheet ====================
 
 
 
 
REFERENCES.
 
Dayhoff,
M.O., Schwartz, R.M. and Orcutt, B.C. (1978) in Atlas of 
Protein
Sequence and Structure, Vol. 5 supplement 3, Dayhoff, M.O. 
(ed.),
NBRF, Washington, p. 345.  
 
Felsenstein,
J. (1985)  Confidence limits on
phylogenies: an 
approach
using the bootstrap.  Evolution 39,
783-791.
 
Feng,
D.-F. and Doolittle, R.F. (1987) 
Progressive sequence 
alignment
as a prerequisite to correct phylogenetic trees.  
J.Mol.Evol.
25, 351-360.
 
Gotoh,
O. (1982)  An improved algorithm for
matching biological 
sequences.  J.Mol.Biol. 162, 705-708.
 
Gribskov,
M., McLachlan, A.D. and Eisenberg, D. (1987) Profile 
analysis:
detection of distantly related proteins. PNAS USA 84, 
4355-4358.
 
Higgins,
D.G. and Sharp, P.M. (1988)  CLUSTAL: a
package for 
performing
multiple sequence alignments on a microcomputer.  Gene 
73,
237-244.
 
Higgins,
D.G. and Sharp, P.M. (1989)  Fast and
sensitive multiple 
sequence
alignments on a microcomputer.  CABIOS
5, 151-153.
 
Kimura,
M. (1980)   A simple method for
estimating evolutionary 
rates
of base substitutions through comparative studies of 
nucleotide
sequences. J. Mol. Evol. 16, 111-120.
 
Kimura,
M. (1983)   The Neutral Theory of
Molecular Evolution.  
Cambridge
University Press, Cambridge, England.
 
Li,
W.-H., Wu, C.-I. and Luo, C.-C. (1985) 
A new method for 
estimating
synonymous and nonsynonymous rates of nucleotide 
substitution
considering the relative likelihood of nucleotide and 
codon
changes.  Mol.Biol.Evol. 2, 150-174.
 
Myers,
E.W. and Miller, W. (1988)  Optimal
alignments in linear 
space.  CABIOS 4, 11-17.
 
Pearson,
W.R. and Lipman, D.J. (1988)  Improved
tools for biological 
sequence
comparison.  PNAS USA 85, 2444-2448.
 
Saitou,
N. and Nei, M. (1987)  The
neighbor-joining method: a new 
method
for reconstructing phylogenetic trees. 
Mol.Biol.Evol. 4, 
406-425.
 
Sneath,
P.H.A. and Sokal, R.R. (1973)  Numerical
Taxonomy.  Freeman, 
San
Francisco.
 
Sokal,
R.R. and Michener, C.D. (1958)  A
statistical method for 
evaluating
systematic relationships.  Univ.Kansas
Sci.Bull. 38, 
1409-1438.
 
Vingron,
M. and Argos, P. (1991)  Motif
recognition and alignment 
for
many sequences by comparison of dot matrices. 
J.Mol.Biol. 218, 
33-43.
 
Wilbur,
W.J. and Lipman, D.J. (1983)  Rapid
similarity searches of 
nucleic
acid and protein data banks.  PNAS USA
80, 726-730.