††††† ††††† Clustal VMultiple Sequence Alignments.


††††† ††††† Documentation (Installation and Usage).


††††† ††††† Des Higgins

††††† ††††† European Molecular Biology Laboratory

††††† ††††† Postfach 10.2209

††††† ††††† D-6900 Heidelberg

††††† ††††† Germany.


††††† ††††† higgins@EMBL-Heidelberg.DE






††††† ††††† Contents.



††††† ††††† 1†††† ††††† Overview


††††† ††††† 2†††† ††††† Installation


††††† ††††† 3†††† ††††† Interactive usage


††††† ††††† 4†††† ††††† Command-line interface


††††† ††††† 5†††† ††††† Algorithms and references





††††† ††††† 1.Overview


This document describes how to install and use ClustalV on various

machines.ClustalV is a complete upgrade and rewrite of the Clustal

package of multiple alignment programs (Higgins and Sharp, 1988 and

1989).†† The original programs were written in Fortran for

microcomputers running MSDOS.†† You carried out a complete alignment

by running 3 programs in succession.†† Later, these were merged into

a single menu driven program with on-line help, for VAX/VMS.

ClustalV was written in C and has all of the features of the old

programs plus many new ones.It has been compiled and tested using

VAX/VMS C, Decstation ULTRIX C, Gnu C for Sun workstations, Turbo C

for IBM PC's and Think C for Apple Mac's.†† The original Clustal was

written by Des Higgins while he was a Post-Doc in the lab of Paul

Sharp in the Genetics Department, Trinity College, Dublin 2,



The main feature of the old package was the ability to carry out

reliable multiple alignments of many sequences.The sensitivity of

the program is as good as from any other program we have tried, with

the exception of the programs of Vingron and Argos (1991), while it

works in reasonable time on a microcomputer.The programs of

Vingron and Argos are specialised for finding distant similarities

between proteins but require mainframes or workstations and are more

difficult to use.


The main new features are: profile alignments (alignments of old

alignments); phylogenetic trees (Neighbor Joining trees calculated

after multiple alignment with a bootstrapping option); better

sequence input (automatically recognise and read NBRF/PIR, Pearson

(Fasta) or EMBL/SwissProt formats); flexible alignment output

(choose one of: old Clustal format, NBRF/PIR, GCG msf format or

Phylip format); full command line interface (everything that you can

do interactively can be specified on the command line).


In version 7 of the GCG package, there is a program called PILEUP

which uses a very similar algorithm to the one in ClustalV.There

are 2 main differences between the programs: 1) the metric used to

compare the sequences for the initial "guide tree" uses a full

global, optimal alignment in PILEUP instead of the fast, approximate

ones in ClustalV.This makes PILEUP much slower for the comparison

of long sequences.In principle, the distances calculated from

PILEUP will be more sensitive than ours, but in practice it will not

make much difference, except in difficult cases.2)During the

multiple alignment, terminal gaps are penalised in ClustalV but not

in PILEUP.This will make the PILEUP alignments better when the

sequences are of very different lengths (has no effect if there are

no large terminal gaps).††



This software may be distributed and used freely, provided that you

do not modify it or this documentation in any way without the

permission of the authors.


If you wish to refer to ClustalV, please cite:

Higgins,D.G. Bleasby,A.J. and Fuchs,R. (1991) CLUSTAL V: improved software

for multiple sequence alignment.ms. submitted to CABIOS.


The overall multiple alignment algorithm was described in:

Higgins,D.G. and Sharp,P.M. (1989).Fast and sensitive multiple

sequence alignments on a microcomputer.CABIOS, vol. 5, 151-153.





D.H. would particularly like to thank Paul Sharp, in whose lab. this

work originated.We also thank Manolo Gouy, Gene Myers, Peter Rice

and Martin Vingron for suggestions, bug-fixes and help.†††


Des Higgins and Rainer Fuchs,

EMBL Data Library, Heidelberg, Germany.


Alan Bleasby,

Daresbury, UK.


JUNE 1991



††††† ††††† 2.Installation.




As far as possible, we have tried to make ClustalV portable to any

machine with a standard C compiler (proposed ANSI C standard).The

source code, as supplied by us, has been compiled and tested using

the following compilers:



Ultrix C (on a Decstation 2100)

Gnu C on a Sun 4 workstation

Think C on an Apple Macintosh SE

Turbo C on an IBM AT.


In each case, one must make 1 change to 1 line of code in 1 header

file.This is described below.The exact capacity of the program

(how many sequences of what length can be aligned) will depend of

course on available memory but can also be set in this header file.


The package comes as 9 C source files; 3 header files; 1 file of on-

line help; this documentation file; 3 make files:


Source code:††††† clustalv.c, amenu.c, gcgcheck.c, myers.c, sequence.c,

††††††††††† ††††† showpair.c, trees.c, upgma.c, util.c


Header files:††††† clustalv.h, general.h, matrices.h


On-Line help:††††† clustalv.hlp(must be renamed or defined as †††††

††††††††††† ††††† clustalv_help except on PC's)


Documentation:††††† clustalv.doc (this file).


Makefiles:††††† makefile.sun (gnu c on Sun), vmslink.com (vax/vms),

††††††††††† ††††† makefile.ult (ultrix).








Before compiling ClustalV you must look at and possibly change

clustalV.h, shown below..





Main header file for ClustalV. Uncomment ONE of the following lines

depending on which compiler you wish to use.



#define VMS 1†††††††††††† /* VAX VMS */


/*#define MAC 1†††††††††† Think_C for MacIntosh */


/*#define MSDOS 1†††††††† Turbo C for PC's */


/*#define UNIX 1††††††††† Ultrix for Decstations or Gnu C for Sun */




#include "general.h"


#define MAXNAMES††††††††† 10

#define MAXTITLES†††††††† 60

#define FILENAMELEN††††† 256


#define UNKNOWN†† 0

#define EMBLSWISS 1

#define PIR†††††† 2

#define PEARSON†† 3


#define PAGE_LEN†††††† 22


#if VMS

#define DIRDELIM ']'

#define MAXLEN††††††††† 3000

#define MAXN††††† †††††††150

#define FSIZE††††††††† 15000

#define LINELENGTH††††††† 60

#define GCG_LINELENGTH††† 50


#elif MAC

#define DIRDELIM ':'

#define MAXLEN††††††††† 2600

#define MAXN††††††††††††† 30

#define FSIZE††††††††† 10000

#define LINELENGTH††††††† 50

#define GCG_LINELENGTH††† 50


#elif MSDOS

#define DIRDELIM '\\'

#define MAXLEN††††††††† 1300

#define MAXN††††††††††††† 30

#define FSIZE†††††††††† 5000

#define LINELENGTH††††††† 50

#define GCG_LINELENGTH††† 50


#elif UNIX

#define DIRDELIM '/'

#define MAXLEN†††††††† 3000

#define MAXN†††††††††††† 50

#define FSIZE†††††††† 15000

#define LINELENGTH†††††† 60

#define GCG_LINELENGTH†† 50






First, you must remove the comments from one of the first 10 lines.

There are 4 'define' compiler directives here (e.g. #define VMS 1),

and you should use one of these, depending on which system you wish

to work. So choose one of these, remove its comments (if it is

already commented out) and put comments around any of the others

that are still active. If you wish to use a different system, you

will need to insert a new line with a new keyword (which you must

invent) to identify your system.Most of the rest of this header

file is taken up with a block of 'define' statements for each system

type; e.g. the VAX/VMS block is:


#if VMS

#define DIRDELIM ']'

#define MAXLEN††††††††† 3000

#define MAXN†††††††††††† 150

#define FSIZE††††††††† 15000

#define LINELENGTH††††††† 60

#define GCG_LINELENGTH††† 50


In this block, you can specify the maximum number of sequences to be

allowed (MAXN); the maximum sequence length, including gaps

(MAXLEN);FSIZE declares the size of some workspace, used by the

fast 2 sequence comparison routines and should be APPROXIMATELY 4

times MAXLEN; LINELENGTH is the length of the blocks of alignment

output in the output files; GCG_LINELENGTH is the same but for the

GCG compatible output only.Finally, DIRDELIM is the character used

to specify directories and subdirectories in file names.It should

be the character used to seperate the file name itself from the

directory name (e.g. in VMS, file names are like:

$drive:[dir1.dir2.dir3]filename.ext;2so ']' is used as DIRDELIM).††


So, if you want to use a system, not covered in Clustalv.h, you will

have to insert a new block, like the above one.To compile and link

the program, we supply 3 makefiles: one each for VAX/VMS, Ultrix

and GNU C for Sun workstations.





Compile and link the program with the

supplied makefile for vms: vmslink.com .


$ @vmslink


This will produce clustalv.exe (and a lot of .obj files which you can delete).


The on-line help file (clustalv.hlp) should be 'defined' as

clustalv_help as follows:


$ def clustalv_help $drive:[dir1.dir2]clustalv.hlp


where $drive is the drive designation and [dir1.dir2] is the

directory where clustalv.hlp is kept.


To make use of the command-line interface, you must make clustalv a

'foreign' command with:


$ clustalv :== $$drive:[dir1.dir2]clustalv


where $drive is the drive designation and [dir1.dir2] is the

directory where clustalv.exe is kept.






Create a makefile (something.prj) with the names of the source files

(clustalv.c, amenu.c etc.) and 'make' this using the HUGE memory

model.You will get half a dozen warnings from the compiler about

pieces of code than look suspicious to it but ignore these.The

help file should remain as clustalv.hlp .†† To run the program using

the default settings in Clustalv.h, you need approximately 500k of

memory.To reduce this, the main influence on memory usage is the

parameter MAXLEN; reduce MAXLEN to reduce memory usage.




Apple Mac/THINK_C version 4.0.2


This version of the program is not at all Mac like.It runs in a

window, the inside of which looks just like a normal character based

terminal.In the future we might put a proper Mac interface on it

but do not have the time right now.With the default settings in

the header file ClustalV.h, you need just over 800k of memory to run

the program.To reduce this, reduce MAXLEN; this is easily the

biggest influence on memory usage.To compile the program and save

it as an application you need to 'set the application type'; here

you specify how much memory (in kilobytes (k)) the application will

need.You should set this to 900k to run the application as it is

OR reduce MAXLEN in the header.To compile the program you have to

create a 'project'; you 'add' the names of the 9 source files to the

project AND the name of the ANSI library.The source code is too

large to compile in one compilation unit.You will get a 'link

error: code segment too big' if you try to compile and link as is.

You should compile amenu.c (the biggest source file) as a seperate

unit ..... you will have to read the manual/ask someone/mail me to

find out what this is.





††††† ††††† 3.Interactive usage.




Interactive usage of Clustal V is completely menu driven.On-line

help is provided, defaults are offered for all parameters and file

names.With a little effort it should be completely self

explanatory.†† The main menu, which appears when you run the

programs is shown below.Each item brings you to a sub menu.




Main menu for Clustal V:



†††† 1. Sequence Input From Disc

†††† 2. Multiple Alignments

†††† 3. Profile Alignments

†††† 4. Phylogenetic trees


†††† S. Execute a system command

†††† H. HELP

†††† X. EXIT (leave program)



Your choice:




The options S and H appear on all the main menus.H will provide

help and if you type S you will be asked to enter a command, such as

DIR or LS, which will be sent to the system (does not work on

Mac's).Before carrying out an alignment, you must use option 1

(sequence input); the format for sequences is explained below.

Under menu item 2 you will be able to automatically align your

sequences to each other.Menu item 3 allows you to do profile

alignments.These are alignments of old alignments.This allows

you to build up a multiple alignment in stages or add a new sequence

to an old alignment.†† You can calculate phylogenetic trees from

alignments using menu item 4.





††††† ******************************

††††† *†††††† SEQUENCE INPUT.††††† *

††††† ******************************



All sequences should be in 1 file.Three formats are automatically

recognised and used: NBRF/PIR, EMBL/SwissProt and FASTA (Pearson and

Lipman (1988) format).††



Users of the Wisconsin GCG package should use the command TONBRF

(recently changed to TOPIR) to reformat their sequences before use.



Sequences can be in upper or lower case.For proteins, the only

symbols recognised are:A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y and

for DNA/RNA use: A,C,G and T (or U).Any other letters of the

alphabet will be treated as X (proteins) or N (DNA/RNA) for unknown.

All other symbols (blanks, digits etc.) will be ignored EXCEPT for

the hyphen "-" which can be used to specify a gap.This last point

is especially useful for 2 reasons: 1) you can fix the positions of

some gaps in advance; 2) the alignment output from this program can

be written out in NBRF format using "-"'s to specify gaps; these

alignments can be used again as input, either for profile alignments

or for phylogenetic trees.


If you are using an editor to create sequence files, use the FASTA

format as it is by far the simplest (see below).If you have access

to utility programs for generating/converting the NBRF/PIR format

then use it in preference.




FASTA (PEARSON AND LIPMAN, 1988) FORMAT:†††† The sequences are

delimited by an angle bracket ">" in column 1.The text immediately

after the ">" is used as a title.Everything on the following line

until the next ">" or the end of the file is one sequence.




> RABSTOUT†† rabbit Guinness receptor



> MUSNOSE†† mouse nose drying factor

††† mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt

††† fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfdv

> HSHEAVEN††† human Guinness receptor repeat

mhkmmhkgmkhmhgmhmhg†† lhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt

fdgfdsgafdagfdgfsag†† dfavdfdvgavfsvfgvdfsvdgvagvfdv

mhkmmhkgmkhmhgmhmhg†† lhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt

fdgfdsgafdagfdgfsag†† dfavdfdvgavfsvfgvdfsvdgvagvfdv




NBRF/PIR FORMAT†††††††† is similar to FASTA format but immediately

after the ">", you find the characters "P1;" if the sequences are

protein or "DL;" if they are nucleic acid.Clustalv looks for the

";" character as the third character after the ">".If it finds one

it assumes that the format is NBRF if not, FASTA format is assumed.

The text after the ";" is treated as a sequence name while the

entire next line is treated as a title.The sequence is terminated

by a star "*" and the next sequence can then begin (with a >P1; etc

).This is just the basic format description (there are other

variations and rules).


ANY files/sequences in GCG format can be converted to this format

using the TONBRF command (now TOPIR) of the Wisconsin GCG package.






rabbit Guinness receptor




mouse nose drying factor





human Guinness receptor repeat protein.

mhkmmhkgmkhmhgmhmhg†† lhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt

fdgfdsgafdagfdgfsag†† dfavdfdvgavfsvfgvdfsvdgvagvfdv

mhkmmhkgmkhmhgmhmhg†† lhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt

fdgfdsgafdagfdgfsag†† dfavdfdvgavfsvfgvdfsvdgvagvfdv*




EMBL/SWISSPROT FORMAT:†††††† Do not try to create files with this

format unless you have utilities to help.If you are just using an

editor, use one of the above formats.If you do use this format,

the program will ignore everything between the ID line (line

beginning with the characters "ID") and the SQ line.The sequence

is then read from between the SQ line and the "//" characters.




It is critically important for the program to know whether or not it

is aligning DNA or protein sequences.The input routines attempt to

guess which type of sequence is being used by counting the number of

A,C,G,T or U's in the sequences.If the total is more than 85% of

the sequence length then DNA is assumed.If you use very bizarre

sequences (proteins with really strange aa compositions or DNA

sequences with loads of strange ambiguity codes) you might confuse

the program.It is difficult to do but be careful.






††††† ******************************


††††† ******************************


The multiple alignment menu is shown below.Before explaining how

to use it, you must be introduced briefly to the alignment strategy.

If you do not follow this, try using option 1 anyway; the entire

process will be carried out automatically.


To do a complete multiple alignment, we need to know the approximate

relationships of the sequences to each other (which ones are most

similar to each other).We do this by calculating a crude

phylogenetic tree which we call a dendrogram (to distinguish it from

the more sensitive trees available under the phylogenetic tree

menu).†† This dendrogram is used as a guide to align bigger and

bigger groups of sequences during the multiple alignment.The

dendrogram is calculated in 2 stages: 1) all pairs of sequence are

compared using the fast/approximate method of Wilbur and Lipman

(1983); the result of each comparison is a similarity score. 2) the

similarity scores are used to construct the dendrogram using the

UPGMA cluster analysis method of Sneath and Sokal (1973).


The construction of the dendrogram can be very time consuming if you

wish to align many sequences (e.g. for 100 sequences you need to

carry out 100x99/2 sequence comparisons = 4950). During every

multiple alignment, a dendrogram is constructed and saved to a file

(something.dnd).These can be reused later.












††† 1.Do complete multiple alignment now

††† 2.Produce dendrogram file only

††† 3.Use old dendrogram file

††† 4.Pairwise alignment parameters

††† 5.Multiple alignment parameters

††† 6.Output format options


††† S.Execute a system command

††† H.HELP

††† or press [RETURN] to go back to main menu



Your choice:



So, if in doubt, and you have already loaded some sequences from the

main menu, just try option 1 and press the <Return> key in response

to any questions.You will be prompted for 2 file names e.g. if the

sequence input file was called DRINK.PEP, you will be offered

DRINK.ALN as the file to contain the alignment and DRINK.DND for the



If you wish to repeat a multiple alignment (e.g. to experiment with

different gap penalties) but do not wish to make a dendrogram all

over again use menu item 3(providing you are using the same

sequences).Similarly, menu item 2 allows you to produce the

dendrogram file only.







The parameters that control the initial fast/approximate comparisons

can be set from menu item 4 which looks like:






†††† 1. Toggle Scoring Method:Percentage

†††† 2. Gap Penalty††††††††††† :3

†††† 3. K-tuple††††††††††††††† :1

†††† 4. No. of top diagonals†† :5

†††† 5. Window size††††††††††† :5


†††† H. HELP



Enter number (or [RETURN] to exit):




The similarity scores are calculated from fast alignments generated

by the method of Wilbur and Lipman (1983).These are 'hash' or

'word' or 'k-tuple' alignments carried out in 3 stages.


First you mark the positions of every fragment of sequence, K-tuple

long (for proteins, the default length is 1 residue, for DNA it is 2

bases) in both sequences.Then you locate all k-tuple matches

between the 2 sequences.†† At this stage you have to imagine a dot-

matrix plot between the 2 sequences with each k-tuple match as a

dot.†† You find those diagonals in the plot with most matches (you

take the "No. of top diagonals" best ones) and mark all diagonals

within "Window size" of each top diagonal.This process will define

diagonal bands in the plot where you hope the most likely regions of

similarity will lie.


The final alignment stage is to find that head to tail arrangement

of k-tuple matches from these diagonal regions that will give the

highest score.The score is calculated as the number of exactly

matching residues in this alignment minus a "gap penalty" for every

gap that was introduced.When you toggle "Scoring method" you

choose between expressing these similarity scores as raw scores or

expressed as a percentage of the shorter sequence length.


K-TUPLE SIZE:†† Can be 1 or 2 for proteins; 1 to 4 for DNA.

Increase this to increase speed; decrease to improve sensitivity.


GAP PENALTY:††† The number of matching residues that must be found

in order to introduce a gap.This should be larger than K-Tuple

Size.This has little effect on speed or sensitivity.


NO. OF TOP DIAGONALS:††† The number of best diagonals in the

imaginary dot-matrix plot that are considered.Decrease (must be

greater than zero) to increase speed; increase to improve



WINDOW SIZE:††† The number of diagonals around each "top" diagonal

that are considered.†† Decrease for speed; increase for greater



SCORING METHOD: The similarity scores may be expressed as raw scores

(number of identical residues minus a "gap penalty" for each gap) or

as percentage scores.If the sequences are of very different

lengths, percentage scores make more sense.






The main reason for wanting to change the above parameters is SPEED

(especially on microcomputers), NOT SENSITIVITY.†† The dendrograms

that are produced can only show the relationships between the

sequences APPROXIMATELY because the similarity scores are calculated

from seperate pairwise alignments; not from a multiple alignment

(that is what we eventually hope to produce).If the groupings of

the sequences are "obvious", the above method should work well; if

the relationships are obscure or weakly represented by the data, it

will not make much difference playing with the parameters.The main

factor influencing speed is the K-TUPLE SIZE followed by the WINDOW



The alignments are carried out in a small amount of memory.

Occasionally (it is hard to predict), you will run out of memory

while doing these alignments; when this happens, it will say on the

screen: "Sequences (a,b) partially aligned" (instead of "Sequences

(a,b) aligned").This means that the alignment score for these

sequences will be approximate;it is not a problem unless many of

the alignments do this.It can be fixed by using less sensitive

parameters or increasing parameter FSIZE in clustalv.h .





The similarity scores generated by the fast comparison of all the

sequences are used to construct a dendrogram by the UPGMA method of

Sneath and Sokal (1973).This is a form of cluster analysis and the

end result produces something that looks like a tree.It represents

the similarity of the sequences as a hierarchy.The dendrogram is

written to a file in a machine readable format and is ahown below

for an example with 6 sequences.



††† 91.0†† 0†† 0†† 2†† 012000†††††††† ! seq 2 joins seq 3 at 91% ID.

††† 72.0†† 1†† 0†† 3†† 011200†††††††† ! seq 4 joins seqs 2,3 at 72%

††† 71.1†† 0†† 0†† 2†† 000012†††††††† ! seq 5 joins seq 6 at 71%

††† 35.5†† 0†† 2†† 4†† 122200†††††††† ! seq 1 joins seqs 2,3,4

††† 21.7†† 4†† 3†† 6†† 111122†††††††† ! seqs 1,2,3,4 join seqs 5,6


This LOOKS complicated but you do not normally need to care what is

in here.Anyway, each row represents the joining together of 2 or

more sequences.You progress from the top down, joining more and

more sequences until all are joined together; for N sequences you

have N-1 groupings hence there are 5 rows in the above file (there

were 6 sequences).In each row, the first number is the similarity

score of this grouping; ignore the next three columns for the

moment; the last 6 digits in the line show which sequences are

grouped; there is one digit for each sequence (the first digit is

for the first sequence).The rule is:in each row, all of the "1"s

join all of the "2"s; the zero's do nothing.††


Hence, in the first row, sequence 2 joins sequence 3 at a similarity

level of 91% identity; next, sequence 4 joins the previous grouping

of 2 plus 3 at a level of 72% etc.†† This is shown diagrammatically

below.Before leaving the dendrogram format, the other 3 columns of

numbers are: a pointer to the row from which the "1" sequences were

last joined (or zero if only one of them); a pointer to the row in

which the "2"s were last joined; the total number of sequences

joined in this line.





††††††††††††††††††††† I------ 2

†††††††††††††† I------I

†††††††††††††† I††††† I------ 3Diagram of the sequence similarity

††††††††† I----I

††††††††† I††† I------------- 4relationships shown in the above

†††††† I--I

†††††† II------------------ 1dendrogram file (branch lengths are

†† ----I

†††††† I†††††† I------------- 5not to scale).

†††††† I-------I

†††††††††††††† I------------- 6













Having calculated a dendrogram between a set of sequences, the final

multiple alignment is carried out by a series of alignments of

larger and larger groups of sequences.The order is determined by

the dendrogram so that the most similar sequences get aligned first.

Any gaps that are introduced in the early alignments are fixed.

When two groups of sequences are aligned against each other, a full

protein weight matrix (such as a Dayhoff PAM 250) is used.Two gap

penalties are offered: a "FIXED" penalty for opening up a gap and a

"FLOATING" penalty for extending a gap.






†††† 1. Fixed Gap Penalty†††††† :10

†††† 2. Floating Gap Penalty††† :10

†††† 3. Toggle Transitions (DNA):Weighted

†††† 4. Protein weight matrix†† :PAM 250


†††† H. HELP



Enter number (or [RETURN] to exit):



FIXED GAP PENALTY:†† Reduce this to encourage gaps of all sizes;

increase it to discourage them.†† Terminal gaps are penalised same

as all others.BEWARE of making this too small (approx 5 or so); if

the penalty is too small, the program may prefer to align each

sequence opposite one long gap.


FLOATING GAP PENALTY:†† Reduce this to encourage longer gaps;

increase it to shorten them.†† Terminal gaps are penalised same as

all others.BEWARE of making this too small (approx 5 or so); if

the penalty is too small, the program may prefer to align each

sequence opposite one long gap.



DNA TRANSITIONS = WEIGHTED or UNWEIGHTED:†† By default, transitions

(A versus G; C versus T) are weighted more strongly than

transversions (an A aligned with a G will be preferred to an A

aligned with a C or a T).You can make all pairs of nucleotide

equally weighted with this option.


PROTEIN WEIGHT MATRIX:For protein comparisons, a weight matrix is

used to differentially weight different pairs of aligned amino

acids.The default is the well known Dayhoff PAM 250 matrix.We

also offer a PAM 100 matrix, an identity matrix (all weights are the

same for exact matches) or allow you to give the name of a file with

your own matrix.The weight matrices used by Clustal V are shown in

full in the Algorithms and References section of this documentation.


If you input a matrix from a file, it must be in the following

format.Use a 20x20 matrix only (entries for the 20 "normal" amino

acids only; no ambiguity codes etc.).Input the lower left triangle

of the matrix, INCLUDING the diagonal.The order of the amino acids

(rows and columns) must be: CSTPAGNDEQHRKMILVFYW.The values can be

in free format seperated by spaces (not commas).The PAM 250 matrix

is shown below in this format.



†† 02




-310 -115

-410 -1002

-500 -1 0124

-500 -100134

-5 -1 -100 -11224

-3 -1 -10 -1 -221136

-40 -10 -2 -30 -1 -1126

-500 -1 -1 -21001035

-5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2006

-2 -10 -2 -1 -3 -2 -2 -2 -2 -2 -2 -225

-6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3426

-2 -10 -10 -1 -2 -2 -2 -2 -2 -2 -22424

-4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5012 -19

†† 0 -3 -3 -5 -3 -5 -2 -4 -4 -40 -4 -4 -2 -1 -1 -2 7 10

-8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -32 -3 -4 -5 -2 -600 17


Values must be integers and can be all positive or positive and

negative as above.These are SIMILARITY values.







By default, the alignment goes to a file in a self explanatory

"blocked" alignment format.This format is fine for displaying the

results but requires heavy editing if you wish to use the alignment

with other software.To help, we provide 3 other formats which can

be turned on or off.If you have a sequence data set or alignment

in memory, you can also ask for output files in whatever formats are

turned on, NOW.The menu you use to choose format is shown below.


We draw your attention to NBRF/PIR format in particular.This

format is EXACTLY the same as one of the input formats.Therefore,

alignments written in this format can be used again as input (to the

profile alignments or phylogenetic trees).




********* Format of Alignment Output *********



†††† 1. Toggle CLUSTAL format output†† =ON

†††† 2. Toggle NBRF/PIR format output=OFF

†††† 3. Toggle GCG format output†††††† =OFF

†††† 4. Toggle PHYLIP format output††† =OFF


†††† 5. Create alignment output file(s) now?

†††† H. HELP



Enter number (or [RETURN] to exit):




CLUSTAL FORMAT:†††† This is a self explanatory alignment.The

alignment is written out in blocks.Identities are highlighted and

(if you use a PAM 250 matrix) positions in the alignment where all

of the residues are "similar" to each other (PAM 250 score of 8 or

more) are indicated.


NBRF/PIR FORMAT:††† This is the usual NBRF/PIR format with gaps

indicated by hyphens ("-"). AS we have stressed before, this format

is EXACTLY compatible with the sequence input format.Therefore you

can read in these alignments again for profile alignments or for

calculating phylogenetic trees.


GCG FORMAT:†††††††† In version 7 of the Wisconsin GCG package, a new

multiple sequence format was introduced.This is the MSF (Multiple

Sequence Format) format.It can be used as input to the GCG

sequence editor or any of the GCG programs that make use of multiple




PHYLIP FORMAT:††††† This format can be used by the Phylip package of

Joe Felsenstein (see the references/algorithms section for details

of how to get it).Phylip allows you to do a huge range of

phylogenetic analyses (we just offer one method in this program) and

is probably the most widely used set of programs for drawing trees.

It also works on just about every computer you can think of,

providing you have a decent Pascal compiler.






††††† ******************************


††††† ******************************




This menu is for taking two old alignments (or single sequences) and

aligning them with each other.The result is one bigger alignment.

The menu is very similar to the multiple alignment menu except that

there is no mention of dendrograms here (they are not needed) and

you need to input two sets of sequences.The menu looks like this:







††† 1.Input 1st. profile/sequence

†† 2.Input 2nd. profile/sequence

††† 3.Do alignment now

††† 4.Alignment parameters

††† 5.Output format options


††† S.Execute a system command

††† H.HELP

††† or press [RETURN] to go back to main menu



Your choice:



You must input profile number 1 first.†† When both profiles are

loaded, use item 3 (Do alignment now) and the 2 profiles will be

aligned.Items 4 and 5 (parameters and output options) are

identical to the equivalent options on the multiple alignment menu.


The same input routines that are used for general input are used

here i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA

format, with gaps indicated by hyphens ("-").This is why we have

continualy drawn your attention to the NBRF/PIR format as a useful

output format.


Either profile can consist of just one sequence.†† Therefore, if you

have a favourite alignment of sequences that you are working on and

wish to add a new sequence, you can use this menu, provided the

alignment is in the correct format.


The total number of sequences in the two profiles must be less less

than or equal to the MAXN parameter set in the clustalv.h header













††††† ******************************


††††† ******************************



This menu allows you to input an alignment and calculate a

phylogenetic tree.You can also calculate a tree if you have just

carried out a multiple alignment and the alignment is still in

memory.THE SEQUENCES MUST BE ALIGNED ALREADY!!!!!!†† The tree will

look strange if the sequences are not already aligned.You can also

"BOOTSTRAP" the tree to show confidence levels for groupings.This

is SLOW on microcomputers but works fine on workstations or








††† 1.Input an alignment

††† 2.Exclude positions with gaps?††††††† = OFF

††† 3.Correct for multiple substitutions? = OFF

††† 4.Draw tree now

††† 5.Bootstrap tree


††† S.Execute a system command

††† H.HELP

††† or press [RETURN] to go back to main menu



Your choice:





The same input routine that is used for general input is used here

i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA format,

with gaps indicated by hyphens ("-").This is why we have

continualy drawn your attention to the NBRF/PIR format as a useful

output format.


If you have input an alignment, then just use item 4 to draw a tree.

The method used is the Neighbor Joining method of Saitou and Nei

(1987).This is a "distance method". First, percent divergence

figures are calculated between all pairs of sequence.These

divergence figures are then used by the NJ method to give the tree.

Example trees will be shown below.


There are two options which can be used to control the way the

distances are calculated.These are set by options 2 and 3 in the



EXCLUDE POSITIONS WITH GAPS?†† This option allows you to ignore all

alignment positions (columns) where there is a gap in ANY sequence.

This guarantees that "like" is compared with "like" in all distances

i.e. the same positions are used to calculate all distances.It

also means that the distances will be "metric".The disadvantage of

using this option is that you throw away much of the data if there

are many gaps.If the total number of gaps is small, it has little



substitutions accumulate.It becomes increasingly likely that more

than one substitution (as a result of a mutation) will have happened

at a site where you observe just one difference now.This option

allows you to use formulae developed by Motoo Kimura to correct for

this effect.It has the effect of stretching long branches in tres

while leaving short ones relatively untouched.The desired effect

is to try and make distances proportional to time since divergence.


The tree is sent to a file called BLAH.NJ, where BLAH.SEQ is the

name of the input, alignment file.An example is shown below for 6

globin sequences.




DIST†† = percentage divergence (/100)

Length = number of sites used in comparison


†† 1 vs.†† 2DIST = 0.5683;length =††† 139

†† 1 vs.†† 3DIST = 0.5540;length =††† 139

†† 1 vs.†† 4DIST = 0.5315;length =††† 111

†† 1 vs.†† 5DIST = 0.7447;length =††† 141

†† 1 vs.†† 6DIST = 0.7571;length =††† 140

†† 2 vs.†† 3DIST = 0.0897;length =††† 145

†† 2 vs.†† 4DIST = 0.1391;length =††115

†† 2 vs.†† 5DIST = 0.7517;length =††† 145

†† 2 vs.†† 6DIST = 0.7431;length =††† 144

†† 3 vs.†† 4DIST = 0.0957;length =††† 115

†† 3 vs.†† 5DIST = 0.7379;length =††† 145

†† 3 vs.†† 6DIST = 0.7361;length =††† 144

†† 4 vs.†† 5DIST = 0.7304;length =††† 115

†† 4 vs.†† 6DIST = 0.7368;length =††† 114

†† 5 vs.†† 6DIST = 0.2697;length =††† 152



††††††††††† ††††† Neighbor-joining Method


Saitou, N. and Nei, M. (1987) The Neighbor-joining Method:

A New Method for Reconstructing Phylogenetic Trees.

Mol. Biol. Evol., 4(4), 406-425



This is an UNROOTED tree


Numbers in parentheses are branch lengths



Cycle†† 1†††† =SEQ:†† 5 (0.13382) joinsSEQ:†† 6 (0.13592)


Cycle†† 2†††† =SEQ:†† 1 (0.28142) joins Node: ††5 (0.33462)


Cycle†† 3†††† =SEQ:†† 2 (0.05879) joinsSEQ:†† 3 (0.03086)


Cycle†† 4 (Last cycle, trichotomy):


††††††††††† Node:†† 1 (0.20798) joins

††††††††††† Node:†† 2 (0.02341) joins

††††††††††† SEQ:†† 4 (0.04915)




The output file first shows the percent divergence (distance)

figures between each pair of sequence.Then a description of a NJ

tree is given.This description shows which sequences (SEQ:) or

which groups of sequences (NODE: , a node is numbered using the

lowest sequence that belongs to it) join at each level of the tree.


This is an unrooted tree!! This means that the direction of

evolution through the tree is not shown.This can only be inferred

in one of two ways:

1) assume a degree of constancy in the molecular clock and place the

root (bottom of the tree; the point where all the sequences radiate

from) half way along the longest branch.†††† **OR**

2) use an "outgroup", a sequence from an organism that you "know"

must be outside of the rest of the sequences i.e. root the tree

manually, on biological grounds.


The above tree can be represented diagramatically as follows:



††††††††††††††††††††††††† SEQ 1†††††† SEQ 4

†††††††††††††††††††††††††† I†††††††††† I

††††††††† 13.6†††††††††††† I 28.1††††† I 4.9††††††††† 5.9

SEQ 6 ----------I††††††† I†††††††††† I††††††††† I--------- SEQ 2

††††††††††††††††† I††††††† I†††††††††† I††††††††† I

††††††††††††††††† I--------I-----------I----------I

††††††††† 13.4††† I33.5††††† 20.8†††††††† 2.3†† I†† 3.1

SEQ 5 ----------I†††††††††††††††††††††††††††††† I--------- SEQ 3



The figures along each branch are percent divergences along that

branch.If you root the tree by placing the root along the longest

branch (33.5%) then you can draw it again as follows, this time





††††††††††††††††††††††† 13.6

††††††††††††††† I-------------------- SEQ 6

††††† I---------I†††††† 13.4

††††† I†††††††† I-------------------- SEQ 5

††††† I 33.5

-----I†††††††††††††††† 28.1

††††† I†††††††† I-------------------- SEQ 1

††††† I†††††††† I

††††† I---------I††††††††††††††† 4.9

††††††††††††††† I20.8I----------- SEQ 4

††††††††††††††† I--------I

†††††††††††††††††††††††† I†††††† 5.9

†††††††††††††††††††††††† I 2.3 I----- SEQ 2

†††††††††††††††††††††††† I-----I 3.1

†††††††††††††††††††††††††††††† I----- SEQ 3




The longest branch (33.5% between 5,6 and 1,2,3,4) is split between

the 2 bottom branches of the tree.As it happens in this particular

case, sequences 5 and 6 are myoglobins while sequences 1,2,3 and 4

are alpha and beta globins, so you could also justify the above

rooting on biological grounds.If you do not have any particular

need or evidence for the position of the root, then LEAVE THE TREE

UNROOTED.Unrooted trees do not look as pretty as rooted ones but

it is uaual to leave them unrooted if you do not have any evidence

for the position of the root.



BOTSTRAPPING:††† Different sets of sequences and different tree

drawing methods may give different topologies (branching orders) for

parts of a tree that are weakly supported by the data.It is useful

to have an indication of the degree of error in the tree.There are

several ways of doing this, some of them rather technical.We

provide one general purpose method in this program, which makes use

of a technique called bootstrapping (see Felsenstein, 1985).


In the case of sequence alignments, bootstrapping involves taking

random samples of positions from the alignment.If the alignment

has N positions, each bootstrap sample consists of a random sample

of N positions, taken WITH REPLACEMENT i.e. in any given sample,

some sites may be sampled several times, others not at all.Then,

with each sample of sites, you calculate a distance matrix as usual

and draw a tree.If the data very strongly support just one tree

then the sample trees will be very similar to each other and to the

original tree, drawn without bootstrapping.However, if parts of

the tree are not well supported, then the sample trees will vary

considerably in how they represent these parts.


In practice, you should use a very large number of bootstrap

replicates (1000 is recommended, even if it means running the

program for an hour on a slow microcomputer; on a workstation it

will be MUCH faster).For each grouping on the tree, you record the

number of times this grouping occurs in the sample trees.For a

group to be considered "significant" at the 95% level (or P <= 0.05

in statistical terms) you expect the grouping to show up in >= 95%

of the sample trees.If this happens, then you can say that the

grouping is significant, given the data set and the method used to

draw the tree.


So, when you use the bootstrap option, a NJ tree is drawn as before

and then you are asked to say how many bootstrap samples you want

(1000 is the default) and you are asked to give a seed number for

the random number generator.If you give the same seed number in

future, you will get the same results (we hope).Remember to give

different seed numbers if you wish to carry out genuinely different

bootstrap sampling experiments.Below is the output file from using

the same data for the 6 globin sequences as used before.The output

file has the same name as the input fike with the extension ".njb".



STUFF DELETED.... same as for the ordinary NJ output


††††††††††† ††††† Bootstrap Confidence Limits



Random number generator seed =††††† 99


Number of bootstrap trials†† =††† 1000



Diagrammatic representation of the above tree:


Each row represents 1 tree cycle; defining 2 groups.


Each column is 1 sequence; the stars in each line show 1 group;

the dots show the other


Numbers show occurences in bootstrap samples.

****..†† 1000†††††††††††††

.***..†† 1000††††††††††††††† <- This is the answer!!

*..***††† 812




For an unrooted tree with N sequences, there are actually only N-3

genuinely different groupings that we can test (this is the number

of "internal branches"; each internal branch splits the sequences

into 2 groups).In this example, we have 6 sequences with 3

internal branches in the reference tree.In the bootstrap

resampling, we count how often each of these internal branches

occur.Here, we find that the branch which splits 1,2,3 and 4

versus 1 and 2 occurs in all 1000 samples; the branch which splits

2,3 and 4 versus 1,5 and 6 occurs in 1000; the branch which splits 2

and 3 versus 1,4,5 and 6 occurs in 812/1000 samples.We can put

these figures on to the diagrammatic representation we made earlier

of our unrooted NJ tree as follows:




††††††††††††††††††††††††† SEQ 1†††††† SEQ 4

†††††††††††††††††††††††††† I†††††††††† I

†††††††††††††††††††††††††† I†††††††††† I†††††††††††

SEQ 6 ----------I††††††I†††††††††† I††††††††† I--------- SEQ 2

††††††††††††††††† I1000I†† 1000††† I†† 812††† I

††††††††††††††††† I--------I-----------I----------I

††††††††††††††††† I†††††††††††††††††††††††††††††† I†††

SEQ 5 ----------I†††††††††††††††††††††††††† ††††I--------- SEQ 3




You can equally put these confidence figures on the rooted tree (in

fact the interpretation is simpler with rooted trees).With the

unrooted tree, the grouping of sequence 5 with 6 is significant (as

is the grouping of sequences 1,2,3 and 4).Equally the grouping of

sequences 1,5 and 6 is significant (the same as saying that 2,3 and

4 group significantly).However, the grouping of 2 and 3 is not

significant, although it is relatively strongly supported.


Unfortunately, there is a small complication in the interpretation

of these results.In statistical hypothesis testing, it is not

valid to make multiple simultaneous tests and to treat the result of

each test completely independantly.In the above case, if you have

one particular test (grouping) that you wish to make in advance, it

is valid to test IT ALONE and to simply show the other bootstrap

figures for reference.If you do not have any particular test in

mind before you do the bootstrapping, you can just show all of the

figures and use the 95% level as an ARBITRARY cut off to show those

groups that are very strongly supported; but not mention anything

about SIGNIFICANCE testing.In the literature, it is common

practice to simply show the figures with a tree; they frequently

speak for themselves.






††††† ††††† 4.Command Line Interface.




You can do almost everything that can be done from the menus, using

a command line interface. In this mode, the program will take all of

its instructions as "switches" when you activate it; no questions

will be asked; if there are no errors, the program just does an

analysis and stops.†† It does not work so well on the MAC but is

still possible.To get you started we will show you the 2 simplest

uses of the command line as it looks on VAX/VMS.On all other

machines (except the MAC) it works in the same way.


$ clustalv /help†††††††††† **OR**†† $ clustalv /check


Both of the above switches give you a one page summary of the

command line on the screen and then the program stops.



$ clustalv proteins.seq††† **OR**†† $ clustalv /infile=proteins.seq†††


This will read the sequences from the file 'proteins.seq' and do a

complete multiple alignment.Default parameters will be used, the

program will try to tell whether or not the sequences are DNA or

protein and the output will go to a file called 'proteins.aln' . A

dendrogram file called 'proteins.dnd' will also be created.Thus

the default action for the program, when it successfully reads in an

input file is to do a full multiple alignment.Some further

examples of command line usage will be given leter.


Command line switches can be abbreviated but MAKE SURE YOU DO NOT

MAKE THEM AMBIGUOUS.No attempt will be made to detect ambiguity.

Use enough characters to distinguish each switch uniquely.








The full list of allowed switches is given below:



††††††††††††††† DATA (sequences)


/INFILE=file.ext††† :input sequences.If you give an input file and

††††††††††† ††††† ††††† nothing else as a switch, the default action is

††††††††††† ††††† ††††† to do a complete multiple alignment.The input

††††††††††† ††††† ††††† file can also be specified by giving it as the

††††††††††† ††††† ††††† first command line parameter with no "/" in ††

††††††††††† ††††† ††††† front of it e.g $ clustalv file.ext.


/PROFILE1=file.ext††††† :You use these two switches to give the names of

/PROFILE2=file.ext††††† two profiles.The default action is to align

††††††††††† ††††† ††††† the two. You must give the names of both profile

††††††††††† ††††† ††††† files.




††††††††††††††† VERBS (do things)


/HELP†††† ††††† :list the command line parameters on the screen.



/ALIGN††††††† ††††† :do full multiple alignment.This is the default †††

††††††††††† ††††† action if no other switches except for input files

††††††††††† ††††† are given.


/TREE††††† ††††† :calculate NJ tree.If this is the only action †††††

††††††††††† ††††† specified (e.g. $ clustalv proteins.seq/tree ) it IS


††††††††††† ††††† the sequences are not already aligned, you should ††††

††††††††††† ††††† also give the /ALIGN switch.This will align the

††††††††††† ††††† sequences first, output an alignment file and ††

††††††††††† ††††† calculate the tree in memory.


/BOOTSTRAP(=n)††††† :bootstrap a NJ tree (n= number of bootstraps; †††††

††††††††††† ††††† default = 1000).If this is the only action †††† †††††

††††††††††† ††††† specified (e.g. $ clustalv proteins.seq/bootstrap )


††††††††††† ††††† If the sequences are not already aligned, you should

††††††††††† ††††† also give the /ALIGN switch.This will align the

††††††††††† ††††† sequences first, output an alignment file and ††

††††††††††† ††††† calculate the bootstraps in memory.You can set the

††††††††††† ††††† number of bootstrap trials here (e.g./bootstrap=500).

††††††††††† ††††† You can set the seed number for the random number ††††

††††††††††† ††††† generator with /seed=n.




††††††††††††††† PARAMETERS (set things)


***Pairwise alignments:***


/KTUP=n††††† ††††† :word size†††††††††††††


/TOPDIAGS=n††††† :number of best diagonals


/WINDOW=n††† ††††† :window around best diagonals

/PAIRGAP=n†† ††††† :gap penalty




***Multiple alignments:***


/FIXEDGAP=n††††† :fixed length gap pen.


/FLOATGAP=n††††† :variable length gap pen.


/MATRIX=†††† ††††† :PAM100 or ID or file name. The default weight matrix

††††††††††† ††††† for proteins is PAM 250.


/TYPE=p or d ††††† :type is protein or DNA.†† This allows you to ††††

††††††††††† ††††† explicitely overide the programs attempt at guessing

††††††††††† ††††† the type of the sequence.It is only useful if you

††††††††††† ††††† are using sequences with a VERY strange composition.


/OUTPUT=†††† ††††† :GCG or PHYLIP or PIR.The default output is

††††††††††† ††††† Clustal format.


/TRANSIT†††† ††††† :transitions not weighted.The default is to weight

††††††††††† ††††† transitions as more favourable than other mismatches

††††††††††† ††††† in DNA alignments.This switch makes all nucleotide

††††††††††† ††††† mismatches equally weighted.





/KIMURA††††† ††††† :use Kimura's correction on distances.††


/TOSSGAPS††† ††††† :ignore positions with a gap in ANY sequence.


/SEED=n††††† ††††† :seed number for bootstraps.







These examples use the VAX/VMS $ prompt; otherwise, command-line

usage is the same on all machines except the Macintosh.


$ clustalv proteins.seq††††† OR†††† $ clustalv /infile=proteins.seq


Read whatever sequences are in the file "proteins.seq" and do a full

multiple alignment; output will go to the files: "proteins.dnd"

(dendrogram) and "proteins.aln" (alignment).



$ clustalv proteins.seq/ktup=2/matrix=pam100/output=pir


Same as last example but use K-Tuple size of 2; use a PAM 100

protein weight matrix; write the alignment out in NBRF/PIR format

(goes to a file called "proteins.pir").



$ clustalv /profile1=proteins.seq/profile2=more.seq/type=p/fixed=11


Take the alignment in "proteins.seq" and align it with "more.seq"

using default values for everything except the fixed gap penalty

which is set to 11.The sequence type is explicitely set to




$ clustalv proteins.pir/tree/kimura


Take the sequences in proteins.pir (they MUST BE ALIGNED ALREADY)

and calculate a phylogenetic tree using Kimura's correction for




$ clustalv proteins.pir/align/tree/kimura


Same as the previous example, EXCEPT THAT AN ALIGNMENT IS DONE




$ clustalv proteins.seq/align/boot=500/seed=99/tossgaps/type=p


Take the sequences in proteins.seq; they are explicitely set to be

protein; align them; bootstrap a tree using 500 samples and a seed

number of 99.





††††† ††††† 5.Algorithms and references.




In this section, we will try to BRIEFLY describe the algorithms used

in ClustalV and give references.The topics covered are:



††††† -Multiple alignments


††††† -Profile alignments


††††† -Protein weight matrices


††††† -Phylogenetic trees


††††† ††††† -distances


††††† ††††† -NJ method


††††† ††††† -Bootstrapping


††††† ††††† -Phylip


††††† -References









The approach used in ClustalV is a modified version of the method of

Feng and Doolittle (1987) who aligned the sequences in larger and

larger groups according to the branching order in an initial

phylogenetic tree.This approach allows a very useful combination

of computational tractability and sensitivity.


The positions of gaps that are generated in early alignments remain

through later stages.This can be justified because gaps that arise

from the comparison of closely related sequences should not be moved

because of later alignment with more distantly related sequences.

At each alignment stage, you align two groups of already aligned

sequences.This is done using a dynamic programming algorithm where

one allows the residues that occur in every sequence at each

alignment position to contribute to the alignment score.A Dayhoff

(1978) PAM matrix is used in protein comparisons.


The details of the algorithm used in ClustalV have been published in

Higgins and Sharp (1989).This was an improved version of an

earlier algorithm published in Higgins and Sharp (1988).First, you

calculate a crude similarity measure between every pair of sequence.

This is done using the fast, approximate alignment algorithm of

Wilbur and Lipman (1983).Then, these scores are used to calculate

a "guide tree" or dendrogram, which will tell the multiple alignment

stage in which order to align the sequences for the final multiple

alignment.This "guide tree" is calculated using the UPGMA method

of Sneath and Sokal (1973).UPGMA is a fancy name for one type of

average linkage cluster analysis, invented by Sokal and Michener



Having calculated the dendrogram, the sequences are aligned in

larger and larger groups.At each alignment stage, we use the

algorithm of Myers and Miller (1988) for the optimal alignments.

This algorithm is a very memory efficient variation of Gotoh's

algorithm (Gotoh, 1982).It is because of this algorithm that

ClustalV can work on microcomputers.†† Each of these alignments

consists of aligning 2 alignments, using what we call "profile






We use the term "profile alignment" to describe the alignment of 2

alignments.We use this term because the method is a simple

extension of the profile method of Gribskov, et al. (1987) for

aligning 1 sequence with an alignment.Normally, with a 2 sequence

alignment, you use a weight matrix (e.g. a PAM 250 matrix) to give a

score between the pairs of aligned residues.The alignment is

considered "optimal" if it gives the best total score for aligned

residues minus penalties for any gaps (insertions or deletions) that

must be introduced.


Profile alignments are a simple extension of 2 sequence alignments

in that you can treat each of the two input alignments as single

sequences but you calculate the score at aligned positions as the

average weight matrix score of all the residues in one alignment

versus all those in the other e.g. if you have 2 alignments with I

and J sequences respectively; the score at any position is the

average of all the I times J scores of the residues compared

seperately.Any gaps that are introduced are placed in all of the

sequences of an alignment at the same position.The profile

alignments offered in the "profile alignment menu" are also

calculated in this way.





There are 3 built-in weight matrices used by clustalV.These are

the PAM 100 and PAM 250 matrices of Dayhoff (1978) and an identity

matrix.Each matrix is given as the bottom left half, including the

diagonal of a 20 by 20 matrix.The order of the rows and columns is




PAM 250



S†† 02




G-310 -115

N-410 -1002

D-500 -10124

E-500 -100134

Q-5 -1 -1 00 -11224

H-3 -1 -10 -1 -221136

R-40 -10 -2 -30 -1 -1126

K-500 -1 -1 -21001035

M-5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2006

I-2 -10 -2 -1 -3 -2 -2 -2 -2 -2 -2 -225

L-6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3426

V-2 -10 -10 -1 -2 -2 -2 -2 -2 -2 -22424

F-4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5012 -19

Y†† 0 -3 -3 -5 -3 -5 -2 -4 -4 -40 -4 -4 -2 -1 -1 -27 10

W-8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -32 -3 -4 -5 -2 -600 17




















0000000 00000010





000000000000000000 10

0000000000000000000 10






PAM 100




-52†† 7


-52†† 2†† 1†† 6

-81-3-3†† 1†† 8

-82†† 0-3-1-17

-11 -1-2-4-1-14†† 8

-11 -2-3-3†† 0-21†† 5†† 8

-11 -3-3-1-2-5 -1†† 1†† 4†† 9

-6 -4-5-2-5-72-1-2†† 4 11

-6 -1-4, -2-5-8 -3-6-5†† 11 10

-11 -2-1-4-4-51-2-2-1 -338

-11 -4-2-6-3-8 -5-8-6-2 -7 -21 13

-5 -4-1-6-3-7 -4-6-5-5 -7 -4429

-12 -7-5-5-5-8 -6-9-7-3 -5 -7 -6429

-4 -4-1-4†† 0-4 -5-6-5-5 -6 -6 -61518

-10 -5-6-9-7-8 -6 -11 -11 -10 -4 -7-11 -200 -5 12

-2 -6-6 -11-6 -11 -3-9-7-9 -1-10-10 -8 -4 -5 -66 13

-13 -4 -10 -11 -11 -13 -8 -13 -14 -11 -71 -9-11-12 -7-14 -2 -2 19







There are two COMMONLY used approaches for inferring phylogentic

trees from sequence data: parsimony and distance methods. There are

other approaches which are probably superior in theory but which are

yet to be used widely. This does not mean that they are no use; we

(the authors of this program at any rate) simply do not know enough

about them yet.You should see the documentation accompanying the

Phylip package and some of the references there for an explanation

of the different methods and what assumptions are implied when you

use them.††


There is a constant debate in the literature as to the merits of

different methods but unfortunately, a lot of what is said is

incomprehensible or inaccurate.It is also a field that is prone to

having highly opinionated schools of thought.This is a pity

because it prevents rational discussion of the pro's and con's of

the different methods.The approach adopted in ClustalV is to

supply just one method and to produce alignments in a format that

can be used by Phylip.In simple cases, the trees produced will be

as "good" (reliable, robust) as those from ANY other method.In

more complicated cases, there is no single magic recipe that we can

supply that will work well in even most situations.


The method we provide is the Neighbor Joining method (NJ) of Saitou

and Nei (1987) which is a distance method.We use this for three

reasons:it is conceptually and computationally simple; it is fast;

it gives "good" trees in simple cases. It is difficult to prove that

one tree is "better" than another if you do not know the true

phylogeny; the few systematic surveys of methods show it to work

more or less as well as any other method ON AVERAGE.Another reason

for using the NJ method is that it is very commonly used; THIS IS A

BAD REASON SCIENTIFICALLY but at least you will not feel lonely if

you use it.


The NJ method works on a matrix of distances (the distance matrix)

between all pairs of sequence to be analysed.These distances are

related to the degree of divergence between the sequences.It is

normal to calculate the distances from the sequences after they are

multiply aligned.If you calculate them from seperate alignments

(as done for the dendrograms in another part of this program), you

may increase the error considerably.





The simplest measure of distance between sequences is percent

divergence (100% minus percent identity).For two sequences, you

count how many positions differ between them (ignoring all positions

with a gap or an unknown residue) and divide by the number of

positions considered.It is common practice to also ignore all

positions in the alignment where there is a GAP in ANY of the

sequences (Tossgaps ? option in the menu).Usually, you express the

percent distance divided by 100 (gives distances between 0.0 and



This measure of distance is perfectly adequate (with some further

modification described below) for rRNA sequences. However it treats

all residues identically e.g. all amino acid substitutions are

equally weighted. It also treats all positions identically e.g. it

does not take account of different rates of substitution in

different positions of different codons in protein coding DNA

sequences; see Li et al (1985) for a distance measure that does.

Despite these shortcomings, these percent identity distances do work

well in practice in a wide variety of situations.


In a simple world, you would like a distance to be proportional to

the time since the sequences diverged.If this were EXACTLY true,

then the calculation of the tree would be a simple matter of algebra

(UPGMA does this for you) and the branch lengths will be nice and

meaningful (times).In practice this OBVIOUSLY depends on the

existence and quality of the "molecular clock", a subject of on-

going debate.However, even if there is a good clock, there is a

further problem with estimating divergences.As sequences diverge,

they become "saturated" with mutations.Sites can have

substitutions more than once.Calculated distances will

underestimate actual divergence times; the greater the divergence,

the greater the discrepancy.There are various methods for dealing

with this and we provide two commonly used ones, both due to Motoo

Kimura; one for proteins and one for DNA.



For distance K (percent divergence /100 ) ...


Correction for Protein distances:(Kimura, 1983).


†††††† Corrected K = -ln(1.0 - K - (K * k/5.0))




Correction for nucleotide distances: Kimura's 2-parameter method

(Kimura, 1980).


†††††† Corrected K = 0.5*ln(a) + 0.25*ln(b)


†††††† where†††† a = 1/(1 - 2*P - Q)

†††††† and†††††† b = 1/(1 - 2*Q)


†††††† P and Q are the proportions of transitions (A<-->G, C<-->T)

†††††† and transversions occuring between the sequences.



One paradoxical effect of these corrections, is that distances can

be corrected to have more than 100% divergence.That is because,

for very highly diverged sequences of length N, you can estimate

that more than N substitutions have occured by correcting the

observed distance in the above ways.Don't panic!






VERY briefly, the NJ method works as follows.You start by placing

the sequences in a star topology (no internal branches).You then

find that internal branch (take 2 sequences; join them; connect them

to the rest by the internal branch) which when added to the tree

will minimise the total branch length. The two joined sequences

(neighbours) are merged into a single sequence and the process is

repeated.For an unrooted tree with N sequences, there are N-3

internal branches.The above process is repeated N-3 times to give

the final tree.The full details are given in Saitou and Nei



As explained elsewhere in the documentation, you can only root the

tree by one of two methods:


1) assume a degree of constancy in the molecular clock and place the

root along the longest branch (internal or external).Methods that

appear to produce rooted trees automatically are often just doing

this without letting you know; this is true of UPGMA.


2) root the tree on biological grounds.The usual method is to

include an "outgroup", a sequence that you are certain will branch

to the outside of the tree.






Bootstrapping is a general purpose technique that can be used for

placing confidence limits on statistics that you estimate without

any knowledge of the underlying distribution (e.g. a normal or

poisson distribution).In the case of phylogenetic trees, there are

several analytical methods for placing confidence limits on

groupings (actually on the internal branches) but these are either

restricted to particular tree drawing methods or only work on small

trees of 4 or 5 sequences.Felsenstein (1985) showed how to use

bootstrapping to calculate confidence limits on trees.His approach

is completely general and can be applied to any tree drawing method.

The main assumption of the method in this context is that the sites

in the alignment are independant; this will be true of some sequence

alignments (e.g. pseudogenes) but not others (e.g. rRNA's).What

effect, lack of independance will have on the results is not known.


The method works by taking random samples of data from the complete

data set.You compute the test statistic (tree in this case) on

each sample.†† Variation in the statistic computed from the samples

gives a measure of variation in the statistic which can be used to

calculate confidence intervals.Each random sample is the same size

as the complete data set and is taken WITH REPLACEMENT i.e. a data

point can be selected more than once (or not at all) in any given



In the case of an alignment N residues long, each random sample is a

random selection of N sites form the alignment.For each sample, we

calculate a distance matrix and tree in the usual way.Variation in

the sample trees compared to a tree calculated from the full data

set gives an indication of how well supported the tree is by the

data.If the sample trees are very similar to each other and to the

full tree, then the tree is "strongly" supported; if the sample

trees show great variation, then the tree will be weakly supported.

In practice, you usually find some parts of a tree well supported,

others weakly.This can be seen by counting how often each

monophyletic group in the full tree occurs in the sample trees.


For a particular grouping, one considers it to be significant at the

95% level (P <= 0.05) if it occurs in 95% of the bootstrap samples.

If a grouping is significant, it is significant with respect to the

particular data set and method used for drawing the tree.

Biological "significance" is another matter.





The Phylip package was written by Joe Felsenstein, University of

Washington, USA.It provides Pascal source code for a large number

of programs for doing most types of phylogenetic analyses.The

Phylip format alignments produced by this program can be used by all

of the Phylip programs, version 3.4 or later (March 1991).It is

freely available from him as follows.




================= PHYLIP information sheet =====================


††† PHYLIP - Phylogeny Inference Package (version 3.3)


This is a FREE package of programs for inferring phylogenies and

carrying out certain related tasks.At present it contains 28

programs, which carry out different algorithms on different kinds of

data.The programs in the package are:


††††† ---------- Programs for molecular sequence data ----------

PROTPARSProtein parsimony†††††††††

DNAPARS†† Parsimony method for DNA

DNAMOVE†† Interactive DNA parsimony

DNAPENNYBranch and bound for DNA

DNABOOT†† Bootstraps DNA parsimony††

DNACOMP†† Compatibility for DNA

DNAINVARPhylogenetic invariants†††

DNAML†††† Maximum likelihood method

DNAMLK††† DNAML with molecular clock

DNADIST†† Distances from sequences

RESTML††† ML for restriction sites


††† ----------- Programs for distance matrix data ------------

FITCH†††† Fitch-Margoliash and least-squares methods

KITSCH††† Fitch-Margoliash and least squares methods with

††††††††† evolutionary clock


††† --- Programs for gene frequencies and continuous characters --

CONTML††† Maximum likelihood method

GENDIST†† Computes genetic distances


††† ------------- Programs for discrete state data -----------

MIX†††††† Wagner, Camin-Sokal, and mixed parsimony criteria

MOVE††††† Interactive Wagner, C-S, mixed parsimony program

PENNY†††† Finds all most parsimonious trees by branch-and-bound

BOOT††††† Bootstrap confidence interval on mixed parsimony methods

DOLLOP, DOLMOVE, DOLPENNY, DOLBOOT†† same as preceding four

††††††††† programs, but for the Dollo and polymorphism parsimony

††††††††† criteria

CLIQUE††† Compatibility method††††††

FACTOR††† recode multistate characters


††† ---- Programs for plotting trees and consensus trees ----

DRAWGRAMDraws cladograms and phenograms on screens, plotters and

††††††††† printers

DRAWTREEDraws unrooted phylogenies on screens, plotters and

††††††††† printers

CONSENSEMajority-rule and strict consensus trees


The package includes extensive documentation files that provide the

information necessary to use and modify the programs.


COMPATIBILITY: The programs are written in a very standard subset of

Pascal, a language that is available on most computers (including

microcomputers). The programs require only trivial modifications to

run on most machines: for example they work with only minor

modifications with Turbo Pascal, and without modifications on VAX

VMS Pascal. Pascal source code is distributed in the regular version

of PHYLIP: compiled object code is not.To use that version, you

must have a Pascal compiler.


DISKETTE DISTRIBUTION: The package is distributed in a variety of

microcomputer diskette formats.†† You should send FORMATTED

diskettes, which I will return with the package written on them.

Unfortunately, I cannot write any Apple formats.†† See below for how

many diskettes to send.The programs on the magnetic tape or

electronic network versions may of course also be moved to

microcomputers using a terminal program.


PRECOMPILED VERSIONS: Precompiled executable programs for PCDOS

systems are available from me.Specify the "PCDOS executable

version" and send the number of extra diskettes indicated below.††

An Apple Macintosh version with precompiled code is available from

Willem Ellis, Instituut voor Taxonomische Zoologie, Zoologisch

Museum, Universiteit van Amsterdam, Plantage Middenlaan 64, 1018DH

Amsterdam, Netherlands, who asks that you send 5 800K diskettes.


HOW MANY DISKETTES TO SEND: The following table shows for different

PCDOS formats how many diskettes to send, and how many extra

diskettes to send for the PCDOS executable version:


Diskette size†† Density†† For source code††† For executables, send

††††††††††††††††††††††††††††††††††††††††††††††† in addition

3.5 inch††††† 1.44 Mb††††††††† 2†††††††††††††††††††† 1

5.25 inch††††† 1.2 Mb††††††††† 2†††††††††††††††††††† 2

3.5 inch†††††† 720 Kb††††††††† 4†††††††††††††††††††† 2

5.25 inch††††† 360 Kb††††††††† 7†††††††††††††††††††† 4


Some other formats are also available. You MUST tell me EXACTLY

which of these formats you need.The diskettes MUST be formatted by

you before being sent to me. Sending an extra diskette may be



NETWORK DISTRIBUTION: The package is also available by distribution

of the files directly over electronic networks, and by anonymous ftp

from evolution.genetics.washington.edu.Contact me by electronic

mail for details.


TAPE DISTRIBUTION: The programs are also distributed on a magnetic

tape provided by you (which should be a small tape and need only be

able to hold two megabytes) in the following format: 9-track, ASCII,

odd parity, unlabelled, 6250 bpi (unless otherwise indicated).

Logical record: 80 bytes, physical record: 3200 bytes (i.e. blocking

factor 40). There are a total of 71 files. The first one describes

the contents of the package.


POLICIES: The package is distributed free.I do not make it

available or support it in South Africa.The package will be

written on the diskettes or tape, which will be mailed back.They

can be sent to:


††††† †††††††††††††††††††††††††††††††††††††Joe Felsenstein

Electronic mail addresses:†††††††††††††††† Department of Genetics SK-50

Internet:joe@genetics.washington.edu††† University of Washington

Bitnet/EARN:felsenst@uwavm††††††††††††† Seattle, Washington 98195

UUCP:uw-beaver!evolution.genetics!joe†† U.S.A.



===================== End of Phylip Info. Sheet ====================







Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) in Atlas of

Protein Sequence and Structure, Vol. 5 supplement 3, Dayhoff, M.O.

(ed.), NBRF, Washington, p. 345.


Felsenstein, J. (1985)Confidence limits on phylogenies: an

approach using the bootstrap.Evolution 39, 783-791.


Feng, D.-F. and Doolittle, R.F. (1987)Progressive sequence

alignment as a prerequisite to correct phylogenetic trees.

J.Mol.Evol. 25, 351-360.


Gotoh, O. (1982)An improved algorithm for matching biological

sequences.J.Mol.Biol. 162, 705-708.


Gribskov, M., McLachlan, A.D. and Eisenberg, D. (1987) Profile

analysis: detection of distantly related proteins. PNAS USA 84,



Higgins, D.G. and Sharp, P.M. (1988)CLUSTAL: a package for

performing multiple sequence alignments on a microcomputer.Gene

73, 237-244.


Higgins, D.G. and Sharp, P.M. (1989)Fast and sensitive multiple

sequence alignments on a microcomputer.CABIOS 5, 151-153.


Kimura, M. (1980)†† A simple method for estimating evolutionary

rates of base substitutions through comparative studies of

nucleotide sequences. J. Mol. Evol. 16, 111-120.


Kimura, M. (1983)†† The Neutral Theory of Molecular Evolution.

Cambridge University Press, Cambridge, England.


Li, W.-H., Wu, C.-I. and Luo, C.-C. (1985)A new method for

estimating synonymous and nonsynonymous rates of nucleotide

substitution considering the relative likelihood of nucleotide and

codon changes.Mol.Biol.Evol. 2, 150-174.


Myers, E.W. and Miller, W. (1988)Optimal alignments in linear

space.CABIOS 4, 11-17.


Pearson, W.R. and Lipman, D.J. (1988)Improved tools for biological

sequence comparison.PNAS USA 85, 2444-2448.


Saitou, N. and Nei, M. (1987)The neighbor-joining method: a new

method for reconstructing phylogenetic trees.Mol.Biol.Evol. 4,



Sneath, P.H.A. and Sokal, R.R. (1973)Numerical Taxonomy.Freeman,

San Francisco.


Sokal, R.R. and Michener, C.D. (1958)A statistical method for

evaluating systematic relationships.Univ.Kansas Sci.Bull. 38,



Vingron, M. and Argos, P. (1991)Motif recognition and alignment

for many sequences by comparison of dot matrices.J.Mol.Biol. 218,



Wilbur, W.J. and Lipman, D.J. (1983)Rapid similarity searches of

nucleic acid and protein data banks.PNAS USA 80, 726-730.