SDSE: Simulation of DNA Sequence Evolution
Version 2.0 - 1989
User's manual
Jose L. Oliver, Unidad de Genetica, Facultad de
Ciencias
Universidad de Granada, E-18071-GRANADA (Spain)
EM: oliver@ugr.es
Published references:
Oliver, J.L., A. Marin & J.R. Medina. 1989. SDSE:
A software package to simulate the evolution of a pair of DNA sequences. Computer
Applications in the Biosciences (CABIOS) 5: 47-50.
Rodriguez, F., J.L. Oliver, A. Marin & J.R.
Medina. The general stochastic model of nucleotide substitution. Journal of
Theoretical Biology (in the press).
Copyright (C) Jose L. Oliver, 1989
All rights reserved
Fundamentals
In the study of molecular evolution, DNA sequences are
more informative than protein sequences, since there are synonymous codons as
well as many DNA sequences which are not translated to protein. The
evolutionary change of DNA occurs either by nucleotide substitution, or by
deletion and insertion. SDSE is an integrated program package designed to
simulate DNA sequence evolution through nucleotide substitution. There are many
schemes under which base substitutions can occur (see Nei, 1987 for a review);
SDSE allows the simulation of DNA evolution under any of these schemes. The
simulation is carried out on finite, variable length, DNA sequences through a
strict stochastic process adapted to the particular substitution rates imposed
by each scheme. SDSE incorporates a given number of random substitutions into
an 'ancestral' nucleotide sequence, thus obtaining an 'evolved' sequence;
version 2.0 also allows the estimation of the average number of nucleotide
substitutions per nucleotide site accumulated by two of these evolved DNA
sequences. Detailed descriptions and applications of this software package have
been reported elsewhere (Oliver et al., 1989; Rodriguez et al., 1989).
When two simulated descendent sequences accumulate a
high number of nucleotide substitutions (as occurs in actual sequences
that diverged long ago), multiple and parallel changes may have taken place; if
so, the observed proportion of different nucleotides between the two sequences
underestimates the true number of nucleotide substitutions. To
avoid this underestimation, various statistical methods have been proposed:
Jukes-Cantor (1969) single parameter (JC) method, Kimura's (1980) two-parameter
(K2P) method, Kimura's (1981) three-substitution-type (K3ST) method, Takahata
& Kimura's (1981) (TK) method, Gojobori et al.'s (1982) (GIN) method, Tajima
& Nei's (1984) (TN) method, etc. In order to compare the validity of these
methods and determine their range of applicability in comparison with other
models, a computer simulation of DNA sequence evolution must be performed.
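As an illustration of two of these corrections, the following Python sketch
implements the standard JC and K2P formulas. It is not taken from the SDSE
source (which is written in FORTRAN77); the function names jc_distance and
k2p_distance and the symbols p, P and Q (observed proportions of differing
sites, of transitions and of transversions) are assumptions made for this
example.

```python
import math


def jc_distance(p):
    """Jukes-Cantor (1969) estimate of substitutions per site.

    p is the observed proportion of sites that differ between two sequences.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("the JC estimate is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k2p_distance(P, Q):
    """Kimura (1980) two-parameter estimate of substitutions per site.

    P is the observed proportion of transitional differences,
    Q the observed proportion of transversional differences.
    """
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("the K2P estimate is undefined for these P and Q")
    return -0.5 * math.log(a * math.sqrt(b))
```

Both estimates exceed the observed proportion of differences, and they agree
when transitions make up one third of the differences: k2p_distance(0.1, 0.2)
and jc_distance(0.3) give the same value.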
For convenience, the program uses a discrete-time, rather than a continuous-time,
approach. Equilibrium frequencies for a given substitution scheme are computed
by squaring the corresponding transition matrix repeatedly. Squaring
continues until all the elements in each column have the same value (Gojobori
et al., 1982); the equilibrium composition is then equal to any row and, since the
process starts at the equilibrium composition, it remains there.
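The repeated-squaring procedure can be sketched in Python as follows. This is
an illustration only, not the XSCH code: the function names, the tolerance
tol and the iteration limit max_iter are assumptions, and the matrix is taken
to be row-stochastic (entry [i][j] is the probability that nucleotide i is
replaced by nucleotide j in one time step, with each row summing to 1).

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def equilibrium_by_squaring(p, tol=1e-12, max_iter=200):
    """Square the transition matrix until every column is constant.

    Returns the first row of the converged matrix, i.e. the equilibrium
    frequencies in the same order as the rows and columns of p.
    """
    m = [row[:] for row in p]
    for _ in range(max_iter):
        # Largest difference between any two elements of the same column.
        spread = max(max(col) - min(col) for col in zip(*m))
        if spread < tol:
            return m[0]
        m = mat_mul(m, m)
    raise RuntimeError("matrix powers did not converge")
```

For a Jukes-Cantor-like matrix with 0.97 on the diagonal and 0.01 elsewhere,
the returned frequencies are all 0.25 to within rounding error.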
For each scheme, a random DNA sequence of a given
number of nucleotides at equilibrium frequencies is generated, which is then
used as an ancestral sequence in the simulation. From this sequence, a
descendent sequence can be obtained by nucleotide substitution according to any
of the schemes mentioned above. Another run of the program provides a second
descendent sequence which can then be compared to the first one.
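Generating the ancestral sequence amounts to drawing each site independently
from the chosen nucleotide frequencies. A minimal Python sketch follows; the
function name random_sequence and the dictionary layout of freqs are
assumptions for illustration, not the interface of XSEC.

```python
import random

BASES = "ATCG"


def random_sequence(length, freqs, rng=None):
    """Draw a random DNA sequence whose sites follow the given frequencies.

    freqs maps each of A, T, C and G to its desired frequency.
    rng is an optional random.Random instance, useful for repeatable runs.
    """
    if rng is None:
        rng = random.Random()
    weights = [freqs[b] for b in BASES]
    return "".join(rng.choices(BASES, weights=weights, k=length))
```

Because the sites are drawn at random, the frequencies actually obtained in a
finite sequence will generally differ slightly from those requested.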
The evolution of the ancestral sequence to produce a
descendent sequence is simulated as follows. A nucleotide site in the sequence
is chosen at random. Any of the four nucleotides (A, T, C or G) may be at this
site. Let T, for example, be the nucleotide occupying this site. The range 0-1
is divided into four segments and the transition matrix is used to assign the
length of each segment; the length of the first segment is proportional to the
probability that T does not change (T->T); the lengths of the three
remaining segments are proportional to the probabilities that T is
substituted by A (T->A), by C (T->C) or by G (T->G). A random
number in the range 0-1 is then generated. The original T is not changed if
this number lies within the first segment, but is substituted by A, C or G if
the random number lies within the second, third or fourth segment, respectively.
The same procedure applies to whichever nucleotide occupies the chosen site. The
process is repeated for the desired number of steps, thus generating an
"evolved" sequence.
Each pair of "evolved" sequences obtained
is then compared and the matrix of nucleotide pair frequencies for each
sequence pair is computed. The different methods of estimating nucleotide
substitutions are applied to these nucleotide pair frequency data to estimate
the evolutionary distances.
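The site-by-site substitution procedure and the pairwise comparison described
above can be sketched in Python as follows. This is a hedged illustration,
not the XSIM or XDIV code: the function names, the nested-dictionary layout
of matrix (matrix[x][y] is the probability that x is replaced by y in one
step) and the choice to count steps rather than realized changes are all
assumptions.

```python
import random

BASES = "ATCG"


def substitute_once(seq, matrix, rng):
    """Apply one step to the list seq in place; return True if a base changed."""
    site = rng.randrange(len(seq))
    current = seq[site]
    # First segment: no change; then the three other bases in a fixed order.
    order = [current] + [b for b in BASES if b != current]
    u = rng.random()
    cumulative = 0.0
    new_base = order[-1]  # fallback guards against rounding in the row sum
    for target in order:
        cumulative += matrix[current][target]
        if u < cumulative:
            new_base = target
            break
    seq[site] = new_base
    return new_base != current


def evolve(ancestral, matrix, steps, rng=None):
    """Return the evolved sequence and the number of steps that changed a base."""
    if rng is None:
        rng = random.Random()
    seq = list(ancestral)
    changes = sum(substitute_once(seq, matrix, rng) for _ in range(steps))
    return "".join(seq), changes


def pair_frequencies(seq1, seq2):
    """Relative frequency of each ordered nucleotide pair at aligned sites."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    counts = {x: {y: 0 for y in BASES} for x in BASES}
    for x, y in zip(seq1, seq2):
        counts[x][y] += 1
    n = len(seq1)
    return {x: {y: counts[x][y] / n for y in BASES} for x in BASES}
```

Calling evolve twice on the same ancestral sequence gives two descendants
whose pair-frequency matrix can be passed to any of the distance estimators.
The off-diagonal entries of that matrix sum to the observed proportion of
differing sites.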
Description of the programs
The simulation itself is carried out by the program
XSIM but, in order to facilitate its use, four other utilities have been added
to the package: XSET, XSCH, XSEC and XDIV. The data disk drive can be set up
with XSET; by means of XSCH one can edit any nucleotide substitution scheme and
compute the corresponding nucleotide equilibrium frequencies; the program XSEC
generates a random DNA sequence with the given (e.g. equilibrium) nucleotide
frequencies, which then can be used as an ancestral sequence; the program XSIM
simulates the evolution of this ancestral sequence using the substitution rates
of the chosen scheme and generates an "evolved" sequence having
experienced the desired number of random nucleotide changes; lastly, the
program XDIV compares two "evolved" sequences and computes several
indexes of evolutionary distance. All the programs are menu driven and
interfaced to the system through a principal menu.
All sequence data files used and generated by these
programs conform to the standard GenBank database format (Burks et al., 1985;
Bilofsky et al., 1986). This allows, on the one hand, any sequence retrieved
from this database to be used as an ancestral sequence and, on the other hand,
other standard packages to be applied to analyze or manipulate simulated
sequences. Sequences retrieved from the EMBL nucleotide sequence databank
(Hamm & Cameron, 1986) can first be converted to the GenBank
format with an available utility, e.g. the program AUTHORIN (GenBank,
IntelliGenetics, Inc., 1989). Since all files generated by SDSE are written in
ASCII, sequences or substitution schemes can be easily handled through DOS
commands; for example, they can be viewed on the screen by using the DOS
command TYPE <filename> or directed to any standard printer by using the
DOS command COPY <filename> PRN, etc.
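To give an idea of what such a GenBank-style ASCII file looks like, the
following Python sketch assembles the directory lines written by the sequence
generator (LOCUS, DEFINITION, COMMENT, BASE COUNT and ORIGIN) followed by the
sequence itself. The function name and arguments are hypothetical, and the
spacing is illustrative only: it does not reproduce the exact column
positions of the GenBank format, nor necessarily the output of SDSE.

```python
def genbank_like_entry(name, definition, comment, sequence, date):
    """Build a GenBank-style text entry for a simulated sequence."""
    seq = sequence.lower()
    counts = " ".join(f"{seq.count(b)} {b}" for b in "acgt")
    lines = [
        f"LOCUS       {name} {len(seq)} bp DNA {date}",
        f"DEFINITION  {definition}",
        f"COMMENT     {comment}",
        f"BASE COUNT  {counts}",
        "ORIGIN      Randomly generated",
    ]
    # Sequence lines: 60 bases per line in blocks of 10, each line prefixed
    # by the position of its first base.
    for start in range(0, len(seq), 60):
        chunk = seq[start:start + 60]
        blocks = " ".join(chunk[i:i + 10] for i in range(0, len(chunk), 10))
        lines.append(f"{start + 1:>9} {blocks}")
    lines.append("//")  # end-of-entry marker
    return "\n".join(lines)
```

Since the result is plain ASCII text, it can be written to a file and then
inspected with TYPE or printed with COPY as described above.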
SDSE V2.0 was developed on an IBM/PS2 80/111 computer
running MS-DOS version 4.0, but it will run on any IBM/PC, IBM/PS2 or
compatible computer running DOS 3.2 or higher. The minimum system configuration
is 256K random access memory (RAM) and two disk drives. The programs are
written in FORTRAN77 (Microsoft Fortran V4.01) and are available in machine
(compiled) code. The distributed implementation does not require the numeric
coprocessor but, if present, it should not interfere with the execution of the
programs. A compiled version exploiting the 80387 coprocessor is also available
from the author.
Using the programs
After switching on the computer and booting DOS, you
start the programs simply by putting the program disk in the default drive and
typing SDSE. You can also copy all the programs on the distribution diskette to
a subdirectory on the hard disk and type SDSE. The following menu appears on
the screen:
SDSE V2.0: Simulation of DNA Sequence Evolution
1. Set up data disk
2. Enter substitution schemes
3. Generate random DNA sequences
4. Simulate DNA evolution
5. Estimate divergence
6. Exit to DOS
1.- The first choice is to set up the data disk drive.
This program configures the package for your system. For example, programs can
go in drive A and data on drive B, or vice versa. You can also use any other
configuration, e.g. copy the programs to a subdirectory on the hard disk and
use this (or another) subdirectory as the data disk.
2.- This option allows you to edit substitution schemes
and computes the corresponding equilibrium nucleotide frequencies. Through an
interactive process you can fix the rates of nucleotide substitution, assigning
values to the different parameters used by the best-known schemes (see Fig. 1):
Jukes-Cantor (1969) single parameter (JC) method, Kimura's (1980) two-parameter
(K2P) method, Kimura's (1981) three-substitution-type (K3ST) method, Takahata &
Kimura's (1981) (TK) method, Gojobori et al.'s (1982) (GIN) method, and Tajima
& Nei's (1984) (TN) method. You can also enter the twelve
substitution rates directly, thus using a non-standard substitution scheme (OT). The
edited scheme and the resulting equilibrium frequencies are first shown on the
screen and then written to a file on your data disk, under a filename
ending in .SCH. Thus, you can have a file JC.SCH for the Jukes-Cantor
substitution scheme, a file K2P.SCH for the Kimura-2P, etc.
3.- By choosing option 3 you may generate random DNA
sequences with the nucleotide frequencies you wish. This utility allows you, for
example, to generate an ancestral sequence with the equilibrium frequencies
corresponding to a particular substitution scheme (computed with option 2). The
maximum number of nucleotides allowed is 3000; a version for larger sequences
can be obtained from the author.
Since the generated sequence must be structured in
compliance with the GenBank format (Burks et al., 1985; Bilofsky et al., 1986), this
program requests the information needed for the directory entry of this database. The
directory is the first section of the sequence file. The following directory
lines are implemented by the program. The LOCUS line has the name of the
sequence (a short unique name for the entry, chosen to suggest the sequence's
DEFINITION), the number of nucleotides and the date the sequence was generated.
The DEFINITION line includes a concise description of the sequence. In the
COMMENT line the program includes both the demanded and the resulting
nucleotide frequencies. The BASE COUNT line gives the different nucleotide
counts. The ORIGIN line is the last line in the directory, immediately
preceding the sequence itself, and describes the source of the sequence (e.g.
randomly generated).
4.- This option simulates the process of DNA
evolution. Following a selected substitution scheme, the program incorporates
the requested number of random nucleotide substitutions into the ancestral
sequence, thus generating an "evolved" sequence, which is written,
under the name you wish, to the data disk. You may generate two or more
"evolved" sequences from the same ancestral sequence and compare them
with the following utility.
5.- Option 5 allows you to estimate the divergence
accumulated by two "evolved" sequences through several known
algorithms. These include Jukes and Cantor (1969), Kimura (1980, 1981),
Takahata and Kimura (1981), Gojobori, Ishii and Nei (1982) and Tajima and Nei
(1984). The program also gives the matrix of nucleotide pair frequencies and
the true divergence undergone by the sequences. All these results are first
shown on the screen, and then written to the file XDIV.RES on the data disk;
you can obtain a hardcopy of this file by using the DOS command: COPY XDIV.RES
PRN. Version 2.0 also allows you to compute all the above-mentioned statistics
directly from a matrix of nucleotide pair frequencies (obtained by any
other means).
6.- By choosing this option you quit the program and
return to DOS.
References
Bilofsky, H.S., C. Burks, J.W. Fickett, W.B. Goad,
F.I. Lewitter, W.P. Rindone, C.D. Swindell, and C.S. Tung. 1986. The GenBank
genetic sequence databank. Nucl. Acids Res. 14: 1-4.
Burks, C., Fickett, J.W., Goad, W.B., Kanehisa, M.,
Lewitter, F.I., Rindone, W.P., Swindell, C.D., Tung, C.S., and Bilofsky, H.S.
1985. The GenBank nucleic acid sequence database. Computer Applications in the
Biosciences (CABIOS) 1: 225-233.
Gojobori, T., Ishii, K., and Nei, M. 1982. Estimation
of average number of nucleotide substitutions when the rate of substitution
varies with nucleotide. J. Mol. Evol. 18: 414-423.
Hamm, G.H. and G.N. Cameron. 1986. The EMBL data
library. Nucl. Acids Res. 14: 5-9.
Jukes, T.H., and Cantor, C.R. 1969. Evolution of
protein molecules, in Munro, H.N. (ed.) Mammalian protein metabolism. Academic
Press, New York, pp. 21-132.
Kimura, M. 1980. A simple method for estimating
evolutionary rates of base substitutions through comparative studies of
nucleotide sequences. J. Mol. Evol. 16: 111-120.
Kimura, M. 1981. Estimation of evolutionary distances
between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA. 78:
454-458.
Nei, M. 1987. Molecular Evolutionary Genetics. Columbia
University Press, New York.
Oliver, J.L., A. Marin & J.R. Medina. 1989. SDSE:
A software package to simulate the evolution of a pair of DNA sequences. Computer
Applications in the Biosciences (CABIOS) 5: 47-50.
Rodriguez, F., J.L. Oliver, A. Marin & J.R.
Medina. 1989. The general stochastic model of nucleotide substitution. Journal
of Theoretical Biology (in the press).
Tajima, F. and Nei, M. 1984. Estimation of
evolutionary distance between nucleotide sequences. Mol. Biol. Evol. 1:
269-285.
Takahata, N. and Kimura, M. 1981. A model of evolutionary
base substitutions and its application with special reference to rapid change
of pseudogenes. Genetics 98: 641-657.