SDSE V2.0 - User's manual

SDSE: Simulation of DNA Sequence Evolution

Version 2.0 - 1989

User's manual

Jose L. Oliver, Unidad de Genetica, Facultad de Ciencias

Universidad de Granada, E-18071-GRANADA (Spain)

EM: oliver@ugr.es

Published references:

Oliver, J.L., A. Marin & J.R. Medina. 1989. SDSE: A software package to simulate the evolution of a pair of DNA sequences. Computer Applications in the Biosciences (CABIOS) 5: 47-50.

Rodriguez, F., J.L. Oliver, A. Marin & J.R. Medina. 1989. The general stochastic model of nucleotide substitution. Journal of Theoretical Biology (in press).

Fundamentals

In the study of molecular evolution, DNA sequences are more informative than protein sequences, since there are synonymous codons as well as many DNA sequences that are not translated into protein. The evolutionary change of DNA occurs either by nucleotide substitution or by deletion and insertion. SDSE is an integrated program package designed to simulate DNA sequence evolution through nucleotide substitution. There are many schemes under which base substitutions can occur (see Nei, 1987 for a review); SDSE allows the simulation of DNA evolution under any of these schemes. The simulation is carried out on finite DNA sequences of variable length through a strictly stochastic process adapted to the particular substitution rates imposed by each scheme. SDSE incorporates a given number of random substitutions into an 'ancestral' nucleotide sequence, thus obtaining an 'evolved' sequence; version 2.0 also allows the estimation of the average number of nucleotide substitutions per nucleotide site accumulated by two of these evolved DNA sequences. Detailed descriptions and applications of this software package have been reported elsewhere (Oliver et al., 1989; Rodriguez et al., 1989).

When two simulated descendent sequences accumulate a high number of nucleotide substitutions (as occurs in actual sequences that diverged long ago), multiple and parallel changes may have taken place; if so, the observed proportion of different nucleotides between the two sequences underestimates the true number of nucleotide substitutions. To correct for this underestimation, various statistical methods have been proposed: the Jukes-Cantor (1969) single-parameter (JC) method, Kimura's (1980) two-parameter (K2P) method, Kimura's (1981) three-substitution-type (K3ST) method, Takahata & Kimura's (1981) (TK) method, Gojobori et al.'s (1982) (GIN) method, Tajima & Nei's (1984) (TN) method, etc. To compare the validity of these methods and determine their range of applicability in comparison with other models, a computer simulation of DNA sequence evolution must be performed.

For convenience, the program uses a discrete-time rather than a continuous-time approach. Equilibrium frequencies for a given substitution scheme are computed by repeatedly squaring the corresponding transition matrix. Squaring continues until all the elements in each column have the same value (Gojobori et al., 1982); the equilibrium composition is then equal to any row of the resulting matrix and, because the process starts at the equilibrium composition, it stays there.
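The repeated squaring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the FORTRAN code of the package: the function names, the convergence tolerance and the example matrix are hypothetical, the matrix is taken to be row-stochastic (row i holds the one-step probabilities that base i is kept or replaced), and the bases are ordered A, T, C, G.

```python
def mat_mul(a, b):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def equilibrium_frequencies(p, tol=1e-12, max_squarings=200):
    """Square the transition matrix p until every column is constant.

    tol and max_squarings are illustrative choices, not values from SDSE.
    """
    m = [row[:] for row in p]
    for _ in range(max_squarings):
        # Converged when, within each column, all rows agree to within tol.
        if all(max(col) - min(col) < tol for col in zip(*m)):
            return m[0][:]  # any row is the equilibrium composition
        m = mat_mul(m, m)
    raise ValueError("matrix did not converge; check the substitution scheme")


# Hypothetical scheme (order A, T, C, G): T is replaced twice as fast
# as the other three bases.
P = [
    [0.97, 0.01, 0.01, 0.01],
    [0.02, 0.94, 0.02, 0.02],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
]
pi = equilibrium_frequencies(P)
print([round(x, 4) for x in pi])
```

For this example matrix the balance equations give equilibrium frequencies of 2/7 for A, C and G and 1/7 for T, which is what the squaring converges to.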

For each scheme, a random DNA sequence of a given number of nucleotides at equilibrium frequencies is generated, which is then used as the ancestral sequence in the simulation. From this sequence, a descendent sequence can be obtained by nucleotide substitution according to any of the schemes mentioned above. Another run of the program provides a second descendent sequence, which can then be compared with the first one.
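Generating such an ancestral sequence amounts to drawing each site independently with the requested base frequencies. The sketch below is an assumption-laden illustration in Python (the function name, the dictionary interface and the example frequencies are hypothetical and are not taken from XSEC):

```python
import random

BASES = "ATCG"


def random_sequence(length, freqs, rng=None):
    """Draw `length` independent sites with the given base frequencies.

    freqs maps each of A, T, C, G to its requested frequency.
    """
    rng = rng or random.Random()
    weights = [freqs[b] for b in BASES]
    return "".join(rng.choices(BASES, weights=weights, k=length))


# Hypothetical equilibrium composition for some substitution scheme.
freqs = {"A": 2 / 7, "T": 1 / 7, "C": 2 / 7, "G": 2 / 7}
ancestral = random_sequence(3000, freqs, random.Random(1989))

# Resulting frequencies differ slightly from the requested ones by chance.
observed = {b: ancestral.count(b) / len(ancestral) for b in BASES}
print(observed)
```

Because the sequence is finite, the resulting frequencies only approximate the requested ones, which is presumably why the package reports both.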

The evolution of the ancestral sequence to produce a descendent sequence is simulated as follows. A nucleotide site in the sequence is chosen at random. Any of the four nucleotides (A, T, C or G) may be at this site. Let T, for example, be the nucleotide occupying this site. The range 0-1 is divided into four segments and the transition matrix is used to assign the length of each segment; the length of the first segment is proportional to the probability that T does not change (T -> T); the lengths of the three remaining segments are proportional to the probabilities that T is substituted by A (T -> A), by C (T -> C) or by G (T -> G). A random number in the range 0-1 is then generated. The original T is not changed if this number lies within the first segment, but is substituted by A, C or G if the random number lies within the second, third or fourth segment, respectively. The same procedure applies to whichever nucleotide occupies the chosen site. The process is repeated for the desired number of steps, thus generating an "evolved" sequence.
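The segment procedure just described can be sketched as follows. This is a minimal Python illustration, not the XSIM source: the names `substitute` and `evolve` are hypothetical, the matrix is assumed to be row-stochastic with bases ordered A, T, C, G, and the loop counts steps (a step in which the base is kept is still counted), which is one possible reading of the text.

```python
import random
from itertools import accumulate

BASES = "ATCG"


def substitute(base, p, rng):
    """Pick the new base at one site by cutting the range 0-1 into segments."""
    row = p[BASES.index(base)]
    # First segment: no change; then the three other bases in a fixed order.
    order = [base] + [b for b in BASES if b != base]
    bounds = list(accumulate(row[BASES.index(b)] for b in order))
    u = rng.random()
    for candidate, upper in zip(order, bounds):
        if u < upper:
            return candidate
    return order[-1]  # guard against rounding in the last bound


def evolve(ancestral, p, steps, rng=None):
    """Apply `steps` random site updates; return the sequence and change count."""
    rng = rng or random.Random()
    seq = list(ancestral)
    changes = 0
    for _ in range(steps):
        site = rng.randrange(len(seq))  # site chosen at random
        new_base = substitute(seq[site], p, rng)
        if new_base != seq[site]:
            seq[site] = new_base
            changes += 1
    return "".join(seq), changes


# Hypothetical one-parameter scheme: every substitution has probability 0.01.
P = [[0.97 if i == j else 0.01 for j in range(4)] for i in range(4)]
evolved, n_changes = evolve("ATCG" * 25, P, 1000, random.Random(7))
print(evolved, n_changes)
```

Keeping a count of the changes actually made gives the true number of substitutions, against which the estimates of the correction methods can later be compared.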

Each pair of "evolved" sequences is then compared, and the matrix of nucleotide pair frequencies is computed for the pair. The different methods of estimating nucleotide substitutions are applied to these nucleotide pair frequency data to estimate the evolutionary distances.
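As an illustration of this comparison step, the sketch below computes the matrix of nucleotide pair frequencies and applies the two simplest corrections, using the standard published formulas: K = -(3/4) ln(1 - 4p/3) for Jukes-Cantor, and K = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q) for Kimura's two-parameter method, where p is the proportion of differing sites and P and Q are the proportions of transitional and transversional differences. The function names and data layout are assumptions and do not reflect the internals of XDIV.

```python
import math

BASES = "ATCG"
PURINES = {"A", "G"}


def pair_frequencies(seq1, seq2):
    """Return {(base1, base2): frequency} over aligned sites."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    counts = {(a, b): 0 for a in BASES for b in BASES}
    for a, b in zip(seq1, seq2):
        counts[(a, b)] += 1
    return {pair: c / len(seq1) for pair, c in counts.items()}


def jc_distance(freq):
    """Jukes-Cantor (1969) estimate of substitutions per site."""
    p = sum(f for (a, b), f in freq.items() if a != b)
    return -0.75 * math.log(1 - 4 * p / 3)


def k2p_distance(freq):
    """Kimura (1980) two-parameter estimate of substitutions per site."""
    ts = sum(f for (a, b), f in freq.items()
             if a != b and (a in PURINES) == (b in PURINES))
    tv = sum(f for (a, b), f in freq.items()
             if (a in PURINES) != (b in PURINES))
    return -0.5 * math.log(1 - 2 * ts - tv) - 0.25 * math.log(1 - 2 * tv)


# Ten aligned sites with a single transition (A/G) difference.
freq = pair_frequencies("AAAAAAAAAA", "GAAAAAAAAA")
print(round(jc_distance(freq), 4), round(k2p_distance(freq), 4))
```

Both functions raise a `ValueError` from `math.log` when the sequences are so divergent that the argument of the logarithm is not positive, in which case the corresponding estimate is undefined.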

Description of the programs

The simulation itself is carried out by the program XSIM but, to facilitate its use, four other utilities have been added to the package: XSET, XSCH, XSEC and XDIV. The data disk drive can be set up with XSET; by means of XSCH one can edit any nucleotide substitution scheme and compute the corresponding nucleotide equilibrium frequencies; the program XSEC generates a random DNA sequence with the given (e.g. equilibrium) nucleotide frequencies, which can then be used as an ancestral sequence; the program XSIM simulates the evolution of this ancestral sequence using the substitution rates of the chosen scheme and generates an "evolved" sequence that has experienced the desired number of random nucleotide changes; lastly, the program XDIV compares two "evolved" sequences and computes several indexes of evolutionary distance. All the programs are menu driven and interfaced to the system through a principal menu.

All sequence data files used and generated by these programs conform to the standard GenBank database format (Burks et al., 1985; Bilofsky et al., 1986). This makes it possible, on the one hand, to use any sequence retrieved from this database as an ancestral sequence and, on the other hand, to apply other standard packages to analyze or manipulate the simulated sequences. Sequences retrieved from the EMBL nucleotide sequence databank (Hamm & Cameron, 1986) must first be converted to the GenBank format with some available utility, e.g. the program AUTHORIN (GenBank, IntelliGenetics, Inc., 1989). Since all files generated by SDSE are written in ASCII, sequences or substitution schemes can be easily handled through DOS commands; for example, they can be viewed on the screen with the DOS command TYPE <filename> or sent to any standard printer with the DOS command COPY <filename> PRN, etc.

SDSE V2.0 was developed on an IBM/PS2 80/111 computer running MS-DOS version 4.0, but it will run on any IBM/PC, IBM/PS2 or compatible computer running DOS 3.2 or higher. The minimum system configuration is 256K of random access memory (RAM) and two disk drives. The programs are written in FORTRAN77 (Microsoft Fortran V4.01) and are distributed as compiled machine code. The distributed implementation does not require a numeric coprocessor; if one is present, it should not interfere with the execution of the programs. A compiled version exploiting the 80387 coprocessor is also available from the author.

Using the programs

After switching on the computer and booting DOS, start the programs by putting the program disk in the default drive and typing SDSE. Alternatively, copy all the programs from the distribution diskette to a subdirectory on the hard disk and type SDSE from there. The following menu appears on the screen:

SDSE V2.0: Simulation of DNA Sequence Evolution

1. Set up data disk

2. Enter substitution schemes

3. Generate random DNA sequences

4. Simulate DNA evolution

5. Estimate divergence

6. Exit to DOS

1.- The first option sets up the data disk drive. This program configures the package for your system. For example, programs can reside in drive A and data in drive B, or vice versa. Any other configuration is possible, e.g. copying the programs to a subdirectory on the hard disk and using this (or another) subdirectory as the data disk.

2.- This option lets you edit substitution schemes and computes the corresponding equilibrium nucleotide frequencies. Through an interactive process you can fix the rates of nucleotide substitution by assigning values to the different parameters used by the best-known schemes (see Fig. 1): the Jukes-Cantor (1969) single-parameter (JC) method, Kimura's (1980) two-parameter (K2P) method, Kimura's (1981) three-substitution-type (K3ST) method, Takahata & Kimura's (1981) (TK) method, Gojobori et al.'s (1982) (GIN) method, and Tajima & Nei's (1984) (TN) method. You can also enter the twelve substitution rates directly, thus using a non-standard substitution scheme (OT). The edited scheme and the resulting equilibrium frequencies are first shown on the screen and then written to a file on your data disk, whose filename ends in .SCH. Thus, you can have a file JC.SCH for the Jukes-Cantor substitution scheme, a file K2P.SCH for the Kimura two-parameter scheme, etc.

3.- By choosing option 3 you can generate random DNA sequences with the nucleotide frequencies you wish. This utility lets you, for example, generate an ancestral sequence with the equilibrium frequencies corresponding to a particular substitution scheme (computed with option 2). The maximum number of nucleotides allowed is 3000; a version for larger sequences can be obtained from the author.

Since the generated sequence must comply with the GenBank format (Burks et al., 1985; Bilofsky et al., 1986), this program asks for the information needed for the directory entry of this database. The directory is the first section of the sequence file. The following directory lines are implemented by the program. The LOCUS line has the name of the sequence (a short unique name for the entry, chosen to suggest the sequence's DEFINITION), the number of nucleotides and the date the sequence was generated. The DEFINITION line includes a concise description of the sequence. In the COMMENT line the program includes both the requested and the resulting nucleotide frequencies. The BASE COUNT line gives the count of each nucleotide. The ORIGIN line is the last line in the directory, immediately preceding the sequence itself, and describes the source of the sequence (e.g. randomly generated).

4.- This option simulates the process of DNA evolution. Following a selected substitution scheme, the program incorporates the requested number of random nucleotide substitutions into the ancestral sequence, thus generating an "evolved" sequence, which is written to the data disk under the name you choose. You may generate two or more "evolved" sequences from the same ancestral sequence and compare them with the following utility.

5.- Option 5 estimates the divergence accumulated by two "evolved" sequences through several known algorithms. These include Jukes and Cantor (1969), Kimura (1980, 1981), Takahata and Kimura (1981), Gojobori, Ishii and Nei (1982) and Tajima and Nei (1984). The program also gives the matrix of nucleotide pair frequencies and the true divergence undergone by the sequences. All these results are first shown on the screen and then written to the file XDIV.RES on the data disk; you can obtain a hard copy of this file with the DOS command COPY XDIV.RES PRN. Version 2.0 can also compute all the above-mentioned statistics from a matrix of nucleotide pair frequencies obtained in any other way.

6.- By choosing this option you quit the program and return to DOS.
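To make the scheme editing of option 2 more concrete, the following Python sketch builds one-step transition matrices for the two simplest schemes from their parameters. It is an illustration only: the function names are hypothetical, the parameters are treated as per-step substitution probabilities (alpha for every change under JC; alpha for transitions and beta for transversions under K2P), the bases are ordered A, T, C, G, and nothing here reproduces the layout of the .SCH files.

```python
BASES = "ATCG"
PURINES = {"A", "G"}


def is_transition(a, b):
    """True for purine-purine or pyrimidine-pyrimidine changes (A<->G, T<->C)."""
    return a != b and (a in PURINES) == (b in PURINES)


def scheme_matrix(rate):
    """Build a row-stochastic matrix from rate(a, b), the a -> b probability."""
    matrix = []
    for a in BASES:
        row = [0.0 if a == b else rate(a, b) for b in BASES]
        # The diagonal holds the probability that the base does not change.
        row[BASES.index(a)] = 1.0 - sum(row)
        matrix.append(row)
    return matrix


def jc_scheme(alpha):
    """Jukes-Cantor: a single probability for all twelve substitutions."""
    return scheme_matrix(lambda a, b: alpha)


def k2p_scheme(alpha, beta):
    """Kimura two-parameter: alpha for transitions, beta for transversions."""
    return scheme_matrix(lambda a, b: alpha if is_transition(a, b) else beta)


for row in k2p_scheme(0.02, 0.01):
    print(["%.2f" % x for x in row])
```

A non-standard (OT) scheme corresponds to passing a `rate` function that returns twelve independent values, one per ordered pair of distinct bases.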

References

Bilofsky, H.S., C. Burks, J.W. Fickett, W.B. Goad, F.L. Lewitter, W.P. Rindone, C.D. Swindell, and C.S. Tung. 1986. The GenBank genetic sequence databank. Nucl. Acids Res. 14: 1-4.

Burks, C., Fickett, J.W., Goad, W.B., Kanehisa, M., Lewitter, F.I., Rindone, W.P., Swindell, C.D., Tung, C.S., and Bilofsky, H.S. 1985. The GenBank nucleic acid sequence database. Computer Applications in the Biosciences (CABIOS) 1: 225-233.

Gojobori, T., Ishii, K., and Nei, M. 1982. Estimation of average number of nucleotide substitutions when the rate of substitution varies with nucleotide. J. Mol. Evol. 18: 414-423.

Hamm, G.H. and G.N. Cameron. 1986. The EMBL data library. Nucl. Acids Res. 14: 5-9.

Jukes, T.H., and Cantor, C.R. 1969. Evolution of protein molecules, in Munro, H.N. (ed.) Mammalian protein metabolism. Academic Press, New York, pp. 21-132.

Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16: 111-120.

Kimura, M. 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA. 78: 454-458.

Nei, M. 1987. Molecular Evolutionary Genetics. Columbia University Press, New York.

Oliver, J.L., A. Marin & J.R. Medina. 1989. SDSE: A software package to simulate the evolution of a pair of DNA sequences. Computer Applications in the Biosciences (CABIOS) 5: 47-50.

Rodriguez, F., J.L. Oliver, A. Marin & J.R. Medina. 1989. The general stochastic model of nucleotide substitution. Journal of Theoretical Biology (in press).

Tajima, F. and Nei, M. 1984. Estimation of evolutionary distance between nucleotide sequences. Mol. Biol. Evol. 1: 269-285.

Takahata, N. and Kimura, M. 1981. A model of evolutionary base substitutions and its application with special reference to rapid change of pseudogenes. Genetics 98: 641-657.