INTRODUCTION TO MOLECULAR BIOLOGY
DATABASES AND SEQUENCE ANALYSIS
9-13 March 1998
Tore Samuelsson
Göteborg University
Dept of Medical Biochemistry
The flow of genetic information and bioinformatics
DNA -> RNA -> protein -> conformation
Databases in bioinformatics
- Structure
  PDB (Protein Data Bank, Brookhaven Natl Lab, www.pdb.bnl.gov)
    X-ray crystallography, NMR, modeling
  KLOTHO (small molecules, www.ibc.wustl.edu/klotho/)
- Sequence
  DNA
    Genbank (www.ncbi.nlm.nih.gov)
      - Homology search: /BLAST/
      - Entrez : /Entrez/
    GSDB (Genome sequence database)
    EMBL (European Molecular Biology Laboratory, www.ebi.ac.uk)
      - SRS : srs.ebi.ac.uk:5000
    DDBJ (DNA Data Bank of Japan)
  Protein
    Swissprot (www.ebi.ac.uk)
- Genome
  GDB (Human Genome Data Base, gdbwww.gdb.org)
  Mouse genome database (www.informatics.jax.org)
  Yeast genome (genome-ftp.stanford.edu/Saccharomyces)
  Bacterial genomes (www.tigr.org)
- Genetic disorders
  OMIM (Online Mendelian Inheritance in Man, www.ncbi.nlm.nih.gov)
- Taxonomy (www.ncbi.nlm.nih.gov)
- Prosite (expasy.hcuge.ch)
- Literature
  Medline (www.ncbi.nlm.nih.gov)
Structure databases.
The Brookhaven Protein Data Bank
PDB: Number of Entries
Deposited and Released by Year
as of 02/11/98
Year | Deposited | Released |
1973 | 10 | 10 |
1974 | 7 | 2 |
1975 | 18 | 21 |
1976 | 47 | 27 |
1977 | 27 | 38 |
1978 | 26 | 26 |
1979 | 32 | 31 |
1980 | 20 | 30 |
1981 | 39 | 26 |
1982 | 59 | 47 |
1983 | 27 | 50 |
1984 | 36 | 29 |
1985 | 27 | 30 |
1986 | 28 | 29 |
1987 | 64 | 28 |
1988 | 129 | 79 |
1989 | 192 | 89 |
1990 | 306 | 164 |
1991 | 512 | 205 |
1992 | 635 | 226 |
1993 | 922 | 849 |
1994 | 1111 | 1392 |
1995 | 1221 | 1002 |
1996 | 1448 | 1224 |
1997 | 1848 | 1640 |
1998 | 207 | 262 |
NOTE: Computing the totals for number of entries released will produce a sum
greater than what is currently available from PDB. This is due to entries being
withdrawn and/or replaced.
Figure: Number of Entries Deposited (bar graph) and Average Time to Release (line graph), accumulated and averaged on a quarterly basis.
Bar graph, number of entries in the following categories:
· OnHold - (red) On hold per depositor request
· Processing - (green) Being processed
· Released - (blue) Released
Line graph:
· AVERAGE - Average number of days to release
The data were accumulated and averaged on a quarterly basis. The average turnaround times for entries now being processed are estimated from the average of the last 12 months.
Data for the last quarter are accumulated until the date specified on the graph.
PDB Holdings List
Entries loaded on February 25, 1998
7163 Released Atomic Coordinate Entries
1736 Structure Factor Files
400 NMR Restraint Files
Molecule Type
  6339 proteins, peptides, and viruses
  284 protein/nucleic acid complexes
  528 nucleic acids
  12 carbohydrates
  0 others
Experimental Technique
  5850 diffraction and other
  1133 NMR
  180 theoretical modeling
Count By Experiment
Diffraction
  5261 proteins, peptides, and viruses
  346 nucleic acids
  233 protein/nucleic acid complexes
  10 carbohydrates
  0 others
NMR
  922 proteins, peptides, and viruses
  167 nucleic acids
  42 protein/nucleic acid complexes
  2 carbohydrates
  0 others
Model
  156 proteins, peptides, and viruses
  15 nucleic acids
  9 protein/nucleic acid complexes
  0 carbohydrates
  0 others
HEADER
HORMONE
30-OCT-92 1BPH 1BPH
2
COMPND INSULIN
(CUBIC) IN 0.1M SODIUM SALT SOLUTION AT PH9 1BPH 3
SOURCE BOVINE
(BOS $TAURUS) PANCREAS 1BPH 4
AUTHOR
O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR 1BPH
5
REVDAT 2 31-OCT-93 1BPHA 1 REMARK HET FORMUL 1BPHA 1
REVDAT 1 15-JAN-93 1BPH 0 1BPH 6
JRNL
AUTH
O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR 1BPH 7
JRNL
TITL CONFORMATIONAL CHANGES IN
CUBIC INSULIN CRYSTALS 1BPH 8
JRNL TITL 2
IN THE PH RANGE 7-11 1BPH 9
JRNL
REF BIOPHYS.J.
V. 63 1210 1992 1BPH 10
JRNL
REFN ASTM BIOJAU US ISSN 0006-3495 030
1BPH 11
REMARK 1
1BPH 12
REMARK 1
REFERENCE 1 1BPH 13
REMARK 1 AUTH
O.GURSKY,Y.LI,J.BADGER,D.L.D.CASPAR 1BPH 14
REMARK 1 TITL
MONOVALENT CATION BINDING IN CUBIC INSULIN 1BPH 15
REMARK 1 TITL 2 CRYSTALS 1BPH 16
REMARK 1 REF
BIOPHYS.J.
V. 61 604 1992 1BPH 17
REMARK 1 REFN
ASTM BIOJAU US ISSN
0006-3495 030 1BPH
18
REMARK 1
REFERENCE 2 1BPH 19
REMARK 1 AUTH
J.BADGER 1BPH 20
REMARK 1 TITL
FLEXIBILITY IN CRYSTALLINE INSULINS 1BPH 21
REMARK 1 REF
BIOPHYS.J. V.
61 816 1992 1BPH
22
REMARK 1 REFN
ASTM BIOJAU US ISSN
0006-3495 030 1BPH
23
REMARK 1
REFERENCE 3 1BPHA 2
REMARK 1 AUTH
J.BADGER,M.R.HARRIS,C.D.REYNOLDS,A.C.EVANS, 1BPH 25
REMARK 1 AUTH 2
E.J.DODSON,G.G.DODSON,A.C.T.NORTH 1BPH 26
REMARK 1 TITL
STRUCTURE OF THE PIG INSULIN DIMER IN THE CUBIC 1BPH
27
REMARK 1 TITL 2 CRYSTAL 1BPH 28
REMARK 1 REF
ACTA CRYSTALLOGR.,SECT.B
V. 47 127 1991 1BPH 29
REMARK 1 REFN
ASTM ASBSDK DK ISSN
0108-7681 622 1BPH
30
REMARK 1
REFERENCE 4 1BPHA 3
REMARK 1 AUTH
J.BADGER,D.L.D.CASPAR 1BPH 32
REMARK 1 TITL
WATER STRUCTURE IN CUBIC INSULIN CRYSTALS 1BPH 33
REMARK 1 REF
PROC.NAT.ACAD.SCI.USA V.
88 622 1991 1BPH 34
REMARK 1 REFN
ASTM PNASA6 US ISSN
0027-8424 040 1BPH
35
REMARK 1
REFERENCE 5 1BPHA 4
REMARK 1 AUTH
E.J.DODSON,G.G.DODSON,A.LEWITOVA,M.SABESAN 1BPH 37
REMARK 1 TITL
ZINC-FREE CUBIC PIG INSULIN: CRYSTALLIZATION AND 1BPH
38
REMARK 1 TITL 2 STRUCTURE DETERMINATION 1BPH 39
REMARK 1 REF
J.MOL.BIOL. V.
125 387 1978 1BPH 40
REMARK 1 REFN
ASTM JMOBAK UK ISSN
0022-2836 070 1BPH
41
REMARK 2
1BPH 42
REMARK 2
RESOLUTION. 2.0 ANGSTROMS. 1BPH 43
REMARK 3 1BPH 44
REMARK 3
REFINEMENT. 1BPH 45
REMARK 3 PROGRAM PROLSQ 1BPH
46
REMARK 3 AUTHORS HENDRICKSON AND KONNERT 1BPH 47
REMARK 3 R VALUE 0.160 1BPH
48
REMARK 3 RMSD BOND DISTANCES 0.014
ANGSTROMS
1BPH 49
REMARK 3 RMSD BOND ANGLE DISTANCES 0.043
ANGSTROMS
1BPH 50
REMARK 4
1BPH 51
REMARK 4 THIS
CRYSTAL FORM CONTAINS ONE INSULIN MOLECULE PER 1BPH 52
REMARK 4
ASYMMETRIC UNIT. THE SOLVENT VOLUME IS
64 PERCENT OF THE 1BPH 53
REMARK 4 CRYSTAL
VOLUME. THERE ARE MANY ALTERED SIDE CHAIN TORSION 1BPH 54
REMARK 4 ANGLES
AND MAIN CHAIN DISPLACEMENTS IN THE CUBIC CRYSTAL 1BPH 55
REMARK 4
STRUCTURE COMPARED TO OTHER INSULIN CRYSTAL FORMS. ABOUT 1BPH 56
REMARK 4 30 PER
CENT OF THE AMINO ACID RESIDUES CAN ADOPT MULTIPLE 1BPH 57
REMARK 4
CONFORMATIONS WHICH WERE RELIABLY IDENTIFIED BY COMPARISON 1BPH
58
REMARK 4 OF THE
DATA SETS COLLECTED FROM THE CRYSTALS IN THE PH 1BPH 59
REMARK 4 RANGE 7
- 11. THE WEIGHTS OF MANY OF SUCH
MULTIPLE PROTEIN 1BPH 60
REMARK 4 AND
SOLVENT CONFORMATIONS DEPEND ON SOLVENT IONIC 1BPH 61
REMARK 4
CONDITIONS (PH AND SALT CONCENTRATION). 1BPH
62
REMARK 5
1BPH 63
REMARK 5 THERE
ARE FOUR RELATED ENTRIES: 1BPH
64
REMARK 5 1APH
0.1M SODIUM SALT SOLUTION AT PH 7 1BPH 65
REMARK 5 1BPH
0.1M SODIUM SALT SOLUTION AT PH 9 1BPH 66
REMARK 5 1CPH
0.1M SODIUM SALT SOLUTION AT PH 10 1BPH 67
REMARK 5 1DPH
1.0M SODIUM SALT SOLUTION AT PH 11 1BPH 68
REMARK 6 1BPH 69
REMARK 6 IN 1BPH
AND 1CPH, THE SIDE CHAIN OF GLU A 4 CAN ADOPT TWO 1BPH 70
REMARK 6
ALTERNATIVE POSITIONS WHICH OVERLAP.
THEIR RELATIVE WEIGHT 1BPH 71
REMARK 6 AND THE
ATOMIC POSITIONS OF THE SECOND CONFORMER ARE NOT 1BPH 72
REMARK 6
ACCURATELY DETERMINED. 1BPH 73
REMARK 7
1BPH 74
REMARK 7 IN
1APH, 1BPH, AND 1DPH, THE SIDE CHAIN OF GLU B 21 IS 1BPH 75
REMARK 7
DISORDERED. IT HAS BEEN MODELED AS
SUPERPOSITION OF TWO 1BPH 76
REMARK 7
CONFORMATIONS BUT ATOMIC POSITIONS FOR THESE CONFORMATIONS 1BPH
77
REMARK 7 ARE
PROBABLY NOT VERY ACCURATE. 1BPH
78
REMARK 8
1BPH 79
REMARK 8 THE
SIDE CHAIN OF LYS B 29 IS POORLY DEFINED IN THE 1BPH 80
REMARK 8
ELECTRON DENSITY MAPS. IN 1APH AND
1CPH, IT IS INCLUDED 1BPH 81
REMARK 8 WITH
PARTIAL OCCUPANCY. IN 1BPH AND 1DPH,
ITS COORDINATES 1BPH 82
REMARK 8 HAVE
BEEN OMITTED FROM THE ENTRY. 1BPH
83
REMARK 13 1BPHA 5
REMARK 13
CORRECTION. RENUMBER REFERENCES SEQUENTIALLY.
INSERT 1BPHA 6
REMARK 13 MISSING HET AND FORMUL RECORDS FOR NA. 31-OCT-93. 1BPHA 7
SEQRES 1 A 21
GLY ILE VAL GLU GLN CYS CYS ALA SER VAL CYS SER LEU 1BPH 106
SEQRES 2 A 21
TYR GLN LEU GLU ASN TYR CYS ASN 1BPH 107
SEQRES 1 B 30
PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU 1BPH 108
SEQRES 2 B 30
ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR 1BPH 109
SEQRES 3 B 30
THR PRO LYS ALA 1BPH 110
HET DCE 200
4
1,2-DICHLOROETHANE(ETHYLENE DICHLORIDE)
1BPH 111
HET NA 88
1 SODIUM ION 1BPHA 8
FORMUL 3 DCE
C2 H4 CL2 1BPH 112
FORMUL 4 NA
NA1 1BPHA 9
FORMUL 5 HOH
*55(H2 O1) 1BPHA 10
HELIX 1 A1 GLY A
1 VAL A 10 1 1BPH 114
HELIX 2 A2 SER A
12 GLU A 17
5 NOT IDEAL
1BPH 115
HELIX 3 B1 SER B
9 GLY B 20
1
1BPH 116
TURN 1 1B1 CYS
B 19
ARG B 22 1BPH 117
TURN 2 1B2 GLY
B 20
GLY B 23 1BPH 118
SSBOND 1 CYS
A 6 CYS A 11 1BPH 119
SSBOND 2 CYS
A 7 CYS B 7 1BPH 120
SSBOND 3 CYS
A 20 CYS B 19 1BPH 121
CRYST1
78.900 78.900 78.900
90.00 90.00 90.00 I 21 3 24 1BPH 122
ORIGX1
1.000000 0.000000 0.000000 0.00000
1BPH 123
ORIGX2
0.000000 1.000000 0.000000 0.00000
1BPH 124
ORIGX3
0.000000 0.000000 1.000000 0.00000
1BPH 125
SCALE1
0.012674 0.000000 0.000000 0.00000
1BPH 126
SCALE2
0.000000 0.012674 0.000000 0.00000
1BPH 127
SCALE3
0.000000 0.000000 0.012674 0.00000
1BPH 128
ATOM 1 N
GLY A 1 13.994
47.196 31.798 1.00 35.87 1BPH 129
ATOM 2 CA
GLY A 1 14.277
46.226 30.708 1.00 38.67 1BPH 130
ATOM 3 C
GLY A 1 15.574
45.507 31.085 1.00 31.18 1BPH 131
ATOM 4 O
GLY A 1 16.078
45.660 32.217 1.00 22.60 1BPH 132
ATOM 5 N
ILE A 2 16.088
44.766 30.126 1.00 28.39 1BPH 133
ATOM 6 CA
ILE A 2 17.342
44.034 30.404 1.00 23.76 1BPH 134
ATOM 7 C
ILE A 2 18.526
44.939 30.686 1.00 25.29 1BPH 135
ATOM 8 O
ILE A 2 19.425
44.457 31.392 1.00 18.74 1BPH 136
ATOM 9 CB
ILE A 2 17.571
43.072 29.158 1.00 27.36 1BPH 137
ATOM 10 CG1 ILE A
2 18.638 42.049
29.605 1.00 18.03 1BPH 138
ATOM 11 CG2 ILE A
2 17.859 43.936
27.903 1.00 25.54 1BPH 139
ATOM 12 CD1 ILE A
2 18.914 40.930
28.590 1.00 17.07 1BPH 140
ATOM 13 N
VAL A 3 18.619
46.195 30.192 1.00 24.42 1BPH 141
ATOM 14 CA
VAL A 3 19.774
47.080 30.436 1.00 30.26 1BPH 142
ATOM 15 C
VAL A 3 19.952
47.453 31.895 1.00 19.08 1BPH 143
ATOM 16 O
VAL A 3 21.018
47.421 32.561 1.00 28.15 1BPH 144
ATOM 17 CB
VAL A 3 19.719
48.274 29.462 1.00 33.87 1BPH 145
ATOM 18 CG1 VAL A
3 20.847 49.225
29.754 1.00 30.40 1BPH 146
ATOM 19 CG2 VAL A
3 19.868 47.724
28.044 1.00 24.51 1BPH 147
.
.
.
ATOM 127
N GLU A 17
17.257 34.367 30.913
1.00 17.57 1BPH 255
ATOM 128 CA
GLU A 17 16.353
33.393 30.338 1.00 13.26 1BPH 256
ATOM 129 C
GLU A 17 14.968
33.889 30.001 1.00 22.70 1BPH 257
ATOM 130 O
GLU A 17 14.234
33.275 29.212 1.00 25.00 1BPH 258
ATOM 131 CB
GLU A 17 16.183
32.146 31.209 1.00 17.01 1BPH 259
ATOM 132 CG
GLU A 17 17.252
31.160 30.695 1.00 14.38 1BPH 260
ATOM 133 CD
GLU A 17 16.968
29.843 31.385 1.00 24.91 1BPH 261
ATOM 134 OE1 GLU A
17 16.230 29.713
32.350 1.00 25.72 1BPH 262
ATOM 135 OE2 GLU A
17 17.675 28.984
30.830 1.00 22.42 1BPH 263
ATOM 136 N
ASN A 18 14.618
35.021 30.563 1.00 22.30 1BPH 264
ATOM 137 CA
ASN A 18 13.371
35.753 30.369 1.00 29.65 1BPH 265
ATOM 138 C
ASN A 18 13.330
36.318 28.943 1.00 23.17 1BPH 266
ATOM 139 O
ASN A 18 12.197
36.611 28.486 1.00 30.58 1BPH 267
ATOM 172 N
PHE B 1 28.961
32.694 34.302 1.00 38.09 1BPH 300
ATOM 173 CA
PHE B 1 29.545
33.933 33.691 1.00 44.75 1BPH 301
ATOM 174 C
PHE B 1 28.483
35.030 33.562 1.00 18.46 1BPH 302
ATOM 175 O
PHE B 1 28.656
36.170 33.083 1.00 29.15 1BPH 303
ATOM 176 CB
PHE B 1 30.190
33.486 32.346 1.00 36.50 1BPH 304
ATOM 177 CG
PHE B 1 29.191
32.986 31.322 1.00 29.77 1BPH 305
ATOM 178 CD1 PHE B
1 28.691 31.688
31.351 1.00 22.29 1BPH 306
ATOM 179 CD2 PHE B
1 28.736 33.844
30.327 1.00 30.11 1BPH 307
ATOM 180 CE1 PHE B
1 27.758 31.234
30.415 1.00 30.11 1BPH 308
ATOM 181 CE2 PHE B
1 27.822 33.423
29.377 1.00 29.49 1BPH 309
ATOM 182 CZ
PHE B 1 27.329
32.125 29.428 1.00 27.29 1BPH 310
ATOM 183 N
VAL B 2 27.235
34.671 33.935 1.00 25.09 1BPH 311
ATOM 184 CA
VAL B 2 26.085
35.571 33.793 1.00 23.88 1BPH 312
ATOM 185 C
VAL B 2 25.902
36.506 34.969 1.00 24.42 1BPH 313
ATOM 186 O
VAL B 2 25.269
37.560 34.801 1.00 19.63 1BPH 314
ATOM 187 CB
VAL B 2 24.846
34.751 33.391 1.00 28.89 1BPH 315
.
ATOM 413 N
PRO B 28 16.809
47.082 24.129 1.00 39.30 1BPH 541
ATOM 414 CA
PRO B 28 17.550
47.958 25.065 1.00 50.32 1BPH 542
ATOM 415 C
PRO B 28 16.747
49.100 25.692 1.00 51.41 1BPH 543
ATOM 416 O
PRO B 28 16.922
49.526 26.848 1.00 52.87 1BPH 544
ATOM 417 CB
PRO B 28 18.744
48.435 24.231 1.00 33.07 1BPH 545
ATOM 418 CG
PRO B 28 18.261
48.353 22.779 1.00 28.91 1BPH 546
ATOM 419 CD
PRO B 28 17.355
47.133 22.751 1.00 30.72 1BPH 547
ATOM 420 N
LYS B 29 15.830
49.593 24.905 1.00 58.03 1BPH 548
ATOM 421 CA ALYS B
29 14.935 50.708
25.214 0.50 56.38 1BPH 549
ATOM 422 CA BLYS B
29 15.106 50.841
24.970 0.50 57.81 1BPH 550
ATOM 423 C
ALYS B 29 13.602
50.396 25.876 0.50 73.09 1BPH 551
ATOM 424 C
BLYS B 29 13.915
50.201 25.692 0.50 66.40 1BPH 552
ATOM 425 O
ALYS B 29 13.044
51.332 26.517 0.50 80.92 1BPH 553
ATOM 426 O
BLYS B 29 12.908
49.842 25.053 0.50 53.34 1BPH 554
ATOM 427 CB ALYS B
29 14.689 51.541
23.932 0.50 58.98 1BPH 555
ATOM 428 CB BLYS B
29 14.658 51.386
23.598 0.50 45.66 1BPH 556
ATOM 429 N
AALA B 30 13.056
49.194 25.782 0.50 74.55 1BPH 557
ATOM 430 N
BALA B 30 14.075
50.102 27.005 0.50 71.75 1BPH 558
ATOM 431 CA AALA B
30 11.762 48.878
26.416 0.50 75.29 1BPH 559
ATOM 432 CA BALA B
30 13.075 49.536
27.915 0.50 73.80 1BPH 560
ATOM 433 C
AALA B 30 11.853
47.818 27.515 0.50 68.10 1BPH 561
ATOM 434 C
BALA B 30 12.867
50.426 29.144 0.50 73.94 1BPH 562
ATOM 435 O
AALA B 30 10.774
47.235 27.799 0.50 65.90 1BPH 563
ATOM 436 O
BALA B 30 12.394
49.828 30.144 0.50 69.68 1BPH 564
ATOM 437 CB AALA B
30 10.728 48.457
25.375 0.50 76.93 1BPH 565
ATOM 438 CB BALA B
30 13.512 48.144
28.366 0.50 73.70 1BPH 566
ATOM 439 OXTAALA B
30 12.952 47.610
28.048 0.50 63.45 1BPH 567
ATOM 440 OXTBALA B
30 13.182 51.623
29.061 0.50 76.41 1BPH 568
TER 441 ALA B
30 1BPH 569
HETATM 442
CL1 DCE 200 26.950 41.213
19.536 0.50 34.85 1BPH 570
HETATM 443 C1
DCE 200 28.222 40.003 20.178 0.50 24.42 1BPH 571
HETATM 444 C2
DCE 200 28.307
38.776 19.363 0.50 24.99 1BPH 572
HETATM 445
CL2 DCE 200 26.941 37.681
19.833 0.50 33.75 1BPH 573
HETATM 446
NA NA 88 20.339 43.145
38.263 0.50 13.22 1BPH 574
HETATM 447 O
HOH 1 26.102
28.408 28.110 0.33 28.57 1BPH 575
HETATM 448 O
HOH 2 26.719
28.525 28.242 0.66 30.29 1BPH 576
HETATM 449 O
HOH 3 19.213
33.037 38.295 1.00 42.10 1BPH 577
HETATM 450 O
HOH 4 21.104
32.216 20.645 1.00 26.61 1BPH 578
HETATM 451 O
HOH 5 21.954
33.637 38.117 1.00 22.77 1BPH 579
HETATM 498 O
HOH 52 19.217
52.503 35.050 1.00 68.12 1BPH 626
HETATM 499 O
HOH 53 15.376
24.434 25.540 1.00 82.81 1BPH 627
HETATM 500 O
HOH 54 21.768
55.234 32.076 1.00 85.97 1BPH 628
HETATM 501 O
HOH 55 22.667
52.737 33.359 1.00 81.22 1BPH 629
CONECT 48 47
78 1BPH 630
CONECT 54 53
235 1BPH 631
CONECT 78 48
77 1BPH 632
CONECT 161 160
331 1BPH 633
CONECT 235 54
234 1BPH 634
CONECT 331 161
330 1BPH 635
CONECT 442 443 1BPH 636
CONECT 443 442
444 1BPH 637
CONECT 444 443
445 1BPH 638
CONECT 445 444 1BPH 639
MASTER
97 0 2 3 0
2 0 6 499 2
10 5 1BPHA 11
END
1BPH
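PDB coordinate records are fixed-column, so splitting on whitespace breaks on lines such as the LYS B 29 atoms above, where the alternate-location letter is fused to the residue name ("ALYS"). The sketch below reads an ATOM/HETATM line by column position. It assumes the standard PDB column layout (serial in columns 7-11, atom name in 13-16, alternate location in 17, and so on); the function name `parse_atom` and the helper `make_atom_line` are hypothetical, and the two sample lines are rebuilt from the 1BPH values shown above because the listing in this handout has lost its original spacing.

```python
def parse_atom(line):
    """Parse one ATOM/HETATM record using fixed PDB column positions."""
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "alt_loc": line[16].strip(),      # column 17, empty if unused
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66
    }


def make_atom_line(serial, name, alt_loc, res_name, chain, res_seq,
                   x, y, z, occupancy, b_factor):
    """Hypothetical helper: lay the values out in the assumed column positions."""
    return "%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        "ATOM", serial, name, alt_loc, res_name, chain, res_seq, "",
        x, y, z, occupancy, b_factor)


# Values taken from atoms 1 and 421 of entry 1BPH as listed above.
gly_n = make_atom_line(1, " N", "", "GLY", "A", 1,
                       13.994, 47.196, 31.798, 1.00, 35.87)
lys_ca = make_atom_line(421, " CA", "A", "LYS", "B", 29,
                        14.935, 50.708, 25.214, 0.50, 56.38)

atom = parse_atom(lys_ca)
print(atom["res_name"], atom["alt_loc"], atom["occupancy"])
```

An occupancy of 0.50 together with alternate locations A and B is how the entry records the two modeled conformations of a disordered residue.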
Rasmol www.umass.edu/microbio/rasmol/
Weblab www.msi.com
Kinemage www.cryst.bbk.ac.uk/PPS/vsns-pps/technology/kinemage.html
Chime www.umass.edu/microbio/rasmol/
- DNA Sequence
  Genbank (www.ncbi.nlm.nih.gov)
  GSDB (Genome sequence database)
  EMBL (European Molecular Biology Laboratory) (www.ebi.ac.uk)
  DDBJ (DNA Data Bank of Japan)
Current statistics of EMBL: (www.ebi.ac.uk)
The EMBL Nucleotide Sequence Database
The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications and directly submitted by researchers and sequencing groups. Data collection is done in collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ). The database doubles in size approximately every 12 months and, as of December 1997, contains over 1,281 million bases in 1,917,868 sequence entries.
The complete database is available every three months via subscription on CD-ROM. Computer network services additionally provide access to the very latest data as well as sequence similarity searches.
Documentation
· Release Notes
  · Release 53, December 1997.
  · Release 52, October 1997.
  · Release 51, June 1997.
  · Release 50, March 1997.
  · Release 49, December 1996.
  · Release 48, September 1996.
· User Manual, December 1997.
· DDBJ/EMBL/GenBank Feature Table Definition, v2.0, December 15, 1997.
Access
· Subscription to full releases on CD-ROM.
· FTP archive
· Full release (Release 53, December 1997).
· Updates, daily, weekly and cumulative.
· Entry retrieval by database accession number.
· E-mail access to data and sequence search services.
· EMBnet nodes.
Last modified: Thu January 08 1998 17:42
Growth of EMBL database
EMBL and Genbank formats
EMBL format
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
NI   g44010
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae;
OC   Listeria.
XX
RN   [1]
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and
RT   characterization of the gene product.";
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
RN   [2]
RP   1-756
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) on tape to the EMBL Data Library by:
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg,
RL   Biozentrum Am Hubland, 8700 Wuerzburg, FRG
XX
DR   SWISS-PROT; P28763; SODM_LISIV.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..756
FT                   /organism="Listeria ivanovii"
FT                   /strain="ATCC 19119"
FT   RBS             95..100
FT                   /gene="sod"
FT   CDS             109..717
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /db_xref="PID:g44011"
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAV
FT                   SGHAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGN
FT                   LKAAIESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGK
FT                   TPVLGLDVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK*"
FT   terminator      723..746
FT                   /gene="sod"
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     CGTTATTTAA GGTGTTACAT AGTTCTATGG AAATAGGGTC TATACCTTTC GCCTTACAAT        60
     GTAATTTCTT TTCACATAAA TAATAAACAA TCCGAGGAGG AATTTTTAAT GACTTACGAA       120
     TTACCAAAAT TACCTTATAC TTATGATGCT TTGGAGCCGA ATTTTGATAA AGAAACAATG       180
     GAAATTCACT ATACAAAGCA CCACAATATT TATGTAACAA AACTAAATGA AGCAGTCTCA       240
     GGACACGCAG AACTTGCAAG TAAACCTGGG GAAGAATTAG TTGCTAATCT AGATAGCGTT       300
     CCTGAAGAAA TTCGTGGCGC AGTACGTAAC CACGGTGGTG GACATGCTAA CCATACTTTA       360
     TTCTGGTCTA GTCTTAGCCC AAATGGTGGT GGTGCTCCAA CTGGTAACTT AAAAGCAGCA       420
     ATCGAAAGCG AATTCGGCAC ATTTGATGAA TTCAAAGAAA AATTCAATGC GGCAGCTGCG       480
     GCTCGTTTTG GTTCAGGATG GGCATGGCTA GTAGTGAACA ATGGTAAACT AGAAATTGTT       540
     TCCACTGCTA ACCAAGATTC TCCACTTAGC GAAGGTAAAA CTCCAGTTCT TGGCTTAGAT       600
     GTTTGGGAAC ATGCTTATTA TCTTAAATTC CAAAACCGTC GTCCTGAATA CATTGACACA       660
     TTTTGGAATG TAATTAACTG GGATGAACGA AATAAACGCT TTGACGCAGC AAAATAATTA       720
     TCGAAAGGCT CACTTAGGTG GGTCTTTTTA TTTCTA                                 756
GenBank Format
LOCUS       LISOD         756 bp    DNA             BCT       30-JUN-1993
DEFINITION  L.ivanovii sod gene for superoxide dismutase.
ACCESSION   X64011 S78972
NID         g44010
KEYWORDS    sod gene; superoxide dismutase.
SOURCE      Listeria ivanovii.
  ORGANISM  Listeria ivanovii
            Eubacteria; Firmicutes; Low G+C gram-positive bacteria;
            Bacillaceae; Listeria.
REFERENCE   1  (bases 1 to 756)
  AUTHORS   Haas,A. and Goebel,W.
  TITLE     Cloning of a superoxide dismutase gene from Listeria ivanovii by
            functional complementation in Escherichia coli and characterization
            of the gene product
  JOURNAL   Mol. Gen. Genet. 231 (2), 313-322 (1992)
  MEDLINE   92140371
REFERENCE   2  (bases 1 to 756)
  AUTHORS   Kreft,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (21-APR-1992) J. Kreft, Institut f. Mikrobiologie,
            Universitaet Wuerzburg, Biozentrum Am Hubland, 8700 Wuerzburg, FRG
FEATURES             Location/Qualifiers
     source          1..756
                     /organism="Listeria ivanovii"
                     /strain="ATCC 19119"
                     /db_xref="taxon:1638"
     RBS             95..100
                     /gene="sod"
     gene            95..746
                     /gene="sod"
     CDS             109..717
                     /gene="sod"
                     /EC_number="1.15.1.1"
                     /codon_start=1
                     /product="superoxide dismutase"
                     /db_xref="PID:g44011"
                     /db_xref="SWISS-PROT:P28763"
                     /transl_table=11
                     /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVS
                     GHAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLK
                     AAIESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPV
                     LGLDVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
     terminator      723..746
                     /gene="sod"
BASE COUNT      247 a    136 c    151 g    222 t
ORIGIN
        1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
       61 gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa
      121 ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg
      181 gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca
      241 ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt
      301 cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta
      361 ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca
      421 atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg
      481 gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt
      541 tccactgcta accaagattc tccacttagc gaaggtaaaa ctccagttct tggcttagat
      601 gtttgggaac atgcttatta tcttaaattc caaaaccgtc gtcctgaata cattgacaca
      661 ttttggaatg taattaactg ggatgaacga aataaacgct ttgacgcagc aaaataatta
      721 tcgaaaggct cacttaggtg ggtcttttta tttcta
//
FORMAT OF THE EMBL DATABASE
The nucleotide
sequence database is composed of sequence entries. Each entry corresponds to a
single contiguous sequence as contributed or reported in the literature. In
many cases, entries have been assembled from several papers reporting
overlapping sequence regions. Conversely, a single paper often provides data
for several entries, as when homologous sequences from different organisms are
compared.
·
3.1 Classes of Data
·
3.2 Database Divisions
·
3.3 Structure of an
Entry
·
3.4 Line Structure
·
3.4.1 The ID Line
·
3.4.2 The AC Line
·
3.4.3 The NI Line
·
3.4.4 The DT Line
·
3.4.5 The DE Line
·
3.4.6 The KW Line
·
3.4.7 The OS Line
·
3.4.8 The OC Line
·
3.4.9 The OG Line
·
3.4.10 The Reference (RN, RC,
RP, RX, RA, RT, RL) Lines
·
3.4.10.1 The RN Line
·
3.4.10.2 The RC Line
·
3.4.10.3 The RP Line
·
3.4.10.4 The RX Line
·
3.4.10.5 The RA Line
·
3.4.10.6 The RT Line
·
3.4.10.7 The RL Line
·
3.4.11 The DR Line
·
3.4.12 The FH Line
·
3.4.13 The FT Line
·
3.4.14 The SQ Line
·
3.4.15 The Sequence Data
Line
·
3.4.16 The CC Line
·
3.4.17 The XX Line
·
3.4.18 The // Line
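Because every line of an EMBL entry starts with a two-letter line code, an entry can be read with very little code. The following is a minimal sketch, not the EBI's own parser: the function name `parse_embl_entry` is hypothetical, the sample is a shortened copy of the LISOD entry (only the first 60 bases are included), and the closing `//` line follows the line-type list above. It groups lines by code, skips the `XX` spacer lines, and collects the sequence that follows the SQ line.

```python
from collections import defaultdict

SAMPLE = """\
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     CGTTATTTAA GGTGTTACAT AGTTCTATGG AAATAGGGTC TATACCTTTC GCCTTACAAT        60
//
"""


def parse_embl_entry(text):
    """Return (fields, sequence) for a single EMBL flat-file entry."""
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-entry line
            break
        if in_sequence:
            # Sequence data lines: keep the letters, drop blanks and the
            # running base count at the end of the line.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, content = line[:2], line[2:].strip()
        if code == "XX":                 # spacer line, carries no data
            continue
        fields[code].append(content)
        if code == "SQ":                 # everything after SQ is sequence
            in_sequence = True
    return dict(fields), "".join(sequence).upper()


fields, seq = parse_embl_entry(SAMPLE)
print(fields["AC"], len(seq))
```

Keeping the fields as lists preserves repeated line types such as DT, OC, RT and FT, which span several lines in a full entry.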
3.2.4 Feature key examples
Key             Description
conflict        Separate determinations of the "same" sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein-coding sequence
misc_RNA        Generic label for an undefined RNA
insertion_seq   Insertion element
D-loop          Mitochondrial or other D-loop structure
3.3.4 Qualifier examples
Key             Location/Qualifiers
CDS             86..742
                /product="hypoxanthine phosphoribosyltransferase"
                /label=hprt
                /note="hprt catalyzes vital steps in the
                reutilization pathway for purine biosynthesis
                and its deficiency leads to forms of
                ""gouty"" arthritis"
rep_origin      234..243
                /direction=left
CDS             109..564
                /usedin=X10009:catalase
3.5.3 Location examples
The following is a list of common location descriptors with their meanings:
Location                    Description
467                         Points to a single base in the presented sequence
340..565                    Points to a continuous range of bases bounded by and
                            including the starting and ending bases
<345..500                   Indicates that the exact lower boundary point of a
                            feature is unknown. The location begins at some base
                            previous to the first base specified (which need not
                            be contained in the presented sequence) and continues
                            to and includes the ending base
<1..888                     The feature starts before the first sequenced base and
                            continues to and includes base 888
(102.110)                   Indicates that the exact location is unknown but that
                            it is one of the bases between bases 102 and 110,
                            inclusive
(23.45)..600                Specifies that the starting point is one of the bases
                            between bases 23 and 45, inclusive, and the end point
                            is base 600
(122.133)..(204.221)        The feature starts at a base between 122 and 133,
                            inclusive, and ends at a base between 204 and 221,
                            inclusive
123^124                     Points to a site between bases 123 and 124
145^177                     Points to a site between two adjacent bases anywhere
                            between bases 145 and 177
complement(34..(122.126))   Start at one of the bases complementary to those
                            between 122 and 126 on the presented strand and finish
                            at the base complementary to base 34 (the feature is
                            on the strand complementary to the presented strand)
join("acct",449..670)       Concatenate the four bases 'acct' to the 5' end of the
                            sequence from bases 449 to 670, inclusive
J00193:hladr                Points to a feature whose location is described in
                            another entry: the feature labelled 'hladr' in the
                            entry (in this database) with primary accession
                            number 'J00193'
J00194:(100..202)           Points to bases 100 to 202, inclusive, in the entry
                            (in this database) with primary accession number
                            'J00194'
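The simplest of these descriptors can be turned into numeric coordinates with a small parser. The sketch below is an illustration only and covers just a single base, a plain range, an unknown lower boundary (`<`), and `complement(...)` around one of those. The names `Span` and `parse_simple_location` are hypothetical. The `>` marker for an unknown upper boundary is not in the table above; it is included here as an assumption taken from the feature table definition. The uncertain forms such as `(102.110)`, `join(...)` and references to other entries are deliberately rejected rather than guessed at.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    strand: int          # +1 = presented strand, -1 = complementary strand
    partial_start: bool  # True if the location began with '<'
    partial_end: bool    # True if the end was marked with '>'


_RANGE = re.compile(r"(<?)(\d+)(?:\.\.(>?)(\d+))?")
_COMPLEMENT = re.compile(r"complement\((.+)\)")


def parse_simple_location(text):
    """Parse '467', '340..565', '<345..500' or complement(...) of those."""
    text = text.strip()
    strand = 1
    wrapped = _COMPLEMENT.fullmatch(text)
    if wrapped:
        strand = -1
        text = wrapped.group(1)
    match = _RANGE.fullmatch(text)
    if match is None:
        raise ValueError("unsupported location descriptor: %r" % text)
    lt, start, gt, end = match.groups()
    start = int(start)
    end = int(end) if end is not None else start  # single base: start == end
    return Span(start, end, strand, lt == "<", gt == ">")


for example in ["467", "340..565", "<345..500", "complement(34..126)"]:
    print(example, parse_simple_location(example))
```

Coordinates are kept 1-based and inclusive, as in the table, so the length of a span is `end - start + 1`.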
1 RELEASE 53
The EMBL Nucleotide Sequence Database was frozen to make Release 53 on the
16th December 1997. The release contains 1,917,868 sequence entries comprising
1,281,391,651 nucleotides. This represents an increase of about 8% over Release
52. A breakdown of Release 53 by division is shown below:
Division            Entries   Nucleotides
-----------------   -------   -----------
Bacteriophage          1388       2188305
ESTs                1343796     496603984
Fungi                 18137      44602064
GSSs                 100154      49099107
HTG                    1868     102763872
Human                 74384     139022655
Invertebrates         28126     107524431
Organelles            24715      22870076
Other Mammals         14429      13785092
Other Vertebrates     13145      14653255
Plants                22136      37736590
Patent                91221      29511807
Prokaryotes           42666     102750354
Rodents               37043      46489741
STSs                  51172      17685717
Synthetic              2424       5377292
Unclassified           2380       2387088
Viruses               48684      46340221
-----------------   -------   -----------
Total               1917868    1281391651
-----------------   -------   -----------
hum Human Sequences            |           |
rod Rodent Sequences           |> Mammals  |
mam Other Mammal Sequences     |           |> Vertebrates
vrt Other Vertebrate Sequences             |
STS (Sequence Tagged Sites)
Sequence Tagged Sites (STS) are short DNA segments with a single location in the genome. This feature of STS makes them useful tags for mapping.
EST (Expressed Sequence Tag)
Expressed Sequence Tags (ESTs) are sequences of cDNA which have been reverse-transcribed from mRNA; their function is not necessarily known. They have applications in the discovery of new genes, mapping of various genomes, and identification of coding regions in genomic sequences.
Genome Survey Sequence (GSS)
The Genome Survey Sequence division includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
dbGSS release 030598 - Summary by Organism - March 5, 1998
Number of public entries: 88,416
Homo sapiens (human)              64,533
Arabidopsis thaliana              22,511
Trypanosoma brucei brucei            455
Trypanosoma brucei rhodesiense       324
Cryptosporidium parvum               200
Mus musculus                         163
Rhodobacter sphaeroides              151
Enterococcus faecalis                 41
Helicobacter pylori                   21
Brugia malayi                         11
Leishmania major                       6
High-Throughput Genomic Sequences
The High Throughput Genomic (HTG) Sequences division was created to accommodate a growing need to make 'unfinished' genomic sequence data rapidly available to the scientific community. It was done in a coordinated effort between the three International Nucleotide Sequence databases: DDBJ, EMBL, and GenBank. The HTG division contains 'unfinished' DNA sequences generated by the high-throughput sequencing centers. Sequence data in this division are available for BLAST homology searches against either the "htgs" database or the "month" database, which includes all new submissions for the prior month. The HTG division of GenBank was described in a Genome Research (1997) 7(10) article by Ouellette and Boguski.
Location of HTG records:
Unfinished HTG sequences containing contigs greater than 2 kb are assigned an accession number and deposited in the HTG division. A typical HTG record might consist of all the first pass sequence data generated from a single cosmid, BAC, YAC, or P1 clone which together comprise more than 2 kb and contain one or more gaps. A single accession number is assigned to this collection of sequences and each record includes a clear indication of the status (phase 1 or 2) plus a prominent warning that the sequence data is "unfinished" and may contain errors. The accession number does not change as sequence records are updated; only the most recent version of a HTG record remains in GenBank. 'Finished' HTG sequences (phase 3) retain the same accession number, but are moved into the relevant primary GenBank division. An example of a submission (one accession number) that has progressed through phase 1, phase 2, and phase 3 is available
TIGR Microbial
Database
A listing of microbial genomes that have been
published or are in the process of being sequenced.
Published microbial genomes
|
Link |
Genome |
Strain |
Domain |
Size (Mb) |
Institution |
Funding |
Publication |
1 |
KW20 |
1.83 |
TIGR |
TIGR |
||||
2 |
G-37 |
|
0.58
|
TIGR |
||||
3 |
Methanococcus jannaschii |
DSM 2661 |
1.66 |
TIGR |
||||
4 |
Synechocystis sp. |
PCC 6803 |
|
3.57
|
|
|||
5 |
Mycoplasma pneumoniae |
M129 |
0.81 |
|
||||
6 |
Saccharomyces cerevisiae |
S288C |
13 |
International Consortium |
EC, NHGRI, Wellcome Trust, McGill U., RIKEN |
|||
7 |
Helicobacter pylori |
26695 |
1.66 |
TIGR |
TIGR |
|||
8 |
Escherichia coli |
K-12 |
|
4.60
|
||||
9 |
Methanobacterium thermoautotrophicum |
delta H |
1.75 |
|||||
10 |
Bacillus subtilis |
168 |
4.20 |
International Consortium |
||||
11 |
Archaeoglobus fulgidus |
VC-16, DSM4304 |
2.18 |
TIGR |
||||
12 |
B31 |
1.44 |
TIGR |
Mathers Foundation |
Microbial genomes in progress
|
Genome |
Strain |
Domain |
Size (Mb) |
Institution |
Funding |
Anticipated Publication |
|
Actinobacillus actinomycetemcomitans |
|
2.2 |
|
|||
|
Aquifex aeolicus |
VF5 |
B |
1.50 |
|
manuscript submitted |
|
|
Houston 1 |
2.00 |
1999 |
||||
|
Caulobacter crescentus |
|
3.80 |
TIGR |
|
||
|
Chlamydia pneumoniae |
|
B |
1.00 |
TIGR |
|
|
|
Chlamydia trachomatis mouse pneumonitis |
|
B |
1.00 |
TIGR |
|
|
|
serovar D (D/UW-3/Cx) |
1.05 |
|
||||
|
Chlorobium tepidum |
|
2.10 |
TIGR |
|
||
|
ATCC 824 |
4.1 |
|
||||
|
R1 |
3.00 |
TIGR |
1998 |
|||
|
Dehalococcoides ethenogenes |
|
? |
TIGR |
|
||
|
Desulfovibrio vulgaris |
|
1.70 |
TIGR |
|
||
|
|
3.00 |
TIGR |
1999 |
|||
|
Francisella tularensis |
schu 4 |
2.00 |
European & North American consortium |
|
|
|
|
Halobacterium sp. |
NRC-1 |
2.50 |
University of Massachusetts / University
of Washington |
|
|
|
|
Halobacterium salinarium |
|
4.0 |
Max-Planck-Institute for Biochemistry |
|
|
|
|
Legionella pneumophila |
|
4.10 |
TIGR |
|
|
|
|
Mycobacterium avium |
4.70 |
TIGR |
2000 |
|||
|
CSU#93 (clinical isolate) |
4.40 |
TIGR |
1998 |
|||
|
H37Rv (lab strain) |
4.40 |
|
||||
|
Mycoplasma mycoides subsp. mycoides SC |
PG1 |
1.28 |
The Royal Institute of Technology,
Stockholm & The National Veterinary Institute, Uppsala |
|
1999 |
|
|
|
2.20 |
|
||||
|
MC58 |
2.30 |
TIGR |
TIGR |
|
||
|
serogroup A strain Z2491 |
2.30 |
|
||||
|
Plasmodium falciparum Chr1 (isolate 3D7) |
|
0.8 |
|
|||
|
Plasmodium falciparum Chr2 (isolate 3D7) |
|
1.00 |
TIGR / NMRI |
1998 |
||
|
Plasmodium falciparum Chr3 (isolate 3D7) |
|
1.20 |
|
|||
|
Plasmodium falciparum Chr4 (isolate 3D7) |
|
1.5 |
|
|||
|
Plasmodium falciparum Chr9 (isolate 3D7) |
|
1.80 |
TIGR / NMRI |
|
|
|
|
Plasmodium falciparum Chr10 (isolate 3D7) |
|
2.10 |
TIGR / NMRI |
|
|
|
|
Plasmodium falciparum Chr12 (isolate 3D7) |
|
2.4 |
|
|||
|
Plasmodium falciparum Chr14 (isolate 3D7) |
|
3.4 |
TIGR / NMRI |
|
||
|
Porphyromonas gingivalis |
W83 |
2.20 |
TIGR / Forsyth
Dental Center |
1999 |
||
|
PAO1 |
5.90 |
|
||||
|
Pseudomonas putida |
|
5.00 |
TIGR |
|
||
|
Pyrobaculum aerophilum |
|
2.22 |
Caltech
/ UCLA |
|
||
|
Pyrococcus furiosus |
|
2.10 |
Center of Marine Biotechnology / Univ.
Utah |
|
||
|
OT3 |
1.80 |
|
|
|||
|
SB1003 |
1.80 |
University of Chicago |
|
1998 |
||
|
Madrid E |
1.10 |
1998 |
||||
|
Salmonella typhimurium |
|
4.50 |
TIGR |
|
|
|
|
Shewanella putrefaciens |
MR-1 |
4.50 |
TIGR |
|
||
|
Staphylococcus aureus |
|
2.80 |
TIGR |
NIAID
/ MGRI |
|
|
|
type 4 |
2.20 |
TIGR |
TIGR / NIAID
/ MGRI |
1998 |
||
|
|
1.98 |
|
||||
|
A3(2) |
8.0 |
|
||||
|
|
3.05 |
Canadian & European Consortium |
|
|
||
|
MSB8 |
1.80 |
TIGR |
1998 |
|||
|
Thiobacillus ferrooxidans |
|
2.90 |
TIGR |
|
||
|
Treponema denticola |
|
3.00 |
TIGR / Univ. Texas |
|
|
|
|
Nichols |
1.14 |
TIGR / Univ. Texas |
manuscript in preparation |
|||
|
Thermoplasma acidophilum |
|
1.7 |
Max-Planck-Institute for Biochemistry |
|
|
|
|
Ureaplasma urealyticum |
serovar 3 |
0.75 |
U. Alabama / PE-ABI |
PE-ABI / NIH / UAB |
1998 |
|
|
serotype O1, Biotype El Tor, strain N16961 |
2.50 |
TIGR |
1998 |
|||
|
8.1.b clone 9.a.5.c |
2.00 |
Brazilian Consortium |
2000 |
KEY
Domain |
A: Archaea |
B: Eubacteria |
E: Eucaryote |
Map of Mycoplasma genitalium genome.
The flow of genetic information
DNA -> RNA -> protein -> conformation
Translation products of DNA - Amino acids in three letter code
ValArgIleArgIleSerAsp
TyrGlyPheGlyPheArgMet
ThrAspSerAspPheGlyCys
5' GUACGGAUUCGGAUUUCGGAUGC 3'
3' CAUGCCUAAGCCUAAAGCCUACG 5'
TyrProAsnProAsnArgIle
ValSerGluSerLysProHis
ArgIleArgIleGluSerAla
Amino acids in one letter code
V R I R I S D
Y G F G F R M
T D S D F G C
5' GUACGGAUUCGGAUUUCGGAUGC 3'
3' CAUGCCUAAGCCUAAAGCCUACG 5'
Y P N P N R I
V S E S K P H
R I R I E S A
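The six reading frames shown above can be reproduced with a short script. What follows is a minimal Python sketch, not part of the original course material: the function names are illustrative, and the codon table is built from the standard genetic code. The display above writes the three reverse-strand translations right to left (under the complementary strand); the sketch reports them in the usual N-terminal to C-terminal direction instead.

```python
# Standard genetic code, with codons ordered U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(rna):
    """Return the complementary strand, written 5' to 3'."""
    return rna.translate(str.maketrans("ACGU", "UGCA"))[::-1]


def translate(rna, frame):
    """Translate one reading frame (0, 1 or 2); '*' marks a stop codon."""
    return "".join(
        CODON_TABLE[rna[i:i + 3]] for i in range(frame, len(rna) - 2, 3)
    )


def six_frames(rna):
    """Three frames of the given strand, then three of its reverse complement."""
    rc = reverse_complement(rna)
    return [translate(strand, f) for strand in (rna, rc) for f in range(3)]


for protein in six_frames("GUACGGAUUCGGAUUUCGGAUGC"):
    print(protein)
```

Reading each of the last three results backwards gives the lower three lines of the one-letter display.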
Three- and one-letter codes of the amino acids.
Alanine Ala A
Arginine Arg R
Asparagine Asn N
Aspartate Asp D
Cysteine Cys C
Glutamate Glu E
Glutamine Gln Q
Glycine Gly G
Histidine His H
Isoleucine Ile I
Leucine Leu L
Lysine Lys K
Methionine Met M
Phenylalanine Phe F
Proline Pro P
Serine Ser S
Threonine Thr T
Tryptophan Trp W
Tyrosine Tyr Y
Valine Val V
4. THE GENETIC CODE
UUU Phe   UCU Ser   UAU Tyr   UGU Cys
UUC Phe   UCC Ser   UAC Tyr   UGC Cys
UUA Leu   UCA Ser   UAA Stop  UGA Stop
UUG Leu   UCG Ser   UAG Stop  UGG Trp
CUU Leu   CCU Pro   CAU His   CGU Arg
CUC Leu   CCC Pro   CAC His   CGC Arg
CUA Leu   CCA Pro   CAA Gln   CGA Arg
CUG Leu   CCG Pro   CAG Gln   CGG Arg
AUU Ile   ACU Thr   AAU Asn   AGU Ser
AUC Ile   ACC Thr   AAC Asn   AGC Ser
AUA Ile   ACA Thr   AAA Lys   AGA Arg
AUG Met   ACG Thr   AAG Lys   AGG Arg
GUU Val   GCU Ala   GAU Asp   GGU Gly
GUC Val   GCC Ala   GAC Asp   GGC Gly
GUA Val   GCA Ala   GAA Glu   GGA Gly
GUG Val   GCG Ala   GAG Glu   GGG Gly
Table I. The genetic code
Sequence symbols: Nucleotides
Symbol Meaning Complement
A A T
C C G
G G C
T/U T A
M A or C K
R A or G Y
W A or T W
S C or G S
Y C or T R
K G or T M
V A or C or G B
H A or C or T D
D A or G or T H
B C or G or T V
X/N G or A or T or C X
. not G or A or T or C .
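The Complement column follows from the Meaning column: complementing every base a symbol stands for gives exactly the bases of its complement symbol. The Python sketch below checks this and uses the table to reverse-complement a sequence containing ambiguity symbols. It is an illustration only; the names are made up for this example, and U, X and "." are left out to keep it short.

```python
# Bases each symbol stands for (the Meaning column of the table above).
MEANING = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Complement symbol for each symbol (the Complement column).
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S", "Y": "R", "K": "M",
    "V": "B", "H": "D", "D": "H", "B": "V", "N": "N",
}


def reverse_complement(seq):
    """Reverse-complement a DNA sequence that may contain ambiguity symbols."""
    return "".join(COMPLEMENT[symbol] for symbol in reversed(seq.upper()))


def complement_is_consistent(symbol):
    """Check that complementing each possible base gives the complement symbol."""
    complemented = {COMPLEMENT[base] for base in MEANING[symbol]}
    return complemented == set(MEANING[COMPLEMENT[symbol]])


print(reverse_complement("ATGCRY"))
print(all(complement_is_consistent(s) for s in MEANING))
```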
Deviations from the standard genetic code
# Ciliate protozoa
UAA = Gln:Q
UAG = Gln:Q
# Yeast mitochondria
UGA = Trp:W
CUU = Thr:T
CUC = Thr:T
CUA = Thr:T
CUG = Thr:T
AUA = Met:M
# Mammalian mitochondria
UGA = Trp:W
AUU = Ile:I
AUC = Ile:I
AUA = Met:M
AGA = * :*
AGG = * :*
# Drosophila mitochondria
UGA = Trp:W
AUU = Ile:I
AUA = Met:M
AGA = Ser:S
AGG = Ser:S
# Mycoplasma
UGA = Trp:W
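One way to work with these deviations is to keep the standard code as a base table and apply a small set of overrides per organism or organelle. The following Python sketch does this for the deviations listed above. The dictionary keys such as `mammalian_mito` are names invented for this example, and entries that do not differ from the standard code (AUU and AUC = Ile) are omitted.

```python
# Standard genetic code, with codons ordered U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Codons whose meaning differs from the standard code ('*' = stop).
DEVIATIONS = {
    "standard": {},
    "ciliate": {"UAA": "Q", "UAG": "Q"},
    "yeast_mito": {"UGA": "W", "CUU": "T", "CUC": "T", "CUA": "T",
                   "CUG": "T", "AUA": "M"},
    "mammalian_mito": {"UGA": "W", "AUA": "M", "AGA": "*", "AGG": "*"},
    "drosophila_mito": {"UGA": "W", "AUA": "M", "AGA": "S", "AGG": "S"},
    "mycoplasma": {"UGA": "W"},
}


def translate(rna, code="standard"):
    """Translate frame 0 of an RNA sequence using the named genetic code."""
    table = {**STANDARD, **DEVIATIONS[code]}
    return "".join(table[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


for code in DEVIATIONS:
    print(code, translate("AUGUGAAGAUAA", code))
```

The same RNA therefore gives different proteins depending on the code: UGA is read as a stop codon in the standard code but as tryptophan in mitochondria and mycoplasma.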
Codon usage for enteric bacterial (highly expressed) genes 7/19/83
AmAcid Codon Number /1000 Fraction
Gly GGG 13.00 1.89 0.02
Gly GGA 3.00 0.44 0.00
Gly GGU 365.00 52.99 0.59
Gly GGC 238.00 34.55 0.38
Glu GAG 108.00 15.68 0.22
Glu GAA 394.00 57.20 0.78
Asp GAU 149.00 21.63 0.33
Asp GAC 298.00 43.26 0.67
Val GUG 93.00 13.50 0.16
Val GUA 146.00 21.20 0.26
Val GUU 289.00 41.96 0.51
Val GUC 38.00 5.52 0.07
Ala GCG 161.00 23.37 0.26
Ala GCA 173.00 25.12 0.28
Ala GCU 212.00 30.78 0.35
Ala GCC 62.00 9.00 0.10
Arg AGG 1.00 0.15 0.00
Arg AGA 0.00 0.00 0.00
Ser AGU 9.00 1.31 0.03
Ser AGC 71.00 10.31 0.20
Lys AAG 111.00 16.11 0.26
Lys AAA 320.00 46.46 0.74
Asn AAU 19.00 2.76 0.06
Asn AAC 274.00 39.78 0.94
Met AUG 170.00 24.68 1.00
Ile AUA 1.00 0.15 0.00
Ile AUU 70.00 10.16 0.17
Ile AUC 345.00 50.09 0.83
Thr ACG 25.00 3.63 0.07
Thr ACA 14.00 2.03 0.04
Thr ACU 130.00 18.87 0.35
Thr ACC 206.00 29.91 0.55
Trp UGG 55.00 7.98 1.00
End UGA 0.00 0.00 0.00
Cys UGU 22.00 3.19 0.49
Cys UGC 23.00 3.34 0.51
End UAG 0.00 0.00 0.00
End UAA 0.00 0.00 0.00
Tyr UAU 51.00 7.40 0.25
Tyr UAC 157.00 22.79 0.75
Leu UUG 18.00 2.61 0.03
Leu UUA 12.00 1.74 0.02
Phe UUU 51.00 7.40 0.24
Phe UUC 166.00 24.10 0.76
Ser UCG 14.00 2.03 0.04
Ser UCA 7.00 1.02 0.02
Ser UCU 120.00 17.42 0.34
Ser UCC 131.00 19.02 0.37
Arg CGG 1.00 0.15 0.00
Arg CGA 2.00 0.29 0.01
Arg CGU 290.00 42.10 0.74
Arg CGC 96.00 13.94 0.25
Gln CAG 233.00 33.83 0.86
Gln CAA 37.00 5.37 0.14
His CAU 18.00 2.61 0.17
His CAC 85.00 12.34 0.83
Leu CUG 480.00 69.69 0.83
Leu CUA 2.00 0.29 0.00
Leu CUU 25.00 3.63 0.04
Leu CUC 38.00 5.52 0.07
Pro CCG 190.00 27.58 0.77
Pro CCA 36.00 5.23 0.15
Pro CCU 19.00 2.76 0.08
Pro CCC 1.00 0.15 0.00
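The two derived columns can be recomputed from the Number column. "/1000" is the codon count per thousand codons in the whole data set, and "Fraction" is the share of a codon among the codons for the same amino acid. The Python sketch below recomputes both for the Gly and Glu rows. The total of 6888 codons is obtained by summing the Number column of the table above; it is not stated in the table itself, so treat it as an assumption of this example.

```python
from collections import defaultdict

# (amino acid, codon, count) for the Gly and Glu rows of the table above.
ROWS = [
    ("Gly", "GGG", 13), ("Gly", "GGA", 3), ("Gly", "GGU", 365), ("Gly", "GGC", 238),
    ("Glu", "GAG", 108), ("Glu", "GAA", 394),
]
TOTAL_CODONS = 6888  # assumption: sum of the Number column over all 64 codons


def codon_usage(rows, total_codons):
    """Return {codon: (per-thousand frequency, fraction within amino acid)}."""
    per_amino_acid = defaultdict(int)
    for amino_acid, _codon, count in rows:
        per_amino_acid[amino_acid] += count
    return {
        codon: (
            round(1000 * count / total_codons, 2),
            round(count / per_amino_acid[amino_acid], 2),
        )
        for amino_acid, codon, count in rows
    }


for codon, (per_thousand, fraction) in codon_usage(ROWS, TOTAL_CODONS).items():
    print(codon, per_thousand, fraction)
```

The strong preference for GGU and GGC over GGG and GGA is typical of the codon bias in highly expressed genes that the table illustrates.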
The SWISS-PROT Protein Sequence Data Bank is a database of protein sequences
produced collaboratively by Amos Bairoch (University of Geneva) and the EBI. It
contains high-quality annotation, is non-redundant, and is cross-referenced to
many other databases.
Release 35.0 of
SWISS-PROT contains 69'113 sequence entries, comprising 25'083'768 amino acids
abstracted from 59'101 references.
SWISS-PROT is
accompanied by TREMBL, a computer-annotated supplement to SWISS-PROT. TREMBL
contains the translations of all coding sequences (CDS) present in the EMBL
Nucleotide Sequence Database not yet integrated into SWISS-PROT. TREMBL can be
considered as a preliminary section of SWISS-PROT as all TREMBL entries have
been assigned SWISS-PROT accession numbers. TREMBL is split into two main sections;
SP-TREMBL and REM-TREMBL. SP-TREMBL (SWISS-PROT TREMBL) contains the entries
which should eventually be incorporated into SWISS-PROT.
Release 5 of
TREMBL is created from release 53 of the EMBL nucleotide sequence database and
contains 166'361 sequence entries, comprising 45'671'684 amino acids.
Swissprot entry
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DT   01-NOV-1986 (REL. 03, CREATED)
DT   01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
DT   01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
GN   PRNP.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86300093.
RA   KRETZSCHMAR H.A., STOWRING L.E., WESTAWAY D., STUBBLEBINE W.H.,
RA   PRUSINER S.B., DEARMOND S.J.;
RL   DNA 5:315-324(1986).
RN   [2]
RP   SEQUENCE OF 8-253 FROM N.A.
RX   MEDLINE; 86261778.
RA   LIAO Y.-C.J., LEBO R.V., CLAWSON G.A., SMUCKLER E.A.;
RL   SCIENCE 233:364-367(1986).
RN   [3]
RP   VARIANT AMYLOID GSS, SEQUENCE OF 58-85 AND 111-150.
RX   MEDLINE; 91160504.
RA   TAGLIAVINI F., PRELLI F., GHISO J., BUGIANI O., SERBAN D.,
RA   PRUSINER S.B., FARLOW M.R., GHETTI B., FRANGIONE B.;
RL   EMBO J. 10:513-519(1991).
RN   [4]
RP   REVIEW ON VARIANTS.
RX   MEDLINE; 93372867.
RA   PALMER M.S., COLLINGE J.;
RL   HUM. MUTAT. 2:168-173(1993).
RN   [5]
RP   REVIEW ON VARIANTS.
RX   MEDLINE; 94029646.
RA   PRUSINER S.B.;
RL   ARCH. NEUROL. 50:1129-1153(1993).
RN   [6]
RP   VARIANT GSS LEU-102.
RX   MEDLINE; 89159432.
RA   HSIAO K., BAKER H.F., CROW T.J., POULTER M., OWEN F.,
RA   TERWILLIGER J.D., WESTAWAY D., OTT J., PURSINER S.B.;
RL   NATURE 338:342-345(1989).
RN   [7]
RP   VARIANTS LEU-102; VAL-117 AND VAL-129.
RX   MEDLINE; 89392018.
RA   DOH-URA K., TATEISHI J., SASAKI H., KITAMOTO T., SAKAKI Y.;
RL   BIOCHEM. BIOPHYS. RES. COMMUN. 163:974-979(1989).
RN   [8]
RP   VARIANT FFI ASN-178.
RX   MEDLINE; 92195483.
RA   MEDORI R., MONTAGNA P., TRITSCHLER H.J., LEBLANC A., CORTELLI P.,
RA   TINUPER P., LUGARESI E., GAMBETTI P.;
RL   NEUROLOGY 42:669-670(1992).
RN   [9]
RP   VARIANT CJD ASN-178.
RX   MEDLINE; 91124933.
RA   GOLDFARB L.G., HALTIA M., BROWN P., NIETO A., KOVANEN J.,
RA   MCCOMBIE W.R., TRAPP S., GAJDUSEK D.C.;
RL   LANCET 337:425-425(1991).
RN   [10]
RP   VARIANT CJD LYS-200.
RX   MEDLINE; 90355709.
RA   GOLDFARB L., MITROVA E., BROWN P., TOH B.K., GAJDUSEK D.C.;
RL   LANCET 336:514-515(1990).
RN   [11]
RP   VARIANT GSS ARG-217.
RX   MEDLINE; 93250977.
RA   HSIAO K., DLOUHY S.R., FARLOW M.R., CASS C., DA COSTA M.,
RA   CONNEALLY P.M., HODES M.E., GHETTI B., PRUSINER S.B.;
RL   NAT. GENET. 1:68-71(1992).
RN   [12]
RP   VARIANTS CJD ILE-180 AND ARG-223.
RX   MEDLINE; 93213314.
RA   KITAMOTO T., OHTA M., DOH-URA K., HITOSHI S., TERAO Y., TATEISHI J.;
RL   BIOCHEM. BIOPHYS. RES. COMMUN. 191:709-714(1993).
RN   [13]
RP   VARIANT CJD ILE-210.
RX   MEDLINE; 94071412.
RA   POCCHIARI M., SALVATORE M., CUTRUZZOLA F., GENUARDI M.,
RA   ALLCATELLI C.T., MASULLO C., MACCHI G., ALEMA G., GALGANI S., XI Y.G.,
RA   PETRAROLI R., SILVESTRINI M.C., BRUNORI M.;
RL   ANN. NEUROL. 34:802-807(1993).
RN   [14]
RP   VARIANT GSS LEU-105.
RX   MEDLINE; 94077414.
RA   YAMADA M., ITOH Y., FUJIGASAKI H., NARUSE S., KANEKO K., KITAMOTO T.,
RA   TATEISHI J., OTOMO E., HAYAKAWA M., TANAKA J., MATSUSHITA M.,
RA   MIYATAKE T.;
RL   NEUROLOGY 43:2723-2724(1993).
RN   [15]
RP   VARIANT GSS LEU-105.
RX   MEDLINE; 95213742.
RA   ITOH Y., YAMADA M., HAYAKAWA M., SHOZAWA T., TANAKA J., MATSUSHITA M.,
RA   KITAMOTO T., TATEISHI J., OTOMO E.;
RL   J. NEUROL. SCI. 127:77-86(1994).
RN   [16]
RP   VARIANT CJD LYS-200.
RX   MEDLINE; 94142912.
RA   INOUE I., KITAMOTO T., DOH-URA K., SHII H., GOTO I., TATEISHI J.;
RL   NEUROLOGY 44:299-301(1994).
RN   [17]
RP   VARIANT CJD LYS-200.
RX   MEDLINE; 94316708.
RA   GABIZON R., ROSENMAN H., MEINER Z., KAHANA I., KAHANA E., SHUGART Y.,
RA   OTT J., PRUSINER S.B.;
RL   PHILOS. TRANS. R. SOC. LOND., B, BIOL. SCI. 343:385-390(1994).
RN   [18]
RP   VARIANT GSS LEU-102.
RX   MEDLINE; 95303274.
RA   YOUNG K., JONES C.K., PICCARDO P., LAZZARINI A., GOLBE L.I.,
RA   ZIMMERMAN T.R., DICKSON D.W., MCLACHLAN D.C., ST GEORGE-HYSLOP P.,
RA   LENNOX A.;
RL   NEUROLOGY 45:1127-1134(1995).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
CC       HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
CC       "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND
CC       ANIMALS INFECTED WITH NEURODEGENERATIVE DISEASES KNOWN AS
CC       TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION DISEASES, LIKE:
CC       CREUTZFELDT-JACOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME
CC       (GSS), FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE
CC       IN SHEEP AND GOAT; BOVINE SPONGIFORM ENCEPHALOPATHY (BSE) IN
CC       CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME); CHRONIC WASTING
CC       DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM
CC       ENCEPHALOPATHY (FSE) IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY
CC       (EUE) IN NYALA AND GREATER KUDU. THE PRION DISEASES ILLUSTRATE
CC       THREE MANIFESTATIONS OF CNS DEGENERATION: (1) INFECTIOUS (2)
CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME, CWD, BSE, FSE,
CC       EUE ARE ALL THOUGHT TO OCCUR AFTER CONSUMPTION OF PRION-INFECTED
CC       FOODSTUFFS.
CC   -!- DISEASE: CJD OCCURS PRIMARILY AS A SPORADIC DISORDER (1 PER
CC       MILLION), WHILE 10-15% ARE FAMILIAL. ACCIDENTAL TRANSMISSION OF
CC       CJD TO HUMANS APPEARS TO BE IATROGENIC (CONTAMINATED HUMAN GROWTH
CC       HORMONE (HGH), CORNEAL TRANSPLANTATION, ELECTROENCEPHALOGRAPHIC
CC       ELECTRODE IMPLANTATION. . .). EPIDEMIOLOGIC STUDIES HAVE FAILED TO
CC       IMPLICATE THE INGESTION OF INFECTED ANNIMAL MEAT IN THE
CC       PATHOGENESIS OF CJD IN HUMAN. THE TRIAD OF MICROSCOPIC FEATURES
CC       THAT CHARACTERIZE THE PRION DISEASES CONSISTS OF (1) SPONGIFORM
CC       DEGENERATION OF NEURONS, (2) SEVERE ASTROCYTIC GLIOSIS THAT OFTEN
CC       APPEARS TO BE OUT OF PROPORTION TO THE DEGREE OF NERF CELL LOSS,
CC       AND (3) AMYLOID PLAQUE FORMATION. CJD IS CHARACTERIZED BY
CC       PROGRESSIVE DEMENTIA AND MYOCLONIC SEIZURES, AFFECTING ADULTS IN
CC       MID-LIFE. SOME PATIENTS PRESENT SLEEP DISORDERS, ABNORMALITIES OF
CC       HIGH CORTICAL FUNCTION, CEREBELLAR AND CORTICOSPINAL DISTURBANCES.
CC       THE DISEASE ENDS IN DEATH AFTER A 3-12 MONTHS ILLNESS.
CC   -!- DISEASE: GSS IS A HETEROGENEOUS DISORDER AND WAS DEFINED AS A
CC       "SPINOCEREBELLAR ATAXIA WITH DEMENTIA AND PLAQUELIKE DEPOSITS".
CC       GSS INCIDENCE IS LESS THAN 2 PER 100 MILLION.
CC   -!- DISEASE: KURU IS TRANSMITTED DURING RITUALISTIC CANNIBALISM, AMONG
CC       NATIVES OF THE NEW GUINEA HIGHLANDS. PATIENTS EXHIBIT VARIOUS
CC       MOVEMENT DISORDERS LIKE CEREBELLAR ABNORMALITIES, RIGIDITY OF THE
CC       LIMBS, AND CLONUS. EMOTIONNAL LABILITY IS PRESENT, AND DEMENTIA IS
CC       CONSPICUOUSLY ABSENT. DEATH USUALLY OCCURS FROM 3 TO 12 MONTH
CC       AFTER ONSET.
CC   -!- SIMILARITY: TO OTHER PRP.
CC   -!- DATABASE: NAME=HotMolecBase; NOTE=PrP entry;
CC       WWW="http://bioinformatics.weizmann.ac.il/hotmolecbase/entries/prp.htm".
DR   EMBL; M13667; G190470; -.
DR   EMBL; M13899; G190468; -.
DR   EMBL; D00015; G220016; -.
DR   PIR; A05017; A05017.
DR   PIR; A24173; A24173.
DR   PIR; S14078; S14078.
DR   MIM; 176640; -.
DR   MIM; 123400; -.
DR   MIM; 137440; -.
DR   MIM; 245300; -.
DR   MIM; 600072; -.
DR   PROSITE; PS00291; PRION_1; 1.
DR   PROSITE; PS00706; PRION_2; 1.
KW   PRION; BRAIN; GLYCOPROTEIN; GPI-ANCHOR; REPEAT; SIGNAL;
KW   POLYMORPHISM; DISEASE MUTATION.
FT   SIGNAL        1     22
FT   CHAIN        23    230       MAJOR PRION PROTEIN.
FT   PROPEP      231    253       REMOVED IN MATURE FORM (BY SIMILARITY).
FT   LIPID       230    230       GPI-ANCHOR (BY SIMILARITY).
FT   CARBOHYD    181    181       PROBABLE.
FT   CARBOHYD    197    197       PROBABLE.
FT   DISULFID    179    214       BY SIMILARITY.
FT   DOMAIN       51     91       5 X 8 AA TANDEM REPEATS OF P-H-G-G-G-W-G-
FT                                Q.
FT   REPEAT       51     59       1.
FT   REPEAT       60     67       2.
FT   REPEAT       68     75       3.
FT   REPEAT       76     83       4.
FT   REPEAT       84     91       5.
FT   VARIANT     102    102       P -> L (IN GSS).
FT   VARIANT     105    105       P -> L (IN GSS).
FT   VARIANT     117    117       A -> V (LINKED TO DEVELOPMENT OF
FT                                DEMENTING GSS).
FT   VARIANT     129    129       M -> V (DETERMINES THE DISEASE PHENOTYPE
FT                                IN PATIENTS WHO HAVE A PRP MUTATION AT
FT                                CODON 178: PATIENTS WITH MET DEVELOP FFI,
FT                                THOSE WITH VAL DEVELOP CJD).
FT   VARIANT     178    178       D -> N (IN FFI AND CJD).
FT   VARIANT     180    180       V -> I (IN CJD).
FT   VARIANT     198    198       F -> S (IN A ATYPICAL FORM OF GSS WITH
FT                                NEUROFIBRILLARY TANGLES).
FT   VARIANT     200    200       E -> K (IN CJD).
FT   VARIANT     210    210       V -> I (IN CJD).
FT   VARIANT     217    217       Q -> R (IN GSS WITH NEUROFIBRILLARY
FT                                TANGLES).
FT   VARIANT     232    232       M -> R (IN CJD).
FT   CONFLICT    118    118       MISSING (IN REF. 2).
SQ   SEQUENCE   253 AA;  27661 MW;  FD5373AD CRC32;
     MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP
     HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA
     VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV
     NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV
     ILLISFLIFL IVG
//
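Because every line of a Swiss-Prot entry starts with a two-letter line code (ID, AC, DE, FT, SQ, ...), the flat file is easy to process with a script. The Python sketch below is an illustration only: the function name is made up, the embedded record is an abbreviated copy of the entry above, and the layout assumption is that the content of a line starts in column 6. It collects the lines per code, joins the sequence, and checks that the wild-type residues named in two VARIANT lines are found at the stated positions.

```python
RECORD = """\
ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
OS   HOMO SAPIENS (HUMAN).
FT   VARIANT     102    102       P -> L (IN GSS).
FT   VARIANT     200    200       E -> K (IN CJD).
SQ   SEQUENCE   253 AA;  27661 MW;  FD5373AD CRC32;
     MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP
     HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA
     VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV
     NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV
     ILLISFLIFL IVG
//
"""


def parse_entry(text):
    """Return ({line code: [content, ...]}, sequence) for one entry."""
    fields = {}
    residues = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            break
        if in_sequence:                    # sequence lines carry no line code
            residues.append(line.replace(" ", ""))
            continue
        code, content = line[:2], line[5:].strip()
        fields.setdefault(code, []).append(content)
        if code == "SQ":
            in_sequence = True
    return fields, "".join(residues)


fields, sequence = parse_entry(RECORD)
print(fields["AC"][0].rstrip(";"), len(sequence))
for feature in fields["FT"]:
    key, start, _end, wild_type = feature.split()[:4]
    # Positions in the FT table are 1-based.
    print(key, start, wild_type, sequence[int(start) - 1] == wild_type)
```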
PROSITE entries.
Example I
ID   ATP_GTP_A; PATTERN.
AC   PS00017;
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); NOV-1990 (INFO UPDATE).
DE   ATP/GTP-binding site motif A (P-loop).
PA   [AG]-x(4)-G-K-[ST].
CC   /TAXO-RANGE=ABEPV;
3D   1EFM; 1ETU; 1Q21; 2Q21; 4Q21; 5Q21; 6Q21;
DO   PDOC00017;
Example II
ID   ZINC_FINGER_C2H2; PATTERN.
AC   PS00028;
DT   APR-1990 (CREATED); JUN-1994 (DATA UPDATE); NOV-1997 (INFO UPDATE).
DE   Zinc finger, C2H2 type, domain.
PA   C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.
NR   /RELEASE=35,69113;
NR   /TOTAL=1932(412); /POSITIVE=1891(372); /UNKNOWN=6(6); /FALSE_POS=35(34);
NR   /FALSE_NEG=3; /PARTIAL=1;
CC   /TAXO-RANGE=??E?V; /MAX-REPEAT=37;
CC   /SITE=1,zinc; /SITE=3,zinc; /SITE=7,zinc; /SITE=9,zinc;
DR   P21192, ACE2_YEAST, T; P07248, ADR1_YEAST, T; P39413, AEF1_DROME, T;
DR   Q00900, AGIE_RAT  , T; P41696, AZF1_YEAST, T; Q01954, BASO_HUMAN, T;
DR   P41182, BCL6_HUMAN, T; P41183, BCL6_MOUSE, T; P55201, BR14_HUMAN, T;
DR   Q01295, BRC1_DROME, T; Q01296, BRC2_DROME, T; Q01293, BRC3_DROME, T;
DR   P10069, BRLA_EMENI, T; Q01713, BTEB_RAT  , T; Q01522, CF23_DROME, T;
DR   P20385, CF2_DROME , T; P19538, CID_DROME , T; Q05620, CREA_ASPNG, T;
DR   Q01981, CREA_EMENI, T; Q08705, CTCF_CHICK, T; P49711, CTCF_HUMAN, T;
DR   P36197, DEFI_CHICK, T; P23792, DISC_DROME, T; P26632, EGR1_BRARE, T;
DR   P18146, EGR1_HUMAN, T; P08046, EGR1_MOUSE, T; P08154, EGR1_RAT  , T;
DR   Q05159, EGR2_BRARE, T; P26633, EGR2_CRILO, T; P26634, EGR2_DUSTH, T;
DR   P11161, EGR2_HUMAN, T; P08152, EGR2_MOUSE, T; P26635, EGR2_POERE, T;
DR   P51774, EGR2_RAT  , T; Q08427, EGR2_XENLA, T; Q06889, EGR3_HUMAN, T;
DR   P43300, EGR3_MOUSE, T; P43301, EGR3_RAT  , T; Q05215, EGR4_HUMAN, T;
.
. ( I edited out a lot of very interesting information here ...)
.
DR   P49782, S3AE_BACSU, F; P24804, TA29_TOBAC, F; P36810, VE6_HPV32 , F;
DR   P15024, VL1_REOVD , F; P03527, VSI3_REOVD, F; P30211, VSI3_REOVJ, F;
DR   P07939, VSI3_REOVL, F; Q93098, WN8B_HUMAN, F; P51028, WNT8_BRARE, F;
DR   P28026, WNT8_XENLA, F; P20201, Y15K_SSV1 , F; P20198, Y5K6_SSV1 , F;
DR   P43558, YFE4_YEAST, F; P37127, YFFG_ECOLI, F; P38890, YH07_YEAST, F;
DR   Q09441, YP83_CAEEL, F;
3D   1ARD; 1ARE; 1ARF; 1PAA; 1ZAA; 1AAY; 2GLI; 1SP1; 1SP2; 1NCS; 1ZFD; 1TF3;
3D   2DRP; 1ZNF; 3ZNF; 4ZNF; 1BBO; 5ZNF; 7ZNF;
DO   PDOC00028;
{PDOC00004}
{PS00004; CAMP_PHOSPHO_SITE}
{BEGIN}
****************************************************************
* cAMP- and cGMP-dependent protein kinase phosphorylation site *
****************************************************************
There has been a number of studies relative to the specificity of cAMP- and
cGMP-dependent protein kinases [1,2,3]. Both types of kinases appear to share
a preference for the phosphorylation of serine or threonine residues found
close to at least two consecutive N-terminal basic residues. It is important
to note that there are quite a number of exceptions to this rule.
-Consensus pattern: [RK](2)-x-[ST]
                    [S or T is the phosphorylation site]
-Last update: June 1988 / First entry.
[ 1] Fremisco J.R., Glass D.B., Krebs E.G.
     J. Biol. Chem. 255:4240-4245(1980).
[ 2] Glass D.B., Smith S.B.
     J. Biol. Chem. 258:14797-14803(1983).
[ 3] Glass D.B., El-Maghrabi M.R., Pilkis S.J.
     J. Biol. Chem. 261:2987-2993(1986).
{END}
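The PA and consensus-pattern lines above use the PROSITE pattern syntax: elements are separated by "-", "x" stands for any residue, "[...]" lists the residues allowed at a position, and "(n)" or "(n,m)" gives a repeat count. Such a pattern maps almost directly onto a regular expression. The Python sketch below is an illustration, not an official PROSITE tool: it handles only the elements just described plus "{...}" (residues not allowed), and the test peptide is invented.

```python
import re

# One pattern element: [ABC], {ABC}, a single residue or x,
# optionally followed by a repeat count (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a PROSITE pattern such as '[AG]-x(4)-G-K-[ST].' to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":                      # any residue
            residue = "."
        elif residue.startswith("{"):           # residues that are NOT allowed
            residue = "[^" + residue[1:-1] + "]"
        if low is not None:                     # repeat count
            residue += "{" + low + ("," + high if high else "") + "}"
        parts.append(residue)
    return "".join(parts)


p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST].")
print(p_loop)
print(prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."))
print(prosite_to_regex("[RK](2)-x-[ST]"))

# Invented peptide containing a P-loop-like motif.
hit = re.search(p_loop, "MKLVGAGGVGKSAL")
print(hit.start() + 1, hit.group())
```

Short patterns such as [RK](2)-x-[ST] match very often by chance, which is one reason the documentation entry warns about exceptions and why hits need to be interpreted with care.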
Enzyme
www.bis.med.jhmi.edu/Dan/proteins/ec-enzyme.html
The current release has 3650 entries and was indexed 01-Mar-1998.
Description
The ENZYME data bank contains the following data for each type of
characterized enzyme for which an EC number has been provided: EC number,
Recommended name, Alternative names, Catalytic activity, Cofactors, Pointers
to the SWISS-PROT entry or entries that correspond to the enzyme, and Pointers
to disease(s) associated with a deficiency of the enzyme.
Literature
Bairoch A. (1993) The ENZYME data bank. Nucleic Acids Res, Jul 1;21(13):3155-6.
Taxonomy database
www3.ncbi.nlm.nih.gov/Taxonomy/tax.html
This is the top level of the taxonomy database maintained by
NCBI/GenBank. You can explore any of the taxa listed below by clicking it.
Archaea
Eubacteria
Eukaryotae
Viroids
Viruses
Other
Unclassified
This is a searchable index. You can enter the name of superspecific taxa
(e.g., Porifera) or the name of a particular organism (e.g., Thalarctos
maritimus for the polar bear or polar bear itself).
Query:
Use query string as: Complete match / Regular expression / Set of tokens
The "Set of tokens" option returns longer names that include the search
terms, e.g., hybrid taxa. See what happens if you query "Bos taurus" using the
"Complete match" option versus the "Set of tokens" option.
These are direct links to some of the organisms most commonly used in
molecular research projects:
Arabidopsis thaliana
Caenorhabditis elegans
Danio rerio (zebrafish)
Drosophila
Escherichia coli
Hepatitis C virus
Homo sapiens
Mus musculus
Mycoplasma
Oryza sativa
Plasmodium falciparum
Pneumocystis carinii
Rattus
Saccharomyces cerevisiae
Schizosaccharomyces pombe
Xenopus laevis
Molecular biology databases – text searches and retrieval of data.
With Internet tools such as
- Entrez
or
- Sequence Retrieval System (SRS)
you can search the annotation section of molecular biology databases using
one or more words and by selecting specific fields in the database. For
instance, to find all insulin proteins in a protein database you can search
the Entrez protein database, using "insulin" as the search word and selecting
"All Fields" or "Protein Name".
Tutorials for Entrez and SRS
http://www3.ncbi.nlm.nih.gov/Entrez/entrezhelp.html
http://srs.ebi.ac.uk:5000/srs5/man/srsman.html
Fields that may be specified in the nucleotide, protein and Pubmed databases of
Entrez:
Nucleotide
All Fields
Accession
Author Name
EC/RN Number
Feature key
Gene Name
Issue
Journal Name
Keyword
Modification date
Organism
Page Number
Primary Accession
Properties
Protein Name
Publication Date
SeqID String
Sequence Length
Substance Name
Text Word
Title Word
Volume
Sequence ID
Protein
All Fields
Accession
Author Name
EC/RN Number
Gene Name
Issue
Journal Name
Keyword
Modification date
Organism
Page Number
Primary Accession
Properties
Protein Name
Publication Date
SeqID String
Sequence Length
Substance Name
Text Word
Title Word
Volume
Sequence ID
Medline
All Fields
Affiliation
Author Name
EC/RN Number
Entrez Date
Issue
Journal Name
Language
MeSH Major Topic
MeSH Terms
Page
Publication Date
Publication Type
Subheading
Substance Name
Text Word
Title Word
Volume
MEDLINE ID
PubMed ID
SRS servers
EMBL, Heidelberg, Germany: www.embl-heidelberg.de:80/srs5/
EMBL-EBI, Hinxton, Cambridge, UK: srs.ebi.ac.uk:5000/srs5/
(http://srs.ebi.ac.uk:5000/srs5/man/srsman.html)
SRSWWW is a World Wide Web interface to the Sequence Retrieval System (SRS).
Compared with other SRS interfaces, SRSWWW has more users because web browsers
are in widespread use and the interface is easy to handle and user friendly.
It supports HTML3 and earlier versions of HTML. Since HTTP is stateless, a
user ID is created to keep the state for the whole session.

An SRS session is started by clicking on the 'Start' button on the SRS home
page. The page that appears next is the 'Select Libraries Page', also referred
to as the 'Top Page'.

Most pages in SRSWWW have six header buttons which link to other SRSWWW pages.
One header button is the 'Top Page' button, which links to the start page of
SRSWWW. The other buttons point to the 'Query Form' page, the 'Query Manager'
page, the 'View Manager' page, the 'Databanks' page and the 'Help' page.
The 'Top Page' is used to select one or more databanks to be searched.
Databanks are selected by clicking their check boxes. In all cases, the
databank names are linked to an information page ('Information about ...'
page), which provides some information about the indices of the respective
databank. The data fields of the databank can be browsed directly (see the
'Browse Index' page). If you want to change databases for another search you
have to go back to the 'Top Page'.

The 'Reset' button can be used to clear all selections made so far. Each
databank can be deselected individually by clicking its check box again.
Clicking the 'Continue' button completes the 'Top Page' and brings up the
'Query Form' page.
On the 'Top Page' all available databanks are collected in groups defined by
SRS.

User-defined databanks can also be included. For example, special databanks
can be created with the search engines FASTA or BLAST. Such a databank has to
be created by running a search (for instance with BLAST) before it can be
selected. An existing sequence can be inserted in the 'Enter query sequence'
field by "cut and paste".
The databank query is defined on the 'Query Form' page. The selected
databank(s) are listed at the top of the page. Which fields are listed depends
on the 'Show only fields that selected databanks have in common' check box on
the 'Top Page' and on the searchable fields of the databank(s).

The 'Reset' button can be used to clear all selections made so far. Each
selection can be cleared individually by clicking its check box again.
Clicking on the 'Do Query' button starts the query and brings up the 'Query
Result' page.
FIELD NAME, QUERY, INCLUDE IN LIST & RETRIEVE SUBENTRIES columns
The 'Field Name' selector lists all data fields of the selected databank which
are linked to an information page. Clicking on "Info" after selecting a 'Field
Name' opens the 'Browse Index' page. The 'Query' column is the input field for
the query. By using the operators "&" (= AND), "!" (= NOT) and "|" (= OR) one
can use more than one search expression in every 'Query' input field. The
'Include fields in output' column is a selection list for including the
corresponding field in the query result.

Some fields (e.g., SeqLength) have additional menus with "greater than",
"greater than or equal to", "less than" and "less than or equal to" symbols
for selecting a range (e.g., seqLength > 100 && seqLength < 500).
If more than one data field is to be searched, the 'Query' input fields can be
combined with the boolean operators 'AND', 'OR' and 'NOT'. Optionally, a
wildcard is appended to every search string to increase the chance of finding
a match.
Additional options are 'Entry list in chunks of', which defines the number of
entries shown per page, and 'Use view', which selects a predefined view with
which the query result is shown (see the 'Create a new view' page for more
information on 'Views'). A new view can be defined before doing a query. Only
the 'Views' that can be used with the selected databank(s) are listed in the
'Use view' box, and one of them can be selected. The default is 'Names Only'
(i.e. without a special 'View').
From the 'Query Form' page the 'Browse Index' page is opened by clicking on a
'Field Name' (e.g., the 'ID' field).

On this page a search string is typed in the open box, which is marked with a
wildcard (the wildcard can be used in the string or deleted), and then
submitted. The matches of the search string in the indicated 'Field Name'
(referred to as "values") are listed on the 'Values in the index of the ...
field' page. In this list 'Values' can be selected, and with the 'Make Query'
button a query can be started immediately with the selected 'Value(s)'.
The 'Query Result' page shows the result of a submitted query. It shows the
query string, the number of "hits", and the entries found, each with a check
box and any included fields (if selected on the 'Query Form' page or in the
view definition). The entries found are listed by the searched databank and
the entry ID. Clicking on an entry opens the 'Single Entry' page. In addition,
entries can be selected with the check boxes. The 'Reset' button can be used
to clear all selections made so far. With the 'Link', 'Receive' and 'View'
buttons further operations can be performed on the selected entries.

Link button: checks whether all or the selected result entries are linked to
other databanks (see the 'Link Page' for more information).
Save button: with the 'Receive' button the selected entries can be downloaded.
Launch desired application for further analysis.
View button: the result entries can be viewed with predefined 'View'
definitions, which are selected from the 'View' box.
The last line contains hypertext links to any additional chunk sets. These
links are present only if the number of entries found is larger than the
defined chunk size. The chunk sets are numbered from 1 to "number of hits"
divided by "entry list chunk size" (selected on the 'Query Form' page). The
user can jump to any additional chunk set via its hypertext link. The current
chunk set is shown within braces. The left angle-bracket hyperlink moves to
the first chunk set, whereas the right angle-bracket hyperlink moves to the
last chunk set.
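As a small worked example of the chunk arithmetic, the Python sketch below computes how many chunk sets a result has and which entries fall into a given chunk set. The function names are invented for this illustration, and rounding the quotient up (so that a final, partly filled chunk set is counted) is an assumption; the SRS manual only says "divided by".

```python
import math


def chunk_count(hits, chunk_size):
    """Number of chunk sets needed to show all hits (assumed to round up)."""
    return math.ceil(hits / chunk_size)


def chunk_bounds(chunk, hits, chunk_size):
    """First and last entry number (1-based) shown in the given chunk set."""
    first = (chunk - 1) * chunk_size + 1
    last = min(chunk * chunk_size, hits)
    return first, last


hits, chunk_size = 95, 30
for chunk in range(1, chunk_count(hits, chunk_size) + 1):
    print(chunk, chunk_bounds(chunk, hits, chunk_size))
```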
Every entry found and listed on the 'Query Result' page has a link to the
complete entry in the databank. Clicking on an entry brings up the 'Single
Entry' page, which shows the complete entry. The complete entry can be saved
or printed (refer to the documentation of your web browser).
From the 'Query Result' page one can go to the 'Link' page by pressing the
'Link' button. The 'Link' page is used to find the links between the entries
found (the set) and the selected databanks. The 'Link' page lists all the
databanks, like the 'Top Page'. The user should select databanks other than
those selected for the query search, and should select one of the three
options of the 'Find all entries' field, which determine which entries are
displayed. Clicking on the 'Continue' button shows the 'Query Result' page. A
link can also be made on the 'Query Manager' page. (For more information about
links refer to the SRS manual.)

The three options of the 'Find all entries' field are:
- in the selected DATABANKS which are linked to the current set: the result
is the entries from the selected databanks that are linked with the set.
- in the current SET which are linked to all selected databanks: the result
is the entries from the set that have links to the selected databanks.
- in the current set which are not linked to any of the selected databanks:
the result is the entries from the set that contain NO links to the selected
databanks.
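The three 'Find all entries' options can be pictured as three ways of reading the same table of links. The Python sketch below is a simplified model, not SRS code: it assumes a single selected databank, and it represents the links as a dictionary from entries in the current set to the linked entries in that databank. The PRIO_HUMAN links are the PROSITE cross-references of the Swiss-Prot entry shown earlier; ENTRY_A and ENTRY_B are hypothetical.

```python
# Links from entries in the current set to entries in the selected databank.
links = {
    "PRIO_HUMAN": {"PS00291", "PS00706"},
    "ENTRY_A": {"PS00017"},   # hypothetical entry with one link
    "ENTRY_B": set(),         # hypothetical entry without links
}


def linked_in_databank(links):
    """Option 1: entries of the selected databank linked to the current set."""
    return set().union(*links.values())


def linked_in_set(links):
    """Option 2: entries of the current set that have links to the databank."""
    return {entry for entry, targets in links.items() if targets}


def unlinked_in_set(links):
    """Option 3: entries of the current set with no links to the databank."""
    return {entry for entry, targets in links.items() if not targets}


print(sorted(linked_in_databank(links)))
print(sorted(linked_in_set(links)))
print(sorted(unlinked_in_set(links)))
```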
The 'Query Manager' has two functions. The first is to store the queries done
so far. The queries are listed in a table. Every query is listed with a check
box, the query name in the form 'Qn' (e.g., Q1, Q2, ...), a type (e.g.,
'query', 'link'), the total number of entries found, the library name(s), the
number of entries for each library, the query expression (SRS query syntax)
and a comment. The user may add a comment of their own to the query. Users
working in HTML3 mode get a table with all these descriptors. In versions
below HTML3 the same information is available in a different style and order.

The second function of the 'Query Manager' is to make further queries and
links. There are three functions to control the queries and three functions to
do further queries and links.
- Save: the selected queries can be downloaded.
- Delete: deletes the selected queries.
- View: can be used to inspect a single query again, or to inspect it with a
different 'View' (choose a 'View' from the 'Using the view' field).
- Link: exactly the same as on the 'Query Result' page; see the detailed
information in the chapter 'The Query Result Page'.
- Combine: combines one or more queries with the logical operators AND, OR or
NOT.
- Expression: the user can directly enter a query like "(Q1 & Q2) ! (Q3 |
Q4) PDB" (i.e. query 1 AND query 2 but NOT query 3 OR query 4 linked to
the PDB database). In contrast to the 'Combine' function, it allows more
complex queries to be built. For more information on using logical operators
and link operators see the SRS manual.
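The logical part of such an expression behaves like set arithmetic on the entries found by each query: "&" is intersection, "|" is union and "!" removes entries. The Python sketch below evaluates "(Q1 & Q2) ! (Q3 | Q4)" on four result sets. Which entry IDs each query returned is made up for the example, and the final link to PDB is left out.

```python
# Hypothetical result sets of four earlier queries (sets of entry IDs).
queries = {
    "Q1": {"P04156", "P01308", "P68871"},
    "Q2": {"P04156", "P01308"},
    "Q3": {"P01308"},
    "Q4": {"P69905"},
}


def combine(q):
    """(Q1 & Q2) ! (Q3 | Q4): in Q1 AND Q2, but NOT in Q3 OR Q4."""
    return (q["Q1"] & q["Q2"]) - (q["Q3"] | q["Q4"])


print(sorted(combine(queries)))
```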
The 'Query Manager' offers two options to control the 'Query Result' page:
'Entry list in chunks of...' and 'Using the view...'. Both options are
described in the chapter about the 'Query Form' page.

The queries done so far are lost when the SRS session is closed. However, it
is sometimes quite useful (especially for complex query runs) to store a
useful query set in a file (referred to as a "history") so that it can be
loaded and used again in a new SRS session. This is done by activating one or
more check boxes on the bottom line, 'Save in histories queries of type'.
The extraction
of information from the databank(s) is done with a query created in the
'Query Form' page. In the 'View Manager' you can define how to look at the
query results. You can use predefined 'Views' or create your own 'Views', and you
can apply them to newly created queries or to existing queries. A set of
'Views' can be defined before any query has been performed, but a
'View' can only be used once a query has been created. The
'View Manager' can do more than format
query sets as described above. It works independently of
the 'Query Form' routine, so you can start complex queries within the
'View Manager' to investigate different relations between databanks and/or
query sets. The following
options are offered in the 'View Manager' page: 'Create a new view', 'Edit
selected view' and 'Delete selected view'. The 'View Manager' lists all the
'Views' defined so far in a table with View Name, Format, Root Libs, Root
Fields, Leaf Libs and Leaf Fields. A query on the selected
database(s) returns a set of entries. 'Views' can help you
see whether these entries have, for example, links to other databases.
'Views' can be configured to show the fields of interest from the set and from
the linked database entries. Predefined 'Views' already exist and can
be used directly or modified for your own purposes. |
|
Click on the
link 'independent View Editor' with the middle or right mouse button and choose
'Open in New Window' in the context menu that appears. A new browser window
opens. In this way you can define several 'Views' at a time, or edit a 'View'
in one window and use it at once in the 'Query Result' page in the second
window. |
In the 'Create A
New View Page' the available databanks are listed in two columns. The
databanks in the first column are called root databanks. The databanks
in the second column are called leaf databanks. The root databanks
are the databanks used for a particular query search. The leaf databanks
are the databanks that are linked to from the result set of the query
search done with the root databanks. For example, the user can define a view
with SwissProt as root databank and Prosite as leaf databank. In the 'View
Select' page, select the fields to be displayed from both databanks.
Choose this as the current view and make a query on SwissProt
(which is the root databank). The result will be SwissProt entries and, for
entries with a link to Prosite, the linked Prosite entries. |
|
In the 'Name of
view' field you can assign a name to the new 'View'. If this field is left blank,
a 'View' name is created automatically from the first root databank
name followed by a view number (e.g., Swissprot_view1, Prosite_view5
etc.). |
|
Selecting
the 'Display view as table' checkbox displays the result as a table
instead of a list. |
|
With this
checkbox it is possible to place the short field names as the header of the
table columns. |
|
This checkbox is
selected by default. So only common fields of the root databanks are
listed. |
|
The 'Reset'
button can be used to deselect all selections done so far. Each field and
checkbox can individually be deselected by clicking again. Clicking on the
'Continue' button finishes the 'Create new view' page and brings up the 'View Field Select' page. |
In this page a
predefined 'View' can be edited. The 'Edit existing view' page offers the
same editing options as the 'Create a new view' page. |
Lists the fields
of all selected root and leaf databanks. Fields can be selected by their
checkboxes and are displayed in the query. The display format of the sequence
field can be selected by the format menu selection (e.g., FASTA, PIR, EMBL,
Protein Chart etc.). The leaf
databank selection has additional options: |
|
A query string
can be inserted to search for special relations between the root and leaf
databank. |
|
The result of
the 'View' can be shown by another 'View'. |
|
Displays only
the number of links that exist, not the links themselves. |
|
To submit the
view one of the four options ('Top page', 'Query form', 'Query manager' or
'View manager') must be clicked. |
The 'Databanks'
page lists all available databases in a table form. Every databank is listed as
a hypertext link and shown with the release version number, the number of
entries, the indexing date, the group (e.g., HomolSearch, Sequence etc.) and an
availability flag. This flag indicates whether the respective databank is
available (ok) or not for searches.
Most of the
information about the databanks and the descriptions of their fields is available
through hypertext links. Help is provided wherever possible in a
context-sensitive way. In addition to providing information, the data fields can be
browsed directly (see the 'Browse Index' page).
Principles of sequence comparison
A common problem in
molecular biology is to compare one sequence to another sequence or to a set of
sequences. Previously we have searched databases for text information and
retrieved entries. Sequence comparison, however, is a less trivial problem, and
in order to understand it a few theoretical considerations are necessary.
Here we will consider
four kinds of sequence comparison:
-
Homology
-
Pattern
-
Multiple sequence alignment
-
Profile
1. Homology
FASTA and BLAST are
programs frequently used to search sequence databases for homology to a query
sequence. Programs of this kind answer questions like: Is my sequence similar
to anything in the database? Did I
finally clone the gene that I've been trying to clone for the last three years?
I seem to have identified a new protein; exactly what is the relationship of
this protein to proteins that have been described previously?
One principle to keep
in mind is that sequences are often searched as proteins although they are
determined as DNA sequences. If you have a DNA sequence you may find it natural
to search a DNA sequence database with this sequence. However, it is often more
sensible to use a protein sequence as the search sequence. That is, if you think
your DNA sequence encodes a protein, you should first try to identify all the
possible translation products of the DNA. Then use the protein sequences to
search a protein database (or a DNA
database in which all possible reading frames are examined, a "translated DNA
database"). Therefore, a common principle is that sequences are determined
as DNA but compared as proteins. The reasons for this are mainly:
-
The genetic code is degenerate. Consider for instance the two codons UCA
and AGU. Both encode serine but are completely unrelated with respect to their
sequence. Therefore, a sequence homology may be significant at
the protein sequence level, but not detectable at the DNA level.
-
Not all amino acid substitutions occur with the same frequency, as
further described below (substitution matrices). This is a useful principle
when comparing proteins and there is no equivalent principle for nucleic acid
sequences.
It is important to search
nucleotide databases with proteins as query sequences because protein databases
are not updated quickly enough. Furthermore, DNA databases contain potential
protein coding regions that have not yet been identified. The only drawback
of searching DNA databases with a protein sequence as query is that it is
more time-consuming, as all six reading frames have to be examined. TFASTA
and tblastn are programs that do this sort of analysis. An even more time-consuming type of search
is one where a DNA sequence is used as query: all six possible reading frames of the query are
used to search the six reading frames of a DNA database (tblastx
is one example of such a program, see below).
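The six reading frames mentioned above can be made concrete with a short
sketch. The Python below is illustrative only and is not how TFASTA or
tblastn are implemented. The function names are hypothetical, the standard
genetic code is assumed, and any codon containing a letter other than
A, C, G, T or U is translated as X.

```python
# Illustrative six-frame translation (a sketch, not the TFASTA/tblastn code).
# Standard genetic code, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def normalize(seq):
    """Upper-case the sequence and treat RNA (U) as DNA (T)."""
    return seq.upper().replace("U", "T")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return normalize(dna).translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame; '*' marks a stop codon."""
    dna = normalize(dna)
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Three frames on the given strand, then three on the reverse strand."""
    strands = (normalize(dna), reverse_complement(dna))
    return [translate(s[offset:]) for s in strands for offset in range(3)]


# The two serine codons from the text share no base at any position,
# yet translate to the same amino acid.
print(CODON_TABLE["TCA"], CODON_TABLE["AGT"])
for frame in six_frames("ATGGCCTAA"):
    print(frame)
```

A protein query compared against a DNA database has to be matched against
all six of these translations for every database entry, which is why such
searches take longer.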
To understand how
programs like FASTA and BLAST work let’s
compare two related amino acid sequences:
seq 1   A L Q L I W G T S I R D K W P G D L
seq 2   A L Q M I W A G G T S I R D P G D L
We can represent them
graphically like this:
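SEQ 2
  A  * . . . . . . . . . . . . . . . . .
  L  . * . * . . . . . . . . . . . . . *
  Q  . . * . . . . . . . . . . . . . . .
  M  . . . . . . . . . . . . . . . . . .
  I  . . . . * . . . . * . . . . . . . .
  W  . . . . . * . . . . . . . * . . . .
  A  * . . . . . . . . . . . . . . . . .
  G  . . . . . . * . . . . . . . . * . .
  G  . . . . . . * . . . . . . . . * . .
  T  . . . . . . . * . . . . . . . . . .
  S  . . . . . . . . * . . . . . . . . .
  I  . . . . * . . . . * . . . . . . . .
  R  . . . . . . . . . . * . . . . . . .
  D  . . . . . . . . . . . * . . . . * .
  P  . . . . . . . . . . . . . . * . . .
  G  . . . . . . * . . . . . . . . * . .
  D  . . . . . . . . . . . * . . . . * .
  L  . * . * . . . . . . . . . . . . . *
     A L Q L I W G T S I R D K W P G D L   SEQ 1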
All positions of amino
acid identity are labeled with a "*" (the GCG program COMPARE in combination with
DOTPLOT produces such a plot). Visual inspection reveals a number of diagonals.
Many computer programs designed to compare sequences identify such diagonals
and try to combine them into a more extensive alignment. In our example:
seq 1   A L Q L I W - - G T S I R D K W P G D L
        | | |   | |     | | | | | |     | | | |
seq 2   A L Q M I W A G G T S I R D - - P G D L
Thus, for a maximal
alignment we have created gaps in each sequence.
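The dot matrix and its diagonals can be reproduced with a few lines of
Python. This is a minimal sketch under stated assumptions: the function
names are hypothetical, only exact identities are marked, and a "diagonal"
is taken to be a run of at least two consecutive identities. It is not the
algorithm used by COMPARE, FASTA or BLAST.

```python
# Minimal dot-matrix sketch for the two example sequences in the text.
SEQ1 = "ALQLIWGTSIRDKWPGDL"
SEQ2 = "ALQMIWAGGTSIRDPGDL"


def dot_matrix(seq1, seq2):
    """Rows follow seq2, columns follow seq1; True marks an identity."""
    return [[a == b for b in seq1] for a in seq2]


def diagonals(seq1, seq2, min_len=2):
    """Return (start in seq1, start in seq2, length) for each run of
    consecutive identities along a diagonal (0-based positions)."""
    runs = []
    for d in range(-(len(seq2) - 1), len(seq1)):  # d = column - row
        i = max(0, -d)   # position in seq2 (row)
        j = i + d        # position in seq1 (column)
        run = 0
        while i < len(seq2) and j < len(seq1):
            if seq2[i] == seq1[j]:
                run += 1
            else:
                if run >= min_len:
                    runs.append((j - run, i - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:
            runs.append((j - run, i - run, run))
    return runs


def print_dot_plot(seq1, seq2):
    """Print the matrix with '*' for identity and '.' otherwise."""
    for a in seq2:
        print(a, " ".join("*" if a == b else "." for b in seq1))
    print(" ", " ".join(seq1))


print_dot_plot(SEQ1, SEQ2)
for start1, start2, length in sorted(diagonals(SEQ1, SEQ2)):
    print(SEQ1[start1:start1 + length], start1 + 1, start2 + 1)
```

The runs that lie on different diagonals (here GTSIRD is shifted by two
positions relative to ALQ, IW and PGDL) are exactly the places where an
alignment program has to introduce gaps to join them.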
A program like FASTA
searches a database for sequences homologous to a query sequence using the
principle outlined above. FASTA uses parameters such as:
ktup or wordsize
-
Length of initial peptide match, default is 2, i.e. the program starts
identifying a diagonal by extending a dipeptide match. You may prefer to use
ktup=1 if you want a more sensitive search. (This type of search takes a longer
time, though!)
Gap penalty
Insertions and deletions are relatively rare
in evolution, and a database search
program is slow when you instruct it to examine every possible diagonal. For
these reasons a penalty is assigned to gaps. The following two parameters are
frequently used in homology programs (such as the GCG programs GAP, BESTFIT and
FASTA):
-
Gap weight (Number of gaps)
-
Gap length weight (Total length of gaps)
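As a worked example of the two gap parameters, the sketch below computes
the total gap penalty for the gapped alignment of seq 1 and seq 2 shown
earlier. It assumes the simple reading given in the list above (gap weight
times the number of gaps, plus gap length weight times the total gap
length). The function name is hypothetical, and the default values 3.0 and
0.1 are simply the GapWeight and GapLengthWeight printed in the PileUp
output later in this document, used here for illustration.

```python
import re

# The gapped alignment from the text, written without spaces.
ALIGNED1 = "ALQLIW--GTSIRDKWPGDL"
ALIGNED2 = "ALQMIWAGGTSIRD--PGDL"


def gap_penalty(aligned1, aligned2, gap_weight=3.0, gap_length_weight=0.1):
    """Total penalty = gap_weight * (number of gaps)
                     + gap_length_weight * (total length of gaps)."""
    gaps = [m.group() for seq in (aligned1, aligned2)
            for m in re.finditer(r"-+", seq)]
    number_of_gaps = len(gaps)
    total_length = sum(len(g) for g in gaps)
    return gap_weight * number_of_gaps + gap_length_weight * total_length


# Two gaps, each two positions long: 2 * 3.0 + 4 * 0.1
print(gap_penalty(ALIGNED1, ALIGNED2))
```

With these weights, opening a gap costs far more than lengthening one, so
one long gap is preferred over several short ones.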
Substitution matrices
What is taken into account
in the figure above is only whether there is an exact match or not.
However, we also have to take into consideration
whether some matches or mismatches are better than others. To decide on this we
can assign a score to each amino acid
pair, with a high score for matches or mismatches that frequently occur in
proteins and a low score for mismatches that seldom occur. To do this we make
use of a table (substitution matrix) that
looks something like this:
Substitution matrix (PAM250)
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  2 -2  0  0 -2  0  0  1 -1 -1 -2 -1 -1 -3  1  1  1 -6 -3  0  0  0  0 -8
R     6  0 -1 -4  1 -1 -3  2 -2 -3  3  0 -4  0  0 -1  2 -4 -2 -1  0 -1 -8
N        2  2 -4  1  1  0  2 -2 -3  1 -2 -3  0  1  0 -4 -2 -2  2  1  0 -8
D           4 -5  2  3  1  1 -2 -4  0 -3 -6 -1  0  0 -7 -4 -2  3  3 -1 -8
C             12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3  0 -2 -8  0 -2 -4 -5 -3 -8
Q                 4  2 -1  3 -2 -2  1 -1 -5  0 -1 -1 -5 -4 -2  1  3 -1 -8
E                    4  0  1 -2 -3  0 -2 -5 -1  0  0 -7 -4 -2  3  3 -1 -8
G                       5 -2 -3 -4 -2 -3 -5  0  1  0 -7 -5 -1  0  0 -1 -8
H                          6 -2 -2  0 -2 -2  0 -1 -1 -3  0 -2  1  2 -1 -8
I                             5  2 -2  2  1 -2 -1  0 -5 -1  4 -2 -2 -1 -8
L                                6 -3  4  2 -3 -3 -2 -2 -1  2 -3 -3 -1 -8
K                                   5  0 -5 -1  0  0 -3 -4 -2  1  0 -1 -8
M                                      6  0 -2 -2 -1 -4 -2  2 -2 -2 -1 -8
F                                         9 -5 -3 -3  0  7 -1 -4 -5 -2 -8
P                                            6  1  0 -6 -5 -1 -1  0 -1 -8
S                                               2  1 -2 -3 -1  0  0  0 -8
T                                                  3 -5 -3  0  0 -1  0 -8
W                                                    17  0 -6 -5 -6 -4 -8
Y                                                       10 -2 -3 -4 -2 -8
V                                                           4 -2 -2 -1 -8
B                                                              3  2 -1 -8
Z                                                                 3 -1 -8
X                                                                   -1 -8
*                                                                       1
The table was constructed
from alignments of proteins, such as cytochrome c, from a large number of
organisms. For each amino acid position in such an alignment a probability is
calculated for all the possible amino acid replacements, for instance the
probability that serine replaces alanine. In nature that replacement is the
result of one or more mutations at the DNA level. The table of probabilities
was normalized such that for every 100 amino acids an average of 1 mutation
was accepted (1 accepted point mutation per 100 residues = PAM1). The PAM1 table was
calculated by comparing closely related sequences. To estimate more distant
relationships the PAM250 table was derived from it. Essentials: positive values reflect
conservative changes, negative values reflect changes that are unlikely to occur. W and C are
amino acids that are particularly conserved.
PAM tables have been
widely used. Recently however, BLOSUM (Blocks substitution matrices) tables
have become more popular as they often seem to give a somewhat better result.
BLOSUM tables were generated by comparing short conserved regions (BLOCKS) of
proteins. BLOSUM matrices seem to favour hydrophilic matches and aromatic
mismatches.
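To show how a substitution matrix changes the picture compared with plain
identity counting, the sketch below scores the ungapped columns of the
example alignment of seq 1 and seq 2. Only the PAM250 values needed for
this alignment are copied from the table above; the dictionary and
function names are hypothetical, and gap penalties are deliberately left
out.

```python
# Subset of the PAM250 table above: only the pairs that occur in the example.
PAM250_SUBSET = {
    ("A", "A"): 2, ("D", "D"): 4, ("G", "G"): 5, ("I", "I"): 5,
    ("L", "L"): 6, ("L", "M"): 4, ("P", "P"): 6, ("Q", "Q"): 4,
    ("R", "R"): 6, ("S", "S"): 2, ("T", "T"): 3, ("W", "W"): 17,
}

ALIGNED1 = "ALQLIW--GTSIRDKWPGDL"
ALIGNED2 = "ALQMIWAGGTSIRD--PGDL"


def pair_score(a, b, matrix=PAM250_SUBSET):
    """Look up a pair in either order, since the matrix is symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def alignment_score(aligned1, aligned2, matrix=PAM250_SUBSET):
    """Sum the matrix scores of all columns that contain no gap."""
    return sum(pair_score(a, b, matrix)
               for a, b in zip(aligned1, aligned2)
               if a != "-" and b != "-")


identities = sum(a == b for a, b in zip(ALIGNED1, ALIGNED2) if a != "-")
print("identities:", identities)
print("PAM250 score of ungapped columns:", alignment_score(ALIGNED1, ALIGNED2))
```

Note that the single mismatch in the alignment (L aligned with M) still
contributes +4, because this replacement is common in related proteins,
whereas the conserved W alone contributes 17.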
Low complexity masking
Filter out short repeats
For these parameters see
BLAST below.
BLAST
Speed is increased
compared to FASTA :
-
Wordsize = 3-4
-
No gaps are allowed.
BLAST is therefore less
sensitive than FASTA and less suitable for comparing nucleotide sequences.
BLAST programs
blastp
compares an amino acid query sequence against a protein sequence database;
blastn
compares a nucleotide query sequence against a nucleotide sequence database;
blastx
compares the six-frame conceptual translation products of a nucleotide query
sequence (both strands) against a protein sequence database;
tblastn
compares a protein query sequence against a nucleotide sequence database
dynamically translated in all six reading frames (both strands);
tblastx
compares the six-frame translations of a nucleotide query sequence against the
six-frame translations of a nucleotide sequence database.
Some BLAST parameters:
EXPECT
The statistical
significance threshold for reporting matches against database sequences; the
default value is 10, such that 10 matches are expected to be found merely by
chance, according to the stochastic model of Karlin and Altschul (1990). If the
statistical significance ascribed to
a match is greater than the EXPECT threshold, the match will not be reported.
Lower EXPECT thresholds are more stringent, leading to fewer chance matches
being reported. Fractional values are
acceptable. (See parameter E in the
BLAST Manual).
CUTOFF
Cutoff score for
reporting high-scoring segment pairs. The default value is calculated from the
EXPECT value (see above). HSPs are
reported for a database sequence only if the statistical significance ascribed
to them is at least as high as would be ascribed to a lone HSP having a score
equal to the CUTOFF value. Higher
CUTOFF values are more stringent, leading to fewer chance matches being
reported. (See parameter S in the BLAST
Manual). Typically, significance
thresholds can be more intuitively managed using EXPECT.
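The relationship between a score and the expected number of chance matches
can be sketched with the Karlin and Altschul formula E = K * m * n *
exp(-lambda * S), where m is the query length and n the database length.
The code below is only an illustration: the values of lambda and K are
assumed, commonly quoted figures for ungapped BLOSUM62 scoring (the real
program estimates them for each search), the hit names and scores are
invented, and the function names are hypothetical. The query and database
sizes are those of the example output further down.

```python
import math

QUERY_LENGTH = 232            # letters in the example query
DATABASE_LENGTH = 13464008    # total letters in SWISS-PROT release 29.0


def expect_value(score, query_len, db_len, lam=0.318, k=0.134):
    """Expected number of chance HSPs scoring at least `score`.
    lam and k are ASSUMED illustrative values, not taken from the text."""
    return k * query_len * db_len * math.exp(-lam * score)


def probability(expect):
    """Probability of at least one such chance match: P = 1 - exp(-E)."""
    return -math.expm1(-expect)


def report(hits, query_len, db_len, expect_threshold=10.0):
    """Keep only hits whose expect value does not exceed the threshold."""
    kept = []
    for name, score in hits:
        e = expect_value(score, query_len, db_len)
        if e <= expect_threshold:
            kept.append((name, score, e))
    return kept


# Hypothetical hits: (name, raw score of a single HSP).
hits = [("hit_a", 120), ("hit_b", 40), ("hit_c", 25)]
for name, score, e in report(hits, QUERY_LENGTH, DATABASE_LENGTH):
    print(name, score, "E = %.2g" % e, "P = %.2g" % probability(e))
```

For small E the probability P is almost equal to E, which is why the two
numbers printed by BLAST for a strong hit are often identical.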
MATRIX
Specify an
alternate scoring matrix for BLASTP, BLASTX,
TBLASTN and TBLASTX. The default
matrix is BLOSUM62 (Henikoff & Henikoff, 1992). The valid alternative choices include: PAM40, PAM120, PAM250 and IDENTITY. No alternate scoring matrices
are available for BLASTN; specifying the MATRIX directive in BLASTN requests
returns an error response.
STRAND
Restrict a TBLASTN
search to just the top or bottom strand of the database sequences; or restrict
a BLASTN, BLASTX or TBLASTX search to just reading frames on the top or bottom
strand of the query sequence.
FILTER
Mask off segments
of the query sequence that have low compositional complexity, as determined by
the SEG program of Wootton & Federhen (Computers and Chemistry, 1993), or
segments consisting of short-periodicity internal repeats, as determined by the
XNU program of Claverie & States (Computers and Chemistry, 1993), or, for
BLASTN, by the DUST program of Tatusov and Lipman (in preparation).
Filtering can
eliminate statistically significant but biologically uninteresting reports from
the blast output (e.g., hits against common acidic-, basic- or proline-rich
regions), leaving the more biologically interesting regions of the query
sequence available for specific matching against database sequences.
Low complexity
sequence found by a filter program is substituted using the letter "N" in
nucleotide sequence (e.g., "NNNNNNNNNNNNN") and the letter "X" in protein
sequences (e.g., "XXXXXXXXX"). Users may turn off filtering by using the
"Filter" option on the "Advanced options for the BLAST server" page.
Filtering is only
applied to the query sequence (or its translation products), not to database
sequences. Default filtering is DUST
for BLASTN, SEG for other programs.
It is not unusual
for nothing at all to be masked by SEG, XNU, or both, when applied to sequences
in SWISS-PROT, so filtering should not be expected to always yield an
effect. Furthermore, in some cases,
sequences are masked in their entirety, indicating that the statistical
significance of any matches reported against the unfiltered query sequence
should be suspect.
Example of BLAST output:
BLASTP 1.4.6MP [13-Jun-94] [Build
13:58:36 Sep 22 1994]
Reference:
Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers,
and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol.
215:403-10.
Query =
pir|A01243|DXCH 232 Gene X
protein - Chicken (fragment)
(232 letters)
Database:
SWISS-PROT Release 29.0
38,303 sequences; 13,464,008 total letters.
Searching..................................................done
Observed Numbers of Database Sequences Satisfying
Various EXPECTation Thresholds (E parameter values)
Histogram units:  = 31 Sequences  : less than 31 sequences
EXPECTation Threshold
(E parameter)
   |
   V    Observed Counts--
 10000 4863 1861 |============================================================
  6310 3002  782 |=========================
  3980 2220  812 |==========================
  2510 1408  303 |=========
  1580 1105  393 |============
  1000  712  179 |=====
   631  533  161 |=====
   398  372   80 |==
   251  292   73 |==
   158  219   50 |=
   100  169   32 |=
  63.1  137   18 |:
  39.8  119    9 |:
  25.1  110    6 |:
  15.8  104    9 |:
        Expect = 10.0, Observed = 95  <<<<<<<<<<<<<<<<
  10.0   95    4 |:
  6.31   91    3 |:
  3.98   88    1 |:
  2.51   87    3 |:
  1.58   84    0 |
  1.00   84    2 |:
                                                                    Smallest
                                                                      Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N
sp|P01013|OVAX_CHICK  GENE X PROTEIN (OVALBUMIN-RELATED) (...  1191  7.7e-160  1
sp|P01014|OVAY_CHICK  GENE Y PROTEIN (OVALBUMIN-RELATED).       949  7.0e-127  1
sp|P01012|OVAL_CHICK  OVALBUMIN (PLAKALBUMIN).                  645  3.4e-100  2
sp|P19104|OVAL_COTJA  OVALBUMIN.                                626  1.2e-96   2
sp|P05619|ILEU_HORSE  LEUKOCYTE ELASTASE INHIBITOR (LEI).       216  3.7e-71   3
sp|P80229|ILEU_PIG    LEUKOCYTE ELASTASE INHIBITOR (LEI) (...   325  4.0e-71   2
sp|P29508|SCCA_HUMAN  SQUAMOUS CELL CARCINOMA ANTIGEN (SCC...   439  3.5e-70   2
sp|P30740|ILEU_HUMAN  LEUKOCYTE ELASTASE INHIBITOR (LEI) (...   211  1.3e-66   3
sp|P05120|PAI2_HUMAN  PLASMINOGEN ACTIVATOR INHIBITOR-2, P...   176  1.8e-65   4
sp|P35237|PTI_HUMAN   PLACENTAL THROMBIN INHIBITOR.             473  1.3e-61   1
sp|P29524|PAI2_RAT    PLASMINOGEN ACTIVATOR INHIBITOR-2, T...   183  9.4e-61   4
sp|P12388|PAI2_MOUSE  PLASMINOGEN ACTIVATOR INHIBITOR-2, M...   179  1.8e-60   4
sp|P36952|MASP_HUMAN  MASPIN PRECURSOR.                         198  2.6e-58   4
sp|P32261|ANT3_MOUSE  ANTITHROMBIN-III PRECURSOR (ATIII).       142  4.0e-48   5
sp|P01008|ANT3_HUMAN  ANTITHROMBIN-III PRECURSOR (ATIII).       122  7.5e-48   5
WARNING: Descriptions of 80 database sequences were
not reported due to the
limiting value of parameter V = 15.
... alignments with the top 8 database sequences deleted ...
sp|P05120|PAI2_HUMAN PLASMINOGEN ACTIVATOR INHIBITOR-2, PLACENTAL
(PAI-2)
(MONOCYTE ARG- SERPIN).
Length = 415
Score = 176 (80.2 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65
Identities = 38/89 (42%), Positives = 50/89 (56%)
Query: 1 QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNN
60
+I +LL S D DT
+VLVNA+YFKG WKT F + PF V
+ PVQMM +
Sbjct: 180 KIPNLLPEGSVDGDTRMVLVNAVYFKGKWKTPFEKKLNGLYPFRVNSAQRTPVQMMYLRE
239
Query: 61 SFNVATLPAEKMKILELPFASGDLSMLVL 89
N+ + K +ILELP+A L+L
Sbjct: 240 KLNIGYIEDLKAQILELPYAGDVSMFLLL 268
Score = 165 (75.2 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65
Identities = 33/78 (42%), Positives = 47/78 (60%)
Query: 155 ANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFL
214
AN +G+S L
+S+ H A ++++E+G E A TG +
+ QF ADHPFLFL
Sbjct: 338 ANFSGMSERNDLFLSEVFHQAMVDVNEEGTEAAAGTGGVMTGRTGHGGPQFVADHPFLFL
397
Query: 215 IKHNPTNTIVYFGRYWSP 232
I H T I++FGR+ SP
Sbjct: 398 IMHKITKCILFFGRFCSP 415
Score = 144 (65.6 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65
Identities = 26/62 (41%), Positives = 41/62 (66%)
Query: 90 LPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTD
149
+ D + LE
+E I ++KL +WT+ + M + V+VY+PQ K+EE Y L S+L ++GM D
Sbjct: 272 IADVSTGLELLESEITYDKLNKWTSKDKMAEDEVEVYIPQFKLEEHYELRSILRSMGMED
331
Query: 150 LF 151
F
Sbjct: 332 AF 333
Score = 61 (27.8 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65
Identities = 10/17 (58%), Positives = 16/17 (94%)
Query: 81 SGDLSMLVLLPDEVSDL 97
+GD+SM +LLPDE++D+
Sbjct: 259 AGDVSMFLLLPDEIADV 275
WARNING: HSPs involving 86
database sequences were not reported due to the
limiting value of parameter B = 9.
FASTA
Two major parts of the
program:
-
Search of the database; the 100-200 database sequences that best
match the query are saved.
-
These sequences are carefully examined and the best possible alignment
with the query sequence is determined for each. This step yields an optimized score,
"opt". (FASTA optimizes only in the final step. However, it is
possible to look for the optimal alignment of a query with every sequence in the
database. This is referred to as a Smith and Waterman search.)
If you use FASTA you will
most likely find it slower than BLAST.
An advantage of GCG FASTA is that you can search a subset of the EMBL
or Swissprot databases and in this way reduce the search time considerably.
TFASTA is a variation of
the FASTA program that searches a nucleotide database with a protein query,
examining all six reading frames of the database. Therefore,
TFASTA takes roughly six times longer to run than FASTA.
FASTA output
Histogram Key:
Each histogram symbol represents
128 search set sequences
Each inset symbol represents 1
search set sequences
Score Init1 Initn
(‑) (+)
< 2 5 5:=
4 0 0:
6 3 3:=
8 22 22:=
10 72 72:=
12 341 341:===
14 394 394:====
16 1304 1304:===========
18 1525 1525:============
20 4001 4001:================================
22 5729 5729:=============================================
24 6806 6806:======================================================
26 7626 7626:============================================================
28 6416 6416:===================================================
30 4848 4592:====================================‑‑
32 3710 3281:==========================‑‑‑
34 2457 2081:=================‑‑‑
36 1740 1441:============‑‑
38 1240 947:========‑‑
40 789 638:=====‑‑
42 517 483:====‑
44 292 496:===+
46 190 437:==++
48 124 387:=+++
50 81 316:=++
52 53 249:=+
54 39 182:=+
56 15 153:=+
58 13 119:=
60 8 91:=
62 20 74:=
64 2 36:=
66 4 33:=
68 1 21:=
70 1 15:=
72 0 20:+
74 0 10:+ :++++++++++
76 0 17:+ :+++++++++++++++++
78 1 5:= :=++++
80 0 2:+ :++
82 0 2:+ :++
84 0 3:+ :+++
86 0 4:+ :++++
88 0 1:+ :+
90 0 4:+ :++++
92 0 2:+ :++
94 0 1:+ :+
96 0 0: :
98 0 2:+ :++
100 0 0: :
>100 19 19:= :===================
The best scores are:                                       init1 initn   opt..
swissprotold:sr54_mycmy Q01442 mycoplasma mycoides. sign... 2044  2044  2044
swnew:sr54_mycge P47294 mycoplasma genitalium. signal re...  697   773  1038
swissprotold:sr54_bacsu P37105 bacillus subtilis. signal...  477   705  1023
swissprotold:sr54_ecoli P07019 escherichia coli. signal ...  343   588   892
swissprotold:sr54_haein P44518 haemophilus influenzae. s...  341   584   888
swissprotold:sr5c_arath P37107 arabidopsis thaliana (mou...  470   518   895
swissprotold:sr54_yeast P20424 saccharomyces cerevisiae ...  370   435   593
swissprotold:sr54_canfa P13624 canis familiaris (dog). s...  335   413   641
swissprotold:sr54_mouse P14576 mus musculus (mouse). sig...  329   407   635
swissprotold:sr54_arath P37106 arabidopsis thaliana (mou...  343   379   607
swissprotold:sr54_schpo P21565 schizosaccharomyces pombe...  351   351   649
swissprotold:ftsy_haein P44870 haemophilus influenzae. c...  153   304   404
swnew:ftsy_mycge P47539 mycoplasma genitalium. cell divi...  151   283   321
swissprotold:pila_neigo P14929 neisseria gonorrhoeae. pr...  176   257   388
swissprotold:dock_sulso P27414 sulfolobus solfataricus, ...  254   254   458
swissprotold:srpr_yeast P32916 saccharomyces cerevisiae ...  155   224   223
swissprotold:ftsy_ecoli P10121 escherichia coli. cell di...  121   208   407
swissprotold:srpr_human P08240 homo sapiens (human). sig...  121   155   274
swissprotold:srpr_canfa P06625 canis familiaris (dog). s...  121   121   267
swissprotold:rest_human P30622 homo sapiens (human). res...   63    98    79
swissprotold:srmb_ecoli P21507 escherichia coli. atp‑dep...   61    98    97
swissprotold:gyrb_mycpn P22447 mycoplasma pneumoniae. dn...   50    93    60
swissprotold:flhf_bacsu Q01960 bacillus subtilis. flagel...   77    92   168
swissprotold:hs7c_caeel P27420 caenorhabditis elegans. h...   48    92    53
swissprotold:gr78_rat P06761 rattus norvegicus (rat). 78...   49    90    52
swissprotold:gr78_mesau P07823 mesocricetus auratus (gol...   49    90    52
swissprotold:gr78_human P11021 homo sapiens (human). 78 ...   49    90    52
swissprotold:gr78_mouse P20029 mus musculus (mouse). 78 ...   49    90    52
swissprotold:trj2_ecoli P05837 escherichia coli. traj pr...   47    88    54
swissprotold:rnha_human Q08211 homo sapiens (human). atp...   49    86    59
swissprotold:rrpl_hrsva P28887 human respiratory syncyti...   49    85    57
swissprotold:ycf1_tobac P12222 nicotiana tabacum (common...   57    85    77
swissprotold:dnak_mycpa Q00488 mycobacterium paratubercu...   53    85    75
swissprotold:vl2_hpv49 P36762 human papillomavirus type ...   37    84    55
swissprotold:dnak_mycle P19993 mycobacterium leprae. dna...   53    83    80
swissprotold:cse1_yeast P33307 saccharomyces cerevisiae ...   66    83    69
swissprotold:msap_plafw P04933 plasmodium falciparum (is...   45    82    77
swissprotold:pr16_yeast P15938 saccharomyces cerevisiae ...   45    81    56
swissprotold:utro_human P46939 homo sapiens (human). utr...   41    80    90
swissprotold:kar3_yeast P17119 saccharomyces cerevisiae ...   43    79    88
ID SR54_MYCMY STANDARD; PRT; 447 AA.
AC Q01442;
DT 01‑JUL‑1993 (REL.
26, CREATED)
DT 01‑JUL‑1993 (REL.
26, LAST SEQUENCE UPDATE)
DT 01‑FEB‑1995 (REL.
31, LAST ANNOTATION UPDATE)
DE SIGNAL RECOGNITION PARTICLE
PROTEIN (FIFTY‑FOUR HOMOLOG). . . .
SCORES Init1: 2044 Initn:
2044 Opt: 2044
100.0% identity in 447 aa
overlap
10 20 30 40 50 60
sr54_m MGFGDFLSKRMQKSIEKNMKNSTLNEENIKETLKEIRLSLLEADVNIEAAKEIINNVKQK
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
sr54_m MGFGDFLSKRMQKSIEKNMKNSTLNEENIKETLKEIRLSLLEADVNIEAAKEIINNVKQK
10 20 30 40 50 60
70 80 90 100 110 120
sr54_m ALGGYISEGASAHQQMIKIVHEELVNILGKENAPLDINKKPSVVMMVGLQGSGKTTTANK
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
sr54_m ALGGYISEGASAHQQMIKIVHEELVNILGKENAPLDINKKPSVVMMVGLQGSGKTTTANK
70 80 90 100 110 120
130 140 150 160 170 180
sr54_m LAYLLNKKNKKKVLLVGLDIYRPGAIEQLVQLGQKTNTQVFEKGKQDPVKTAEQALEYAK
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
sr54_m LAYLLNKKNKKKVLLVGLDIYRPGAIEQLVQLGQKTNTQVFEKGKQDPVKTAEQALEYAK
130 140 150 160 170 180
190 200 210 220 230 240
sr54_m ENNFDVVILDTAGRLQVDQVLMKELDNLKKKTSPNEILLVVDGMSGQEIINVTNEFNDKL
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
sr54_m ENNFDVVILDTAGRLQVDQVLMKELDNLKKKTSPNEILLVVDGMSGQEIINVTNEFNDKL
190 200 210 220 230 240
.......
sr54_mycmy
swnew:sr54_mycge
ID SR54_MYCGE STANDARD; PRT; 446 AA.
AC P47294;
DT 01‑FEB‑1996 (REL.
33, CREATED)
DT 01‑FEB‑1996 (REL.
33, LAST SEQUENCE UPDATE)
DT 01‑FEB‑1996 (REL.
33, LAST ANNOTATION UPDATE)
DE SIGNAL RECOGNITION PARTICLE
PROTEIN (FIFTY‑FOUR HOMOLOG). . . .
SCORES Init1: 697 Initn: 773 Opt: 1038
44.7% identity in 445 aa
overlap
10 20 30 40 50 60
sr54_m MGFGDFLSKRMQKSIEKNMKNSTLNEENIKETLKEIRLSLLEADVNIEAAKEIINNVKQK
| ::||: : ::::|::::
|::|:::: :|||||::||:||||: ::|::|:::::|
sr54_m
MFKAMLSSIVMRTMQKKINAQTITEKDVELVLKEIRIALLDADVNLLVVKNFIKAIRDK
10 20 30 40 50
70 80 90 100 110 120
sr54_m ALGGYISEGASAHQQMIKIVHEELVNILGKENAPLDINKKPSVVMMVGLQGSGKTTTANK
::| |: |:: :: ::|::::||:|||:: |: |: :|:| :|||||||||||| :|
sr54_m TVGQTIEPGQDLQKSLLKTIKTELINILSQPNQELN‑EKRPLKIMMVGLQGSGKTTTCGK
PATTERN MATCHING
Homology searches are used
for finding related sequences and allow you to search with relatively long
sequences. Pattern searches are used for finding short sequence patterns in a
single sequence, in a group of sequences or in the databases. For instance, the
pattern "GDSGGP" is typical of serine proteases. When a sequence or
database of sequences is analyzed with respect to this pattern the program only
decides whether there is an exact match or not. Consequently, in pattern
matching one is not concerned with gaps and substitution matrices as in
homology searches.
Patterns may be more
complex than the one shown above for serine proteases. The pattern
corresponding to motif A of the ATP/GTP-binding site, for instance is:
[AG]-x(4)-G-K-[ST]
which is to be interpreted :
One amino acid that can be either A or G, followed by any four amino acids,
followed by G and K and finally one amino acid which is either S or T.
For instance,
AGRCGGKT
GGLGGGKT
AGRSGGKS
AGAAGGKT
are all examples of
sequences that will match the pattern.
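Because a pattern either matches or does not, a PROSITE-style pattern can
be translated almost mechanically into a regular expression. The sketch
below is an illustration and not the FINDPATTERNS or MOTIFS implementation:
the function name is hypothetical, and only the elements used in this
section are handled (single residues, [..] alternatives, {..} exclusions,
x wildcards and (n) or (n,m) repeat counts). The PROSITE anchors "<" and
">" are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Convert e.g. '[AG]-x(4)-G-K-[ST].' into the regex '[AG].{4}GK[ST]'."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split an element into its core and an optional (n) or (n,m) count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                      # any amino acid
            core = "."
        elif core.startswith("{"):           # any amino acid except these
            core = "[^" + core[1:-1] + "]"
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST].")
print(p_loop)
for peptide in ("AGRCGGKT", "GGLGGGKT", "AGRSGGKS", "AGAAGGKT", "AGRCGGKA"):
    print(peptide, bool(re.fullmatch(p_loop, peptide)))
```

To scan a longer protein for the motif, the same regular expression can be
used with re.finditer instead of re.fullmatch.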
(The GCG program
FINDPATTERNS performs this type of search. The search pattern is your own or a
predefined one such as a PROSITE pattern. The program MOTIFS specifically searches a
protein sequence or set of sequences for the PROSITE motifs. In the GCG program
package the PROSITE information is in a file "prosite.patterns". This
file may be retrieved by typing at the command line: % fetch prosite.patterns)
Pattern searches are very
useful when you encounter a protein and you have no information about its
structure or biochemical function. By searching with PROSITE motifs you may, if
you are lucky, obtain a clue as to the structure or function of the protein.
Examples of data in
PROSITE:
ID   ATP_GTP_A; PATTERN.
AC   PS00017;
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); NOV-1990 (INFO UPDATE).
DE   ATP/GTP-binding site motif A (P-loop).
PA   [AG]-x(4)-G-K-[ST].
CC   /TAXO-RANGE=ABEPV;
3D   1EFM; 1ETU; 1Q21; 2Q21; 4Q21; 5Q21; 6Q21;
DO   PDOC00017;
ID   ZINC_FINGER_C2H2; PATTERN.
AC   PS00028;
DT   APR-1990 (CREATED); JUN-1994 (DATA UPDATE); NOV-1997 (INFO UPDATE).
DE   Zinc finger, C2H2 type, domain.
PA   C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.
NR   /RELEASE=35,69113;
NR   /TOTAL=1932(412); /POSITIVE=1891(372); /UNKNOWN=6(6); /FALSE_POS=35(34);
NR   /FALSE_NEG=3; /PARTIAL=1;
CC   /TAXO-RANGE=??E?V; /MAX-REPEAT=37;
CC   /SITE=1,zinc; /SITE=3,zinc; /SITE=7,zinc; /SITE=9,zinc;
DR   P21192, ACE2_YEAST, T; P07248, ADR1_YEAST, T; P39413, AEF1_DROME, T;
DR   Q00900, AGIE_RAT  , T; P41696, AZF1_YEAST, T; Q01954, BASO_HUMAN, T;
DR   P41182, BCL6_HUMAN, T; P41183, BCL6_MOUSE, T; P55201, BR14_HUMAN, T;
DR   Q01295, BRC1_DROME, T; Q01296, BRC2_DROME, T; Q01293, BRC3_DROME, T;
DR   P10069, BRLA_EMENI, T; Q01713, BTEB_RAT  , T; Q01522, CF23_DROME, T;
DR   P20385, CF2_DROME , T; P19538, CID_DROME , T; Q05620, CREA_ASPNG, T;
DR   Q01981, CREA_EMENI, T; Q08705, CTCF_CHICK, T; P49711, CTCF_HUMAN, T;
DR   P36197, DEFI_CHICK, T; P23792, DISC_DROME, T; P26632, EGR1_BRARE, T;
DR   P18146, EGR1_HUMAN, T; P08046, EGR1_MOUSE, T; P08154, EGR1_RAT  , T;
DR   Q05159, EGR2_BRARE, T; P26633, EGR2_CRILO, T; P26634, EGR2_DUSTH, T;
DR   P11161, EGR2_HUMAN, T; P08152, EGR2_MOUSE, T; P26635, EGR2_POERE, T;
.
. (Edited
here to save a few Swedish trees...)
.
DR   P07939, VSI3_REOVL, F; Q93098, WN8B_HUMAN, F; P51028, WNT8_BRARE, F;
DR   P28026, WNT8_XENLA, F; P20201, Y15K_SSV1 , F; P20198, Y5K6_SSV1 , F;
DR   P43558, YFE4_YEAST, F; P37127, YFFG_ECOLI, F; P38890, YH07_YEAST, F;
DR   Q09441, YP83_CAEEL, F;
3D   1ARD; 1ARE; 1ARF; 1PAA; 1ZAA; 1AAY; 2GLI; 1SP1; 1SP2; 1NCS; 1ZFD; 1TF3;
3D   2DRP; 1ZNF; 3ZNF; 4ZNF; 1BBO; 5ZNF; 7ZNF;
DO   PDOC00028;
Multiple sequence
alignment
The homology search
programs FASTA and BLAST both rely on a basic procedure for comparing two
sequences with each other. Multiple sequence alignment programs, on the other
hand, allow you to align and directly compare more than two related sequences. This procedure is a very useful tool if you
want to analyze a family of proteins, for instance to identify the
structural elements that are characteristic of that family. A common program
for multiple sequence alignment is CLUSTALW, or in the GCG package PILEUP.
PILEUP does an alignment
with many of the same considerations as programs like FASTA and BLAST. However,
it is not possible to compare many sequences as rigorously as
two sequences can be compared.
Instead, PILEUP and CLUSTALW build the multiple alignment from a series of pairwise
comparisons.
Let's look at the result
of a PILEUP program:
PileUp of:
@../kurs/l5/evol.list
Symbol comparison table: GenRunData:pileuppep.cmp CompCheck: 1254
GapWeight: 3.000
GapLengthWeight: 0.100
evol.msf MSF: 650 Type: P
July 24, 1995 17:08 Check: 7557
..
 Name: HS71_HUMAN       Len:   650  Check: 2160  Weight:  1.00
 Name: HS71_MOUSE       Len:   650  Check: 3217  Weight:  1.00
 Name: HS72_MOUSE       Len:   650  Check:  383  Weight:  1.00
 Name: HS70_CHICK       Len:   650  Check: 1880  Weight:  1.00
 Name: HS72_YEAST       Len:   650  Check: 3179  Weight:  1.00
 Name: HS71_YEAST       Len:   650  Check: 5547  Weight:  1.00
 Name: HS74_YEAST       Len:   650  Check: 1191  Weight:  1.00
//
            1                                                   50
HS71_HUMAN  .MAKAAAIGI DLGTTYSCVG VFQHGKVEII ANDQGNRTTP SYVAFTDTER
HS71_MOUSE  .MAKNTAIGI DLGTTYSCVG VFQHGKVEII ANDQGNRTTP SYVAFTDTER
HS72_MOUSE  MSARGPAIGI DLGTTYSCVG VFQHGKVEII ANDQGNRTTP SYVAFTDTER
HS70_CHICK  MSGKGPAIGI DLGTTYSCVG VFQHGKVEII ANDQGNRTTP SYVAFTDTER
HS72_YEAST  ....SKAVGI DLGTTYSCVA HFSNDRVDII ANDQGNRTTP SFVGFTDTER
HS71_YEAST  ....SKAVGI DLGTTYSCVA HFANDRVDII ANDQGNRTTP SFVAFTDTER
HS74_YEAST  ....SKAVGI DLGTTYSCVA HFANDRVEII ANDQGNRTTP SYVAFTDTER
--- cut to save Swedish trees ---
            601                                                650
HS71_HUMAN  ELEQVCNPII SGLYQGAGGP GPGGFGAQ.. ...G.PKGGS GSGPTIEEVD
HS71_MOUSE  ELERVCSPII SGLYQGAGAP GAGGFGAQ.. ...APPKGAS GSGPTIEEVD
HS72_MOUSE  ELERVCNPII SKLYQG.GPG GGG....... .........S SGGPTIEEVD
HS70_CHICK  ELEKLCNPIV TKLYQGAGGA GAG....... .........G SGGPTIEEVD
HS72_YEAST  ELQEVANPIM SKLYQAGGAP EGA...APGG FPGGAPPAPE AEGPTVEEVD
HS71_YEAST  ELQDIANPIM SKLYQAGGAP GGAAGGAPGG FPGGAPPAPE AEGPTVEEVD
HS74_YEAST  ELEGVANPIM SKFY...GAA GGAPGAGPVP GAGAGPTGAP DNGPTVEEVD
The graphical output file from PILEUP displays a similarity tree that
illustrates the similarity between the sequences. This gives a good overview
of how the sequences are related to each other. The length of the vertical
lines is directly proportional to the difference between the sequences (or
sequence groups). By looking at the figure below you may estimate that
sequences 2 and 3 (from the left) are the most similar, and that they are
about twice as similar to each other as sequences 6 and 7 are.
By looking at this tree it's rather easy to understand how PILEUP works. The
program creates the tree by pairing the most similar sequences (or clusters
of previously paired sequences) step by step. It first selects the two
sequences that are most similar (in this case sequences 2 and 3 from the
left) and groups these two together. It then takes the next two sequences (or
groups) that are most similar, in this case sequences 6 and 7. The next step
is to pair 4 and 5. Now it pairs the group 4,5 with the group 6,7. Then
sequence 1 is joined to the group 2,3. Finally the two large groups are
joined.
Note that the tree is a similarity tree. You may be fooled into thinking that
it is a phylogenetic tree, but it's not. Nevertheless, the figure is very
informative, yet simple and easy to interpret.
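The pairing procedure described above can be sketched in a few lines of
Python. This is only an illustration, not the PILEUP code: the similarity
values are invented so that the joins come out in the order given in the
text, and the use of the average similarity between the members of two
groups is an assumption made for this sketch.

```python
from itertools import combinations

def toy_similarity(a, b):
    """Invented pairwise similarities for seven sequences numbered 1-7."""
    pair = {a, b}
    if pair == {2, 3}:
        return 0.98
    if pair == {6, 7}:
        return 0.96
    if pair == {4, 5}:
        return 0.94
    if pair <= {4, 5, 6, 7}:
        return 0.90
    if pair <= {1, 2, 3}:
        return 0.85
    return 0.70

def average_similarity(c1, c2, sim):
    """Mean similarity over all pairs with one member in each cluster."""
    values = [sim(a, b) for a in c1 for b in c2]
    return sum(values) / len(values)

def clustering_order(names, sim):
    """Repeatedly join the two most similar clusters; return each new group."""
    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda p: average_similarity(p[0], p[1], sim))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append(sorted(c1 + c2))
    return merges

for step, group in enumerate(clustering_order(range(1, 8), toy_similarity), 1):
    print(step, group)
```

With these invented values the groups are formed in the order 2,3 then 6,7
then 4,5 then 4,5,6,7 then 1,2,3 and finally all seven sequences.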
Profile analysis
Profile analysis is a method of sequence comparison that is distinct from
homology and pattern searches. The starting point is a multiple sequence
alignment. (That alignment is not necessarily the result of a sequence
alignment program like PILEUP; it could also be an alignment based on
three-dimensional structures.) The alignment is used to create a PROFILE.
The profile is a table where we find, for each amino acid position, the
frequency (normalized in some way) of each of the 20 amino acids.
Below is the beginning of an abridged version of the output file. The
sequence is on the vertical axis, one amino acid per line. Each line then
contains scores against all amino acids and, at the end, for a gap at that
position.
Like this:
(Amino acids M - T have been edited out in this table)
Cons    A    B    C    D    E    F    G    H    I    K  --    V    W    Y    Z  Gap  Len ..
M       0   -4   -8   -5   -3    7   -4   -4    8    3  --    8   -4   -1   -1   24   24
S       6    0    1   -3    0    3    4   -7    7    6  --    7    0   -7   -1   24   24
A      38    8    8   11   10  -15   25   -3   -2   -1  --    6  -25  -11    7   24   24
K      -2   10  -16    7    7  -19   -4    6   -6   39  --   -6   11  -17   10   24   24
S      30   23   19   21   17  -21   41   -5   -8    6  --    0  -14  -21    8  100  100
K      19   13   -9   11   11  -31    9    4   -6   37  --   -1  -18  -28   14  100  100
A     150   20   30   30   30  -50   70  -10    0    0  --   20  -80  -30   20  100  100
I       5  -11   11  -11  -11   28   -5  -17   75  -11  --   71  -35    1  -11  100  100
G      70   60   20   70   50  -60  150  -20  -30  -10  --   20 -100  -70   30  100  100
I       0  -20   20  -20  -20   70  -30  -30  150  -20  --  110  -50   10  -20  100  100
D      30  110  -50  150  100 -100   70   40  -20   30  --  -20 -110  -50   90  100  100
L     -10  -50  -80  -50  -30  120  -50  -20   80  -30  --   80   50   30  -20  100  100
The GCG program
"Profilemake" creates the profile. The next step is to use the
profile to search a database. The GCG program to use is PROFILESEARCH. In this
way new members of the protein family may be identified.
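The idea of a profile, and of searching with it, can be illustrated with a
small Python sketch. It is deliberately much simpler than PROFILEMAKE and
PROFILESEARCH: the profile here holds plain residue frequencies per column
(the real programs also use an amino acid comparison table and
position-specific gap penalties), and the four-sequence alignment and the
target sequence are toy data. All function names are made up for the
example.

```python
from collections import Counter

def make_profile(alignment):
    """Per-column relative frequency of each residue; gap symbols '.' are ignored."""
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "."]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile

def score(profile, segment):
    """Sum of the profile frequencies of the residues in an ungapped segment."""
    return sum(column.get(aa, 0.0) for column, aa in zip(profile, segment))

def best_hit(profile, sequence):
    """Slide the profile along a sequence; return (best score, start position)."""
    width = len(profile)
    return max((score(profile, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

toy_alignment = ["VFQHG", "VFQHG", "HFSND", "HFAND"]
profile = make_profile(toy_alignment)
print(best_hit(profile, "AAVFQHGAA"))
```

A sequence segment scores highly when it carries, at each position, a residue
that is common in the corresponding column of the alignment, even if it is
not very similar to any single member of the family.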
Profile searching can be effective
for specific structural motifs that cannot be identified by conventional
homology searching. One very illuminating example is the helix-turn-helix (HTH)
motif. A search with a profile created from the HTH motif very effectively
identifies such motifs in the Swissprot protein database. In contrast, using
BLAST or FASTA with a
helix-turn-helix-containing protein as query sequence is useless in
identifying members of this family of proteins.
GCG - an introduction
After you have logged in to gcg.mednet.gu.se you start version 9 of the GCG
package with the command:
% gcg9
The following text
appears on your screen:
Welcome to the WISCONSIN PACKAGE
Version 9.0-UNIX, December 1996
Installed on irix
Copyright 1982-1996, Genetics Computer Group, Inc. All rights reserved.
Published research assisted by this software should cite:
Wisconsin Package Version 9.0, Genetics Computer Group (GCG), Madison, Wisc.
Databases available:
EMBL Release 52.0 (10/97)
SWISS-PROT Release 34.0 ( 9/96)
PROSITE Release 14.0 ( 9/96)
Restriction Enzymes (REBASE) (10/96)
em_new (EMBL since rel.52.0) was updated 02 March 1998
sw_new (SWISSPROT since rel.34.0) was updated 31 January 1997
Online help: % genhelp or http://www.gcg.com/genhelp/
The GCG program package
can be operated in two different ways:
- Command line
- Graphic interface SeqLab
In the command line mode all programs are operated by commands issued at the
command prompt. For instance, to translate the sequence "test.seq" you would
type the following at the command prompt:
% translate test.seq
Here we will focus on the
graphic environment provided by SeqLab. To start this interface:
% seqlab &
The symbol '&' is a UNIX shell operator that makes the program run "in the
background". This means that at any point you may return from a SEQLAB
window to the command line without closing SEQLAB.
(For SEQLAB to function it is essential that you have software on your local
computer to emulate X Windows and that you have issued the command "setenv
DISPLAY xx.yy.zz.aa:0", where "xx.yy.zz.aa" is the Internet name of your
computer, such as "laban.medkem.gu.se". It is convenient to make this
command part of your login procedure, i.e. to put it in your file .cshrc.)
If you want to have a
smaller or larger X window you may instead use
% seqlab -small &
and
% seqlab -large &
respectively.
Click "OK" on
the window in front.
You are now left with the
main SEQLAB window "The SeqLab Main Window on ...".
There is a GCG tutorial
on the SeqLab Interface which is a very good introduction to the environment.
Getting help
In SEQLAB, on-line help is always available. For instance, in the main window
click on "Help".
In the command line mode the same help information is available. To access
online help organized by program name, type:
% genhelp
To access online help
organized in sections like the Program Manual, type:
% genmanual
General information about
the Genetics Computer Group is found at
www.gcg.com and a help desk is available at the email address help@gcg.com
Seqlab - short introduction
There are two main modes of operation in the SeqLab window:
· Main list
· Editor
The default mode is Main list. Select the Editor mode when you are about to
edit a sequence or a multiple sequence alignment. All programs are run in the
following way. Click with the mouse on the sequence (or sequences) that you
want to analyze. Then go to the Functions menu. When you select the program
you want to run, a dialog box specific to that program shows up. If you
accept the default settings for the program, you simply click on the Run
button to start the program.
To see the result of the program you must go to Windows - Output manager.
Double-click on the result file to see its contents. If the result is a list
file you can add it to your Main list, for instance to use it as input to
another program.
If you think you have started a program but cannot find the result in the
Output window, you might find it useful to go to the Job manager window (also
under the Windows menu). From the list in Job manager you can, for instance,
see whether a program is running or whether an error occurred during its
execution (oops!).
Also under the Windows menu is the "Database browser", which allows you to
browse the EMBL and Swissprot databases in the GCG environment.
Types of sequence files in the GCG environment
Recommended reading: User's Guide (UNIX), Wisconsin Package
· Database sequences
· Single sequence files
· List files
· Multiple sequence format files
Database sequences
Specifying database sequences in the GCG environment:
Database                            Valid names
____________________________________________________________
All DNA sequences                   em embl
Bacterial                           bacterial pro bact ba
EST                                 est
Fungi                               fungi fun
Invertebrates                       in inv
Other Mammalian                     om
Organelle                           organelle org or
Other Vertebrate                    ov
Patent                              patent pat
Phage                               phage ph
Plant                               pl plant
Primate                             primate pr pri
Rodent                              rodent ro rod
STS                                 sts
Synthetic                           synthetic syn
Unannotated                         unannotated un
Viral                               viral vir
EMBL-New, all divisions
  (since latest release)            em_new
Swiss-Prot                          swissprot sw
Swiss-Prot (since latest release)   swissprotnew
A specific sequence in these databases may be accessed using the syntax
database:entry, where entry is the accession code (AC) or the ID.
Examples:
em_ba:BA09230
pro:BA09230
pro:U09230
em:U09230
bact:U09230
all refer to the same sequence, where U09230 is the accession code and
BA09230 is the ID.
Swissprot entries are referred to in the same way: sw:ftsy_ecoli and
sw:p10121 are the same sequence (ID and AC respectively). The part of the
Swissprot ID after the underscore is an abbreviation for the organism.
It is also possible to use wildcards to specify a set of sequences. Examples:
em:*        the entire EMBL database
ba:*        the bacterial section of EMBL
em:a*       all entries of EMBL beginning with "a"
sw:*        all Swissprot entries
sw:*ecoli   all E. coli sequences of Swissprot
sw:*human   all human sequences of Swissprot
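To make the wildcard idea concrete, here is a small Python sketch that
selects entry names with a "database:pattern" specification. It only imitates
the behaviour described above using the standard fnmatch module; it is not
how GCG itself looks up sequences. The function name, the tiny toy
"database", and the assumption that names are written in lower case are
introduced just for the example.

```python
from fnmatch import fnmatchcase

def select_entries(spec, databases):
    """Return the entry names matched by a specification such as 'sw:*ecoli'.

    databases maps a database name (e.g. 'sw') to a list of entry names.
    """
    database, _, pattern = spec.partition(":")
    return [name for name in databases[database] if fnmatchcase(name, pattern)]

# Toy data: four Swissprot-style IDs.
toy = {"sw": ["ftsy_ecoli", "reca_ecoli", "hs71_human", "hs72_yeast"]}

print(select_entries("sw:*ecoli", toy))   # the two E. coli entries
print(select_entries("sw:*", toy))        # everything in the toy database
```

Because the organism abbreviation comes last in a Swissprot ID, a pattern
such as *ecoli or *human picks out all entries from one organism.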
GCG sequence files
GCG has its own format for sequence files. Here is a simple example of a
sequence file:
This is just a test sequence
test.seq  Length: 59  June 22, 1994 19:00  Type: N  Check: 9968  ..
       1  atctagcacg cgtgcccgat gcatgctgca cgatcgatcg tgtagtcgat
      51  gctagctag
At the top is a comment line. After that comes a GCG-specific line containing
the name of the sequence, the length, the date, the type of sequence
(N=nucleotide, P=protein) and a checksum. The purpose of the checksum is to
ensure that the sequence is correct and not corrupted. You can NOT use a
standard text editor to edit a GCG sequence file. You have to use the GCG
editor programs to ensure that you get the proper format and checksum.
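To show what such a checksum is, here is a Python sketch of the checksum
scheme that is commonly described for GCG sequence files: each symbol's
character code (in upper case) is multiplied by a weight that cycles from 1
to 57, and the sum is taken modulo 10000. That this is exactly what the
package does internally is an assumption, but the scheme does reproduce the
value in the example file above.

```python
def gcg_checksum(sequence):
    """Weighted sum of upper-case character codes, weights cycling 1..57, modulo 10000."""
    total = 0
    for position, symbol in enumerate(sequence):
        total += (position % 57 + 1) * ord(symbol.upper())
    return total % 10000

# The 59-base test sequence from the example file, without blanks.
test_seq = ("atctagcacg" "cgtgcccgat" "gcatgctgca"
            "cgatcgatcg" "tgtagtcgat" "gctagctag")
print(len(test_seq), gcg_checksum(test_seq))   # prints: 59 9968
```

Changing, adding or deleting a single base changes the sum, which is why a
file edited by hand no longer agrees with its stored checksum.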
List files
An important GCG concept is "list files". A list file is a text file
containing a list of sequence files. Here is a simple example:
..
first.seq
second.seq
third.seq
The file begins with two dots. Anything above this line is treated as a
comment. After the two dots follows a list of sequence files, one per line.
The entries do not have to be sequence files in your local directory; they
may also be sequences located elsewhere, database entries and even other
list files. Here is a more complicated example:
This is an example of a list file
..
myown.seq               sequence file in this directory
/u/medkem/short.seq     sequence file in another directory
inv:ctbr6               database sequence
@long.listfile          another list file
Or this short and simple, yet very useful list file:
Mammals.listfile - allows me to search ALL mammal sequences in EMBL
at the same time.
..
hum:*
rod:*
mam:*
List files are very efficient when working with several sequences at the same
time. Several GCG programs create list files as output. For instance, FASTA
produces a list file that can be used as input to PILEUP to generate a
multiple sequence alignment from the sequences in the FASTA output file. You
may also create your own list files using a text editor.
In a GCG program you may refer to all sequences listed in a list file by
using the "@"-sign. As an example, look at the following question in
STRINGSEARCH:
STRINGSEARCH through what sequence(s) (* GenEMBL *) ? @long.listfile
The answer in this case means that you want to search inside all sequences
listed in the list file called "long.listfile".
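The structure of a list file is simple enough to read with a few lines of
Python. The sketch below follows only the rules stated above (everything up
to the line with the two dots is a comment; after that the first word on
each line names a sequence file, a database entry or, with "@", another list
file). It is an illustration with made-up function and category names, not a
replacement for the GCG programs, which may accept more variations.

```python
def parse_list_file(text):
    """Return (kind, name) pairs for the entries that follow the '..' line."""
    lines = text.splitlines()
    # Everything up to and including the line consisting of two dots is a comment.
    start = next(i for i, line in enumerate(lines) if line.strip() == "..") + 1
    entries = []
    for line in lines[start:]:
        words = line.split()
        if not words:
            continue
        name = words[0]          # the rest of the line is treated as a remark
        if name.startswith("@"):
            entries.append(("list file", name[1:]))
        elif ":" in name:
            entries.append(("database entry", name))
        else:
            entries.append(("sequence file", name))
    return entries

example = """This is an example of a list file
..
myown.seq               sequence file in this directory
/u/medkem/short.seq     sequence file in another directory
inv:ctbr6               database sequence
@long.listfile          another list file
"""
for kind, name in parse_list_file(example):
    print(kind, name)
```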
Multiple sequence format files
Multiple sequence alignments like those created by PILEUP are stored in a
specific format (multiple sequence format, MSF). An example of such a file
is:
..
Name: rmet  Len: 1410  Check:  701  Weight: 1.00
Name: mmet  Len: 1410  Check: 9020  Weight: 1.00
Name: hmet  Len: 1410  Check: 7912  Weight: 1.00
//
      1                                                   50
rmet  MKAPTALAPG ILLLLLTLAQ RSHGECKEAL VKSEMNVNMK YQLPNFTAET
mmet  MKAPTVLAPG ILVLLLSLVQ RSHGECKEAL VKSEMNVNMK YQLPNFTAET
hmet  MKAPAVLAPG ILVLLFTLVQ RSNGECKEAL AKSEMNVNMK YQLPNFTAET
      51                                                 100
rmet  PIHNVVLPGH HIYLGATNYI YVLNDKDLQK VSEFKTGPVV EHPDCFPCQD
mmet  PIQNVVLHGH HIYLGATNYI YVLNDKDLQK VSEFKTGPVL EHPDCLPCRD
.
.
.
.
      1301                                              1350
rmet  APPYPDVNTF DITIYLLQGR RLLQPEYCPD ALYEVMLKCW HPKAEMRPSF
mmet  APPYPDVNTF DITIYLLQGR RLLQPEYCPD ALYEVMLKCW HPKAEMRPSF
hmet  APPYPDVNTF DITVYLLQGR RLLQPEYCPD PYEVMLKCW HPKAEMRPSF
      1351                                              1400
rmet  SELVSRISSI FSTFIGEHYV HVNATYVNVK CVAPYPSLLP SQDNIDGEAN
mmet  SELVSRISSI FSTFIGEHYV HVNATYVNVK CVAPYPSLLP SDNIDGEGN
hmet  SELVSRISAI FSTFIGEHYV HVNATYVNVK CVAPYPSLLS SEDNADDEVD
      1401
rmet  T.........
mmet  T.........
hmet  TRPASFWETS
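As the example shows, an MSF file stores the alignment in interleaved blocks:
each sequence name is repeated in every block, followed by the next piece of
that sequence. The Python sketch below puts the pieces back together. It
assumes only the layout seen above ("Name:" lines, a "//" separator, then
blocks of name plus residues) and uses a shortened toy alignment; the
function name is made up, and real MSF files may contain details this sketch
does not handle.

```python
def read_msf(text):
    """Collect the aligned sequences of an MSF-style text into a dictionary."""
    sequences = {}
    in_alignment = False
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        if not in_alignment:
            if words[0] == "Name:":
                sequences[words[1]] = ""       # register the sequence name
            elif words[0] == "//":
                in_alignment = True            # the alignment blocks start here
        elif words[0] in sequences:
            # Join the groups of ten symbols and append them to this sequence.
            sequences[words[0]] += "".join(words[1:])
        # Other lines (the position numbers above each block) are skipped.
    return sequences

toy_msf = """..
Name: rmet  Len: 20  Check: 0  Weight: 1.00
Name: mmet  Len: 20  Check: 0  Weight: 1.00
//
      1          10
rmet  MKAPTALAPG
mmet  MKAPTVLAPG

      11         20
rmet  ILLLLLTLAQ
mmet  ILVLLLSLVQ
"""
print(read_msf(toy_msf))
```

The Len and Check values in the toy file are placeholders and are not
verified by the sketch.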
Managing the GCG environment - FTP (File transfer protocol). Moving files
between the personal computer and the GCG environment.
Transfer of files from one computer to another is often done with FTP (file
transfer protocol).
Different FTP programs are available. Examples are CuteFTP (PC) and Fetch
(Mac). FTP is also possible with a web browser like Netscape. When you
transfer a sequence from your PC or Mac to the GCG computer, remember that
the sequence should be in text format. For instance, if you have used Word to
produce a document with a sequence, it is important to save that document as
a text file and not in the standard (binary) Word format (Save as ... Select
Text format).
There are two forms of FTP:
- Anonymous FTP
  User      anonymous
  Password  [your email address]
  Example with Netscape: ftp://ftp.gu.se
- Personal FTP
  User      [your account username]
  Password  [password associated with the username]
  Example with Netscape: ftp://kurs1@gcg.mednet.gu.se (Enter password)
  Upload a file with Netscape: Select File - Upload
FTP is also possible using the command line from a PC or from a UNIX
environment.
Some commands (command line)
!        escape to the shell
ascii    set ascii transfer type
binary   set binary transfer type
bye      terminate ftp session and exit
cd       change remote working directory
cdup     change remote working directory to parent directory
dir      list contents of remote directory
get      receive file
image    set binary transfer type
ls       list contents of remote directory
idle     get (set) idle timer on remote side
mget     get multiple files
mkdir    make directory on the remote machine
mput     send multiple files
prompt   force interactive prompting on multiple commands
put      send one file
pwd      print working directory on remote machine
quit     terminate ftp session and exit
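The same kind of transfer can also be scripted. The Python sketch below uses
the standard ftplib module to send a text file in ASCII mode, which
corresponds to giving the commands "ascii" and "put" in a command line FTP
session. The helper function name is made up for this example, and the host
and user names in the commented session are simply the ones used elsewhere
in this text; the session lines are not executed here because they need a
network connection and a valid password.

```python
from ftplib import FTP

def upload_text(ftp, local_path, remote_name):
    """Send a text file in ASCII mode over an FTP connection that is already logged in."""
    # storlines reads the file line by line; it expects a file opened in binary mode.
    with open(local_path, "rb") as handle:
        ftp.storlines("STOR " + remote_name, handle)

# Example session (assumes a valid account and password):
#   ftp = FTP("gcg.mednet.gu.se")
#   ftp.login("kurs1", "your-password")
#   upload_text(ftp, "min.seq", "min.seq")
#   ftp.quit()
```

Sending sequence files as text rather than binary lets the FTP software
convert line endings between the PC or Mac and the UNIX computer.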
Managing the GCG environment - Some useful commands in UNIX
What files are there?
In your local directory you have a number of files, for instance sequences
that you use with the GCG package. To list all files in your directory:
% ls
The option -l gives more information about the files, and the option -a also
lists those files that have a dot as the first character.
% ls -al
A wildcard (*) may be used to list a number of files with characters in
common. For instance, the command below lists all files with the extension
"seq":
% ls *.seq
To copy a file
Sometimes you want to make a copy of a file. For instance, you may want to
store the result of a GCG program under a specific name so that the
information is not accidentally overwritten by a new run of the same
program.
For instance:
% cp test.seq test2.seq
To check the result of
this operation use the command:
% ls test*
To erase a file
To delete a file use the
command rm, for instance
% rm test2.seq
will delete the file
"test2.seq". Again, use ls to check the result.
To make a new directory
Sooner or later your
working directory gets crowded with files and you may want to organize your
work. You create subdirectories with mkdir. For instance,
% mkdir mydir
will create the directory
"mydir".
To move to this directory
from your home directory type
% cd mydir
To go back to the
directory higher up:
% cd ..
(The directory
"mydir" may be removed by "rmdir mydir")
If you are lost and do
not know in what directory you are the command
% pwd
will present the current
directory.
The command
% cd
will always return you to
your home directory, i.e the directory you are in when you have logged in to
the computer.
If you want to rename a
file or move it to another directory, use the command "mv". For
instance
% mv test.seq test2.seq
will rename the file
"test.seq" to "test2.seq".
% mv test.seq mydir
will move the file
"test.seq" to the subdirectory "mydir"
To look at the contents
of a file
The commands cat
and less will list the contents of a text file. For instance
% cat test.seq
will present the contents
of "test.seq", a nucleotide sequence in GCG format. If you want to
examine a larger file it may be more convenient to use the command less. This
command will list the contents just as cat, but one page at a time. Use
the SPACE key to move to the next page, "u" to move backwards, and
"q" to quit.
% less test.seq
File editing
Different editors are available in UNIX. Examples of text editors that run
from a simple text terminal are "vi" and "emacs". Examples of graphical text
editors that are based on X Windows are "jot" and "nedit". The latter
editors are very easy to use but require an X Windows environment.
Do the following exercise with the editor "nedit":
% nedit &
In the resulting window, type a short nucleotide sequence from the keyboard.
Save the file under a name, for instance "min.seq". Exit "nedit" (Select
File - Quit). (You may use % cat min.seq to see the result of the file
editing.)
To convert this sequence to the format used by the GCG programs, type:
% reformat min.seq
(You must have issued the % gcg command before using "Reformat", as it is a
GCG program and not a regular UNIX command. Type % cat min.seq to see the
result of the reformat procedure.)
Changing the password
% passwd
Answer the questions:
Old password:
New password:
Retype new password:
Logging out
To end a UNIX session:
% logout
or
% exit
For more information on
UNIX commands see the GCG UNIX manual.
Overview of GCG programs
COMPARISON
DATABASE_SEARCHING
DISPLAY
EDITING
EVOLUTIONARY_ANALYSIS
FILE_UTILITIES
FRAGMENT_ASSEMBLY
MANIPULATION
MAPPING
MISCELLANEOUS
MULTIPLE_SEQUENCE_ANALYSIS
PATTERN_RECOGNITION
PROTEIN_ANALYSIS
RNA_SECONDARY_STRUCTURE
SEQUENCE_EXCHANGE
TRANSLATION
DATABASE_SEARCHING
You can search through the nucleic acid or protein sequence databases
for sequences similar to your sequence or sequence pattern. You can
also search the documentation in the databases for specific text
patterns. Any database sequence and its associated documentation can
be retrieved and written into a personal file. You can make a data
library of your own sequences.
BLAST LOOKUP FASTA TFASTA WORDSEARCH
SEGMENTS FRAMESEARCH FINDPATTERNS STRINGSEARCH NAMES
FETCH DATASET ToBLAST DBINDEX
COMPARISON
Sequences can be compared and the comparisons can be visualized
graphically with dot-plots or optimal alignments.
COMPARE DOTPLOT BESTFIT GAP GAPSHOW FRAMEALIGN
OVERLAP NOOVERLAP
DISPLAY
The Display programs make publication-quality displays of plasmids,
manuscripts, figures, and sequences. The output from any GCG
graphics program can be made into a figure and incorporated into a
manuscript that prints out on a PostScript laser printer. Figure and
Red are the programs we use to publish our documentation.
PLASMIDMAP FIGURE RED PUBLISH
EDITING
An editor lets you enter or modify protein or nucleic acid sequences.
To make the entry of nucleic acid sequence data easier, the keys on
the keyboard can be remapped or a digitizer can be used.
SEQED SETKEYS
EVOLUTIONARY_ANALYSIS
The programs in this chapter allow you to investigate the
relationships within a group of aligned sequences. You can compute
the pairwise distances between sequences in an alignment, reconstruct
phylogenetic trees using distance methods, and calculate the degree
of divergence of two protein coding regions.
DISTANCES GROWTREE NEWDIVERGE DIVERGE
FILE_UTILITIES
These utilities act on text files.
LPRINT REPLACE COMPRESSTEXT ONECASE SHIFTOVER
DETAB CHOPUP CRYPT LISTFILE GETTEXT
COUNT EXAMINE FILECHECK
FRAGMENT_ASSEMBLY
The Fragment Assembly programs let you enter and assemble overlapping
nucleotide sequence fragments into one continuous sequence. These
programs are based on the method of Staden (Nucleic Acids Research
8(16); 3673-3694 (1980)).
GELSTART GELENTER GELMERGE GELASSEMBLE GELVIEW GELDISASSEMBLE
MAPPING
Restriction digests can be calculated and displayed by several
different programs. A program is provided to simulate RNA
fingerprints. You can also identify sites in sequences that would be
ideal for priming.
MAP MAPPLOT MAPSORT PRIME FINGERPRINT
MULTIPLE_SEQUENCE_ANALYSIS
The first four programs in this chapter allow you to create, edit,
and display multiple sequence alignments. Another program allows you
to represent the multiple sequence alignment as a profile, defining
conserved or variable regions. The profile can be used to search the
databases for similar sequences or in a pair-wise comparison with
another sequence.
PILEUP LINEUP PRETTY PLOTSIMILARITY
Profile_Analysis PROFILEMAKE PROFILESEARCH PROFILESEGMENTS
PROFILEGAP
PATTERN_RECOGNITION
These programs help you recognize peptide coding regions,
terminators, repeats, and other consensus patterns. Several of the
programs are for the analysis of sequence composition.
CODONPREFERENCE TESTCODE FRAMES REPEAT
WINDOW STATPLOT COMPOSITION TERMINATOR
CONSENSUS FITCONSENSUS CODONFREQUENCY CORRESPOND
PROTEIN_ANALYSIS
Most GCG programs work on either protein or nucleotide sequences.
The programs in this chapter, however, do analyses specific to
protein sequences. The first two programs identify sequence motifs
in protein sequences. The next three programs make predictions about
peptide isolation. The last five look at secondary structure,
hydrophobicity, and antigenicity.
MOTIFS PROFILESCAN PEPTIDESORT ISOELECTRIC
PEPTIDEMAP PEPPLOT PEPTIDESTRUCTURE PLOTSTRUCTURE
MOMENT HELICALWHEEL
RNA_SECONDARY_STRUCTURE
Zuker's MFold program predicts optimal and suboptimal RNA secondary
structures, which then can be displayed graphically in any of six
different ways with the PlotFold program. Zuker's older FoldRNA
program predicts optimal RNA secondary structure, which then can be
displayed graphically in any of five different ways. StemLoop looks
for inverted repeats.
MFOLD PLOTFOLD FOLDRNA SQUIGGLES CIRCLES DOMES MOUNTAINS
STEMLOOP
SEQUENCE_EXCHANGE
These utilities convert sequences from one format to another so that
they can be used with different sequence analysis packages.
REFORMAT FROMSTADEN FROMEMBL FROMGENBANK FROMPIR FROMIG
FROMFASTA TOSTADEN TOPIR TOIG TOFASTA GETSEQ
SPEW
TRANSLATION
These programs translate nucleic acids into proteins and proteins
back into nucleic acids.
TRANSLATE BACKTRANSLATE EXTRACTPEPTIDE PEPDATA
A selection of the most important GCG programs
Compare two sequences
· COMPARE
· DOTPLOT
· BESTFIT
· GAP
· FRAMEALIGN
Database searches
· BLAST
· FASTA
· TFASTA
· FRAMESEARCH
· FINDPATTERNS
· STRINGSEARCH
Mapping (Restriction and other)
· MAP
· MAPPLOT
· MAPSORT
· PLASMIDMAP
· PRIME
Multiple sequence alignment
· PILEUP
· PRETTY
· PLOTSIMILARITY
Profile Analysis
· PROFILEMAKE
· PROFILESEARCH
· PROFILESEGMENTS
Pattern recognition
· CODONPREFERENCE
· FRAMES
· REPEAT
· COMPOSITION
· TERMINATOR
· CONSENSUS
· CODONFREQUENCY
Protein analysis
· MOTIFS
· PEPTIDESORT
· PEPTIDEMAP
· HELICALWHEEL
RNA folding
· MFOLD
· PLOTFOLD
· FOLDRNA
· SQUIGGLES
· STEMLOOP
Conversion between sequence formats
· REFORMAT
· FROMEMBL
· FROMGENBANK
· FROMFASTA
· TOSTADEN
· TOFASTA
Translation
· TRANSLATE
· BACKTRANSLATE
Formatting for publishing
· PUBLISH
COMPARE
Compare is the first program of a two-program set that produces
dot-plots. Compare compares two sequences and writes a file of the
points where matches of a certain quality are found. The points in
the output file can be plotted with the DotPlot program.
Dot-plotting is the best method in the Wisconsin Sequence Analysis
Package(TM) for comparing two sequences when you suspect that there
could be more than one segment of similarity between the two.
Compare makes a file with the coordinates of each point where two
sequences are similar. The sequences are compared in every possible
register and a point is added to the file wherever some match
criterion for similarity is met. The match criterion can be met in
two different ways:
The standard way compares two sequences in every register,
searching for all the places where a given number of matches
(stringency) occur within a given range (window). See Maizel
and Lenk (1981) "Enhanced Graphic Matrix Analysis of Nucleic
Acid and Protein Sequences" Proc. Natl. Acad. Sci. USA 78;
7665-7669 for a description of the matrix analysis of biological
sequences.
The other way to find points of similarity is to search for
short perfect matches of some set length. Short perfect matches
are referred to as words. The word comparison between two
sequences is about 1,000 times faster than the window/stringency
match described above, but it requires that the sequences
contain short perfect matches for any similarity to be found.
Word comparison is discussed in detail by Wilbur and Lipman
(1983) "Rapid Similarity Searches of Nucleic Acid and Protein
Data Banks" Proc. Natl. Acad. Sci. USA 80; 726-730. The
authors refer to a word as a k-tuple. Compare does a word
comparison if it is run with the command-line option -WORdsize.
You may limit the number of points that Compare finds with the
command-line option -LIMit.
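The two match criteria can be illustrated with a short Python sketch. It is a
simplified model and not the Compare program itself: identical symbols count
as matches (no scoring table is used), positions are numbered from 0, and the
function names and the two short test sequences are invented for the example.

```python
def window_matches(a, b, window, stringency):
    """Window/stringency criterion: report (i, j) wherever at least `stringency`
    of the `window` aligned symbols starting at a[i] and b[j] are identical."""
    points = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= stringency:
                points.append((i, j))
    return points

def word_matches(a, b, wordsize):
    """Word criterion: report (i, j) wherever a[i:] and b[j:] share a perfect
    match of length `wordsize`, found by first indexing the words of b."""
    index = {}
    for j in range(len(b) - wordsize + 1):
        index.setdefault(b[j:j + wordsize], []).append(j)
    return [(i, j)
            for i in range(len(a) - wordsize + 1)
            for j in index.get(a[i:i + wordsize], [])]

a, b = "GATTACA", "GATCACA"
print(window_matches(a, b, window=4, stringency=3))
print(word_matches(a, b, wordsize=3))
```

The window method compares every position of one sequence with every
position of the other, whereas the word method only looks up each word of
the first sequence in a table built from the second. This is the reason for
the large difference in speed, and also the reason why the word method
misses similarities that contain no perfect match of the chosen length.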
DOTPLOT
Dot-plotting is the best way to see all of the structures in common
between two sequences or to visualize all of the repeated or inverted
repeated structures in one sequence. DotPlot is the second part of a
two-part set of programs that generate dot-plots of the points of
similarity between two sequences. (See Maizel and Lenk (1981)
"Enhanced Graphic Matrix Analysis of Nucleic Acid and Protein
Sequences" Proc. Natl. Acad. Sci. (USA) 78(12); 7665-7669.)
Compare writes a point file with the coordinates of the points in
common between two sequences. DotPlot plots those points on a
plotter or a graphics terminal.
DotPlot calculates the minimum density in bases per 100 platen units
along either axis that would allow all of the points to be plotted on
a single page. At high densities (for example, 1,000 bases per 100
platen units), the dots will not be individually resolved.
You can select any density you want and DotPlot divides the plot into
as many pages as it takes to plot the whole file. Before you decide
to go ahead, DotPlot tells you how many pages it would take at your
chosen density, and if you have chosen your density interactively,
DotPlot gives you a chance to change your mind. Look at the output
suggestions below for more help on the format.
BESTFIT
BestFit inserts gaps to obtain the optimal alignment of the best
region of similarity between two sequences, and then displays the
alignment in a format similar to the output from Gap. The sequences
can be of very different lengths and have only a small segment of
similarity between them. You could take a short RNA sequence, for
example, and run it against a whole mitochondrial genome.
GAP
Gap considers all possible alignments and gap positions and creates
the alignment with the largest number of matched bases and the fewest
gaps. You provide a gap creation penalty and a gap extension penalty
in units of matched bases. In other words, Gap must make a profit of
gap creation penalty number of matches for each gap it inserts. If
you choose a gap extension penalty greater than zero, Gap must, in
addition, make a profit for each gap inserted of the length of the
gap times the gap extension penalty. Typical values to use as a
point of departure for the gap creation and gap extension penalties
are 5.0 and 0.3, respectively, for nucleic acid sequence comparisons,
and 3.0 and 0.1, respectively, for protein sequence comparisons. Gap
uses the alignment method of Needleman and Wunsch (J. Mol. Biol. 48;
443-453 (1970)) that has been shown to be equivalent to Sellers (see
note below).
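How the two gap penalties enter the calculation can be seen in the following
Python sketch of a global alignment score with a gap creation and a gap
extension penalty. It is a textbook-style dynamic programming scheme written
for this illustration, not the Gap program: a match scores 1 and a mismatch 0
(no scoring table), gaps at the ends of the sequences are penalized like any
other gap, and only the score is computed, not the alignment itself. A gap of
length L costs creation + L * extension, following the description above.

```python
NEG = float("-inf")

def gap_score(a, b, creation=5.0, extension=0.3):
    """Best global alignment score of a and b with affine gap penalties."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned with b[j-1]; X: a[i-1] opposite a gap; Y: b[j-1] opposite a gap.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(creation + i * extension)
    for j in range(1, m + 1):
        Y[0][j] = -(creation + j * extension)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1.0 if a[i - 1] == b[j - 1] else 0.0
            M[i][j] = match + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a new gap costs the creation penalty; every gap symbol costs extension.
            X[i][j] = max(M[i - 1][j] - creation, Y[i - 1][j] - creation, X[i - 1][j]) - extension
            Y[i][j] = max(M[i][j - 1] - creation, X[i][j - 1] - creation, Y[i][j - 1]) - extension
    return max(M[n][m], X[n][m], Y[n][m])

print(gap_score("ACGT", "ACGT"))
print(gap_score("ACGTACGT", "ACGTCGT", creation=1.0, extension=0.1))
```

In the second call the best alignment has seven matches and one gap of
length one, so inserting the gap "pays off" only because the matches gained
outweigh the penalty of 1.0 + 0.1.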
FRAMEALIGN
FrameAlign inserts gaps to obtain the optimal local alignment of the
best region of similarity between a protein sequence and the codons
in a nucleotide sequence. Because FrameAlign can align the protein
to codons in different reading frames of the nucleotide sequence, it
can identify sequence similarity even when the nucleotide sequence
contains reading frame shifts.
In standard sequence alignment programs, you routinely specify gap
creation and extension penalties. In addition to these penalties,
FrameAlign also allows you to specify a separate frameshift penalty
for the creation of gaps that result in reading frame shifts in the
nucleotide sequence. (See the ALGORITHM topic for a more detailed
explanation of how gaps are penalized.)
By default, FrameAlign creates a local alignment between the
nucleotide and protein sequences. If you specify the -GLObal
command-line qualifier, FrameAlign creates a global alignment where
gaps are inserted to optimize the alignment between the entire
nucleotide sequence and the entire protein sequence.
BLAST
BLAST, or Basic Local Alignment Search Tool, uses the method of
Altschul et al. (J. Mol. Biol. 215; 403-410 (1990)) to search for
similarities between a query sequence and all the sequences in a
database. The query sequence and the database you want to search can
be either protein or nucleic acid in any combination. The GCG BLAST
program supports five different programs in the BLAST family:
BLASTP, Protein Query Searching a Protein Database
Each database sequence is compared to the query in a separate
protein-protein pairwise comparison.
BLASTX, Nucleotide Query Searching a Protein Database
The query is translated, and each of the six products is
compared to each database sequence in a separate protein-protein
pairwise comparison.
BLASTN, Nucleotide Query Searching a Nucleotide Database
Each database sequence is compared to the query in a separate
nucleotide-nucleotide pairwise comparison.
TBLASTN, Protein Query Searching a Nucleotide Database
Each nucleotide database sequence is translated, and each of the
six products is compared to the query in a separate
protein-protein pairwise comparison.
TBLASTX, Nucleotide Query Searching a Nucleotide Database
The query and each database sequence are both translated in six
frames, and each of the 12 products is compared in 36 different
pairwise comparisons. This program is compute-intensive and is
therefore limited to searches of Alu, STS, and EST sequences.
Normally, BLAST decides which BLAST program you want to use simply by
looking at the type (protein or nucleic acid) of your query sequence
and the database you have selected. In the case of
nucleotide-nucleotide searches, there is more than one program that
can do the search. Adding -TBLASTX on the command line means you
want to use TBLASTX instead of BLASTN.
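Several of the programs above translate a nucleotide sequence in six reading
frames: three on the given strand and three on the complementary strand. The
Python sketch below shows what this means. It uses the standard genetic code,
writes stop codons as "*" and codons containing other symbols as "X"; the
function names and frame labels are choices made for this example and are
not taken from BLAST.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
               for i, b1 in enumerate(BASES)
               for j, b2 in enumerate(BASES)
               for k, b3 in enumerate(BASES)}

def translate(dna):
    """Translate complete codons from the first base; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def reverse_complement(dna):
    """Return the complementary strand, read in its own 5' to 3' direction."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frames(dna):
    """Return the six translation products, labelled +1, +2, +3, -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(seq[offset:])
    return frames

for label, peptide in six_frames("ATGGCCTAA").items():
    print(label, peptide)
```

Each of the six products is then treated as an ordinary protein sequence in
the comparison, which is why a translated search costs several times as much
as a plain protein-protein search.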
BLAST either can search databases maintained at your institution (a
local search), or if you are attached to the Internet, it can search
databases maintained by NCBI (a remote search). Remote searches
require almost no resources from your own computer. More
importantly, the databases at NCBI are updated daily and may be more
current than those maintained locally.
BLAST is a statistically driven search method that finds regions of
similarity between your query and a database. Within these regions
of similarity, called segment pairs, the sum of the scoring matrix
values of their constituent symbol pairs is higher than some level
that you would expect to occur by chance alone.
You are prompted to set an expectation level for the entire search.
By default this level is 10.0, which means that hits are not reported
unless their scores are above scores that would be expected to occur
more than 10 times by chance alone in the whole search.
FASTA
FastA uses the method of Pearson and Lipman (Proc. Natl. Acad. Sci.
USA 85; 2444-2448 (1988)) to search for similarities between one
sequence (the query) and any group of sequences. In the first step
of this search, the comparison can be viewed as a set of dot plots,
with the query as the vertical sequence and the group of sequences to
which the query is being compared as the different horizontal
sequences. This first step finds the registers of comparison
(diagonals) having the largest number of short perfect matches
(words) for each comparison. In the second step, these "best"
regions are rescored using a scoring matrix that allows conservative
replacements, ambiguity symbols, and runs of identities shorter than
the size of a word. In the third step, the program checks to see if
some of these initial highest-scoring diagonals can be joined
together. Finally, the search set sequences with the highest scores
are aligned to the query sequence for display.
What is a Word?
A word is any short sequence (n-mer or k-tuple) where you have
set n to some small constant less than or equal to six. The
word GGATGG is one of the 4,096 possible words of length 6 that
can be created from an alphabet consisting of the four letters
G, A, T, and C. The word QL is one of the 400 possible words of
length 2 that you can make with the 20 letters of the amino acid
alphabet.
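The word counts quoted above, and the first FastA step of counting perfect word matches per diagonal, can be sketched like this. It is a toy version for a single query/target pair, and the function names are hypothetical; it is not the FastA implementation.

```python
from collections import Counter


def words(seq, k):
    """Yield (position, word) for every overlapping word of length k."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def diagonal_word_counts(query, target, k):
    """Count perfect word matches on each diagonal (target pos - query pos)."""
    positions = {}
    for i, w in words(query, k):
        positions.setdefault(w, []).append(i)
    counts = Counter()
    for j, w in words(target, k):
        for i in positions.get(w, ()):
            counts[j - i] += 1
    return counts


# Number of possible nucleotide words of length 6 and peptide words of length 2.
print(4 ** 6, 20 ** 2)

counts = diagonal_word_counts("GGATGGCA", "TTGGATGGCA", 4)
print(counts.most_common(1))
```

The diagonal with the most word matches is the register of comparison that would be passed on for rescoring in the second step.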
TFASTA
TFastA uses the method of Pearson and Lipman (Proc. Natl. Acad. Sci.
USA 85; 2444-2448 (1988)) to search for similarities between a query
peptide sequence and any group of nucleotide sequences. TFastA
translates the nucleotide sequences in all six frames before
performing the comparison. Each translated reading frame is treated
as a separate sequence to be searched. In the first step of this
search, the comparison can be viewed as a set of dot plots, with the
query as the vertical sequence and the group of sequences to which
the query is being compared as the different horizontal sequences.
This first step finds the registers of comparison (diagonals) having
the largest number of short perfect matches (words) for each
comparison. In the second step, these "best" regions are rescored
using a scoring matrix that allows conservative replacements,
ambiguity symbols, and runs of identities shorter than the size of a
word. In the third step, the program checks to see if some of these
initial highest-scoring diagonals can be joined together. Finally,
the search set sequences with the highest scores are aligned to the
query sequence for display.
FRAMESEARCH
FrameSearch searches a group of protein sequences for similarity to
one or more nucleotide query sequences, or searches a group of
nucleotide sequences for similarity to one or more protein query
sequences. For each sequence comparison, the program creates the
optimal local alignment of the best region of similarity between the
protein sequence and all possible codons on each strand of the
nucleotide sequence. Because FrameSearch can match the protein to
codons in different reading frames of the nucleotide sequence as part
of the same alignment, it can identify sequence similarity even when
the nucleotide sequence contains reading frame shifts.
In standard sequence alignment programs, you routinely specify gap
creation and extension penalties. In addition to these penalties,
FrameSearch also allows you to specify a separate frameshift penalty
for the creation of gaps that result in reading frame shifts in the
nucleotide sequence. (See the ALGORITHM topic for a more detailed
explanation of how gaps are penalized.)
By default, the search proceeds as a local alignment between the
query sequence and each sequence in the search set. Optionally, you
can search using a global alignment procedure where FrameSearch
inserts gaps to optimize the alignment between the entire nucleotide
sequence and the entire protein sequence.
The search output contains an ordered list of the sequences in the
search set that have the highest comparison scores when aligned to
the query sequence. The actual alignments for these top-scoring
matches are displayed after the list.
You can specify multiple query sequences (such as a list file or a
sequence specification using an asterisk (*) wildcard) as input to
FrameSearch. The program compares each query sequence separately to
the sequences specified in the search set, and it writes a separate
output file for each query search. If you use a list file as your
query, you can add begin and end sequence attributes to specify the
range for each query sequence. For more information about list
files, see "Using List Files (formerly Files of Sequence Names)" in
Chapter 2, Using Sequences in the User's Guide.
FINDPATTERNS
FindPatterns locates short sequence patterns. If you are trying to
find a pattern in a sequence or if you know of a sequence that you
think occurs somewhere within a larger one, you can find your place
with FindPatterns. FindPatterns can look through large data sets for
any short sequence patterns you specify. FindPatterns can recognize
patterns with some symbols mismatched but not with gaps. It supports
the IUB-IUPAC nucleotide ambiguity codes (see Appendix III) for
searching through nucleotide sequences.
FindPatterns searches both strands of a nucleotide sequence if the
patterns you specify are not identical on both strands. If your
sequence is a peptide, FindPatterns searches for a simple symbol
match between your pattern and the peptide sequence.
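A gapless pattern search with ambiguity codes and a mismatch allowance can be sketched as below. This is a minimal illustration of the idea, not the FindPatterns algorithm; the function names and the max_mismatch parameter are assumptions for this example, and a pattern symbol outside the IUPAC table would raise a KeyError.

```python
# IUB-IUPAC nucleotide ambiguity codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYMKSWBDHVN", "TGCAYRKMSWVHDBN")


def reverse_complement(pattern):
    """Reverse-complement a pattern, including ambiguity codes."""
    return pattern.upper().translate(COMPLEMENT)[::-1]


def find_pattern(seq, pattern, max_mismatch=0):
    """Return 0-based start positions where the pattern matches without gaps."""
    seq, pattern = seq.upper(), pattern.upper()
    hits = []
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        mismatches = sum(base not in IUPAC[code] for base, code in zip(window, pattern))
        if mismatches <= max_mismatch:
            hits.append(start)
    return hits


def find_both_strands(seq, pattern, max_mismatch=0):
    """Search the pattern and, only if it differs, its reverse complement."""
    results = {"forward": find_pattern(seq, pattern, max_mismatch)}
    rc = reverse_complement(pattern)
    if rc != pattern.upper():
        results["reverse"] = find_pattern(seq, rc, max_mismatch)
    return results


print(find_pattern("AAGAATTCGGGATTCAA", "GAATTC", max_mismatch=1))
print(find_both_strands("TTATCCGG", "GGATN"))
```

A palindromic pattern such as GAATTC reads the same on both strands, so the sketch searches it only once.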
FindPatterns names each file on the screen as it is searched. The
output file shows only sequences where a pattern was found unless you
use the command-line option -SHOw. Five symbols from the original
sequence are shown on either side of each "find." The word /Rev
occurs if the reverse of the pattern is found. If you run
FindPatterns with the command-line option -NAMes, the output file is
written in list file (formerly called file of sequence names) format,
which you can use as input to other Wisconsin Sequence Analysis
Package(TM) programs that support indirect file specifications.
When FindPatterns finishes searching for your patterns, it returns to
the first prompt in the program, FINDPATTERNS in what sequence(s) ?
If you simply press <Return> at the prompt, FindPatterns stops.
FindPatterns keeps writing its results in the same output file (or on
the screen). FindPatterns prints a short summary on your screen and
in the output file when the entire session is over.
STRINGSEARCH
Annotations and Definitions
In addition to the actual sequence data, GCG databases contain
two additional types of data: sequence annotations, and
definitions.
The annotations contain the complete documentation for each
entry in the sequence database, including journal and author
names, sequence features, comments, etc. The annotations appear
at the top of sequences copied from a GCG database with the
Fetch program.
The definitions contain a minimal amount of the annotations
documentation for each entry: the name of the organism, the name
of the gene, the sequence length, and usually the date.
Definitions for the GenBank, EMBL, and SWISS-PROT databases also
contain the primary accession number for the sequences. The
definitions are reprinted in the Data Files manual.
The StringSearch program searches through either the definitions
alone or the complete sequence annotations for text patterns
that you specify. Annotations take much longer to search than
definitions.
Searching Sequence Definitions
The expression % stringsearch GenBank:* human finds every entry
in the GenBank sequence database whose definition contains the
text pattern human. The databases available in addition to
GenBank are EMBL, SWISS-PROT, and PIR-Protein. GenEMBL
specifies the sequences in both GenBank and EMBL. Additionally,
definitions searches can be done on any of the individual
divisions in GenBank and/or EMBL. If you believe that a
published human sequence in the database is 1,531 bases long,
you can search for entries that contain both human and 1531.
When searching definitions, you can specify the set of sequences
you want to search in the same way as for all other Wisconsin
Sequence Analysis Package(TM) programs with the following
exception. The specified sequences must be contained in a
database; you cannot search the definitions of user sequences.
For instance, the specification Primate:hum* would search
through the definitions for all of the sequences in the Primate
division of GenBank that begin with the pattern hum. You may
also specify the database sequences to search by means of a list
file (formerly called a file of sequence names). Each sequence
in a list file must be preceded by a logical name for one of the
databases or database divisions. Sequence specification is
described in detail in Chapter 2, Using Sequences of the User's
Guide.
Searching Complete Sequence Annotations
When you are searching complete sequence annotations, you can
specify the set of sequences you want to search in the same way
as for all other Wisconsin Package(TM) programs. Sequence
specification is described in detail in Chapter 2, Using
Sequences of the User's Guide.
If your sequence specification is not preceded by a logical
name, StringSearch looks in all of the databases and in all of
the GCG data files to find all possible entry names. The
specification hum* would be almost the same as GenBank:hum*,
except that if any sequences beginning with hum were present in
databases other than GenBank, they would also be searched. A
search of all the entries in all the databases takes a very long
time.
Special Considerations for Searching
Keep in mind that file names are case sensitive and
database entry names are case insensitive. Because this
program searches for both file names and database entry
names, you must take care when you enter the character
pattern that makes up your specification.
For example, if you entered Gamma* as a file specification,
this program would find all entries in the databases whose
names begin with Gamma but no GCG-supplied files would be
found. This is because all the files in the Wisconsin
Package are named using lowercase letters. Conversely, if
you entered gamma*, this program would find all of the
entries in the databases and all the GCG-supplied files
whose names begin with gamma.
Searching for More Than One Pattern
You can search for more than one text pattern by answering the
Search for what text patterns? prompt with a response like
Human,Globin. StringSearch then finds all the entries that
contain both human and globin. You can set StringSearch to show
all the entries that contain either human or globin with the
command-line option -MATch=OR.
Specifying Patterns
Blank spaces are removed from the beginning of each pattern
unless that pattern is enclosed in double quotes. For instance,
specifying the pattern Globin shows all entries that contain
globin, while specifying " Globin" excludes entries containing
terms like myoglobin in which globin is not preceded by a space.
To specify a double quote (") as part of a pattern, use two
double quote marks (""). To specify a comma as part of a
pattern, enclose that pattern in quotes.
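The pattern rules above (comma-separated patterns, quoting to keep a leading blank, AND versus OR matching) can be imitated with a short sketch. It uses Python's csv module to approximate the quoting behaviour and is not how StringSearch itself is implemented; the function names and the sample definitions are invented for illustration.

```python
import csv


def parse_patterns(response):
    """Split a comma-separated response; quoted patterns keep leading blanks."""
    return next(csv.reader([response], skipinitialspace=True))


def definition_matches(definition, patterns, match="AND"):
    """Case-insensitive substring test using AND (default) or OR logic."""
    text = definition.lower()
    found = [p.lower() in text for p in patterns]
    return all(found) if match == "AND" else any(found)


# Hypothetical definition lines, not real database entries.
definitions = [
    "Human beta globin gene",
    "Human myoglobin mRNA",
    "Mouse alpha globin gene",
]

both = [d for d in definitions if definition_matches(d, parse_patterns("Human,Globin"))]
spaced = [d for d in definitions if definition_matches(d, parse_patterns('" Globin"'))]
print(both)
print(spaced)
```

The quoted pattern " Globin" skips the myoglobin entry because there globin is not preceded by a space.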
MAP
Map displays a sequence that is being assembled or analyzed
intensively. Map asks you to enter the names of those enzymes whose
restriction sites should be marked. If you do not answer this
question, Map generates a restriction map with a representative
isoschizomer from all of the commercially available enzymes. You can
choose to have your sequence translated in any of the six possible
translation frames. You can also choose to have only the open
reading frames translated.
After running Map, you may create a new sequence file with the
peptide sequence from any frame of DNA translation by using the
ExtractPeptide program with the Map output file.
MAPPLOT
MapPlot is a tool for genetic engineering. It helps you visualize
how part of a DNA molecule may be isolated. MapPlot uses color to
distinguish the types of overhang left after digestion (5' overhangs
are green, 3' overhangs are blue, blunt ends are black, and
undetermined overhangs are red). The site, cut position, and total
number of cuts are also shown for each enzyme. The enzymes that do
not cut are listed below the plot. You may choose to plot only
enzymes that have six base recognition sites or enzymes that cut the
molecule only once.
MAPSORT
MapSort is the best way to predict how the fragments of an enzyme
digest will look on a gel. You can tell at a glance which enzymes
cut a molecule in a given region and whether other fragments of
similar size could confound isolation of the fragment of interest.
You can concatenate your sequence with its vector before running
MapSort to see if a single step isolation is possible and you can
examine the pattern of fragments from a multi-enzyme digest. The
sequence can be treated as if it were circular or linear. You can
see which enzymes cut the sequence only once. Enzymes that cut the
sequence, as well as those that do not, are shown at the bottom of
the output. The output therefore contains a complete list of all the
enzymes considered. You can see the cut sites graphically with
MapPlot or PlasmidMap.
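The fragment sizes that a digest would produce can be estimated with a sketch like the one below. It assumes a complete digest and, for circular molecules, ignores recognition sites that span the origin. EcoRI (G^AATTC, cutting one base into the site) is used as the example enzyme. The function names are hypothetical and this is not MapSort's own code.

```python
def cut_positions(seq, site, cut_offset):
    """0-based top-strand cut positions for every occurrence of the site."""
    seq, site = seq.upper(), site.upper()
    cuts, start = [], seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    return cuts


def fragment_sizes(seq_len, cuts, circular=False):
    """Fragment lengths from a complete digest, sorted largest first."""
    cuts = sorted(set(cuts))
    if not cuts:
        return [seq_len]
    if circular:
        # Neighbouring cuts bound a fragment; the last one wraps to the first.
        sizes = [b - a for a, b in zip(cuts, cuts[1:])]
        sizes.append(seq_len - cuts[-1] + cuts[0])
    else:
        bounds = [0] + cuts + [seq_len]
        sizes = [b - a for a, b in zip(bounds, bounds[1:]) if b > a]
    return sorted(sizes, reverse=True)


seq = "AAAA" + "GAATTC" + "A" * 10 + "GAATTC" + "A" * 6
cuts = cut_positions(seq, "GAATTC", 1)
print(fragment_sizes(len(seq), cuts))                 # treated as linear
print(fragment_sizes(len(seq), cuts, circular=True))  # treated as circular
```

Two cuts give three fragments on a linear molecule but only two on a circular one, which is why the linear/circular choice changes the predicted gel pattern.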
PLASMIDMAP
PlasmidMap has two purposes: 1) to display and store information
about plasmid constructs; and 2) to publish or present information
about such constructs.
PlasmidMap reads information about a plasmid from an input file and
produces a labeled circular plot of that plasmid. The input file
specifies the position of the labels and their styles -- tick, block,
or range. An example of a suitable input file -- in this case
containing restriction site information -- is the output file from a
MapSort session run with the command-line option -PLAsmid.
PlasmidMap can accept more than one labeling file as input, plotting
the information from each file on the same circular map. You can
specify more than one file by using a name containing wildcard
characters or by using a file of filenames (a text file you create
that starts with a line containing two dots (..) and has one filename
per line thereafter). PlasmidMap simply transfers information from
these labeling files to graph paper; it does not map the sequence by
itself.
PlasmidMap can draw three kinds of labels:
Ticks
Tick labels are associated with a single base and are printed
next to a tick around the outside of the circle. Tick labels
have been used historically to identify restriction sites and
coordinates.
Blocks
Block labels are associated with blocks drawn inside the circle.
Blocks can be shaded or left empty.
Ranges
Range labels are associated with lines drawn inside the circle.
Ranges can start and end with arrow heads, arrow tails, blunt
ends, junction heads, and junction tails. The arc for a range
can be narrow or bold.
Every label is defined by a name, a starting coordinate, an ending
coordinate, a strand, a color, and a style (tick, block or range). A
range must also include a beginning and ending character to indicate
the type of head and tail to draw. The format of the input file is
described in detail below.
You can set over 30 different parameters for PlasmidMap. Therefore,
the values of the parameters are normally defined in an initializing
file rather than by a series of interactive prompts. All parameters
have defaults, so only parameters that you wish to change need to be
specified in the initializing file.
As with all Wisconsin Sequence Analysis Package(TM) programs, you can
override the values set in the initializing file by using
command-line parameters. Values set on the command line take
precedence over values in the initializing file.
PRIME
Prime analyzes a template DNA sequence and chooses primer pairs for
the polymerase chain reaction (PCR) and primers for DNA sequencing.
For PCR primer pair selection, you can choose a target range of the
template sequence to be amplified. For DNA sequencing primers, you
can specify positions on the template that must be included in the
sequencing.
In selecting appropriate primers, Prime considers a variety of
constraints on the primer and amplified product sequences. You
can either use the program's default constraint values or modify
those values to customize the analysis. You can specify upper and
lower limits for primer and product melting temperatures and for
primer and product GC contents. For primers, you can specify a range
of acceptable primer sizes, any required bases at the 3' end of the
primer (3' clamp), and a maximum difference in primer melting
temperatures for PCR primer pairs. For PCR products, you can specify
a range of acceptable product sizes.
For efficient priming, you should avoid primers with extensive
self-complementarity in order to minimize primer secondary structure
and primer dimer formation. Additionally, in PCR experiments, primer
pairs with extensive complementarity between the two primers should
be avoided in order to minimize primer dimer formation. Prime uses
the annealing test described in the ALGORITHM topic to check
individual primers for self-complementarity and to check the two
primers in a PCR primer pair for complementarity to each other.
Using this same annealing test, Prime optionally can screen against
non-specific primer binding on the template sequence and on any
repeated sequences you specify.
The terms forward primer and reverse primer are used in the remainder
of this document and in the program output. Forward primers are
complementary to sequences on the reverse template strand and create
copies of the forward strand by primer extension. Conversely,
reverse primers are complementary to sequences on the forward
template strand and create copies of the reverse strand by primer
extension.
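The forward/reverse primer convention can be made concrete with a small sketch: the forward primer is read directly off the forward strand at the start of the product, and the reverse primer is the reverse complement of the end of the product. This only illustrates the geometry and GC content; it does not reproduce Prime's selection constraints or its annealing test, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def gc_content(dna):
    """Fraction of bases that are G or C."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna)


def primer_pair(template, start, end, primer_len):
    """Primers flanking template[start:end] (0-based, end exclusive).

    Both primers are written 5' to 3'.
    """
    product = template[start:end].upper()
    forward = product[:primer_len]
    reverse = reverse_complement(product[-primer_len:])
    return forward, reverse


forward, reverse = primer_pair("ATGCCGTTAGGCTAAC", 0, 16, 5)
print(forward, reverse, gc_content(forward))
```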
PILEUP
PileUp creates a multiple sequence alignment using a simplification
of the progressive alignment method of Feng and Doolittle (Journal of
Molecular Evolution 25; 351-360 (1987)). The method used is similar
to the method described by Higgins and Sharp (CABIOS 5; 151-153
(1989)).
The multiple alignment procedure begins with the pairwise alignment
of the two most similar sequences, producing a cluster of two aligned
sequences. This cluster can then be aligned to the next most related
sequence or cluster of aligned sequences. Two clusters of sequences
can be aligned by a simple extension of the pairwise alignment of two
individual sequences. The final alignment is achieved by a series of
progressive, pairwise alignments that include increasingly dissimilar
sequences and clusters, until all sequences have been included in the
final pairwise alignment.
Before alignment, the sequences are first clustered by similarity to
produce a dendrogram, or tree representation of clustering
relationships. It is this dendrogram that directs the order of the
subsequent pairwise alignments. PileUp can plot this dendrogram so
that you can see the order of the pairwise alignments that created
the final alignment.
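How a similarity clustering can dictate the order of pairwise alignments is sketched below. The sketch uses a toy identity measure on pre-aligned, equal-length strings and an average-linkage joining rule. Both are assumptions chosen for brevity; PileUp derives its similarities from real pairwise alignments. The function names are hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions over the shorter length (toy measure)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n


def cluster_similarity(a, b, sim):
    """Average pairwise similarity between the members of two clusters."""
    return sum(sim[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))


def join_order(seqs):
    """Return cluster merges in the order they would be aligned."""
    sim = {
        frozenset((p, q)): identity(seqs[p], seqs[q])
        for p, q in combinations(seqs, 2)
    }
    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        # Join the two clusters that are currently most similar.
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(pair[0], pair[1], sim),
        )
        merges.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGTTTTT", "s4": "TTTTTTTT"}
for a, b in join_order(seqs):
    print(a, "+", b)
```

Each printed merge corresponds to one node of the dendrogram and to one pairwise alignment step, with the most similar pair first.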
As a general rule, PileUp can align up to 500 sequences, with any
single sequence in the final alignment restricted to a maximum length
of 7,000 characters (including gap characters inserted into the
sequence by PileUp to create the alignment). However, if you include
long sequences in the alignment, the number of sequences PileUp can
align decreases. See the RESTRICTIONS topic, below, for a more
complete discussion of sequence number and size limitations.
PRETTY
Pretty prints sequences with their columns aligned and can display a
consensus for the alignment, allowing you to look at relationships
among the sequences. This program can be used for aligned sequences
in an MSF (multiple sequence format) file, or for separate sequences
that have had gaps added to make them all align.
You can change the alignments displayed by Pretty with a text editor.
The output from Pretty can then be separated into individual sequence
files by running Pretty with the command line option -UGLy.
PLOTSIMILARITY
PlotSimilarity calculates the average similarity among all members of
a group of aligned sequences at each position in the alignment, using
a user-specified sliding window of comparison. The window of
comparison is moved along all sequences, one position at a time, and
the average similarity over the entire window is plotted at the
middle position of the window. The average similarity across the
entire alignment is plotted as a dotted line.
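The sliding-window calculation can be sketched as follows. Per-column similarity is taken here as the average pairwise identity, which is an assumption made for simplicity; PlotSimilarity itself scores columns with a comparison table. The function names are hypothetical.

```python
from itertools import combinations


def column_similarity(column):
    """Average pairwise identity within one alignment column."""
    pairs = list(combinations(column, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def windowed_similarity(alignment, window):
    """Return (middle position, window average) pairs and the overall average."""
    per_column = [column_similarity(col) for col in zip(*alignment)]
    curve = []
    for start in range(len(per_column) - window + 1):
        avg = sum(per_column[start:start + window]) / window
        curve.append((start + window // 2, avg))
    overall = sum(per_column) / len(per_column)
    return curve, overall


curve, overall = windowed_similarity(["ACGTAC", "ACGTTT", "ACGAAC"], 3)
for position, value in curve:
    print(position, round(value, 3))
print("overall", round(overall, 3))
```

The curve corresponds to the plotted line and the overall value to the dotted line across the whole alignment.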
If you give PlotSimilarity a single input sequence, you can choose
the range and strand for that sequence, and then PlotSimilarity
prompts you for the name, range, and strand of a second input
sequence. In this way, you can plot the average similarity between
the two aligned sequences created with % gap -OUT.
PROFILEMAKE
There is an essay on profile analysis in the Multiple Sequence
Analysis chapter of the Program Manual.
ProfileMake uses the method of Gribskov, et al. (Proc. Natl. Acad.
Sci. USA 84; 4355-4358 (1987)) to create a profile from a group of
aligned sequences. A profile is a table that contains all of the
comparison information of a group of aligned sequences. These
sequences must be previously aligned (see the RELATED PROGRAMS topic
below) before running ProfileMake. The profile contains as many rows
as there are positions in the aligned sequences. Each row contains a
score for the alignment of the corresponding position of the aligned
sequences with each possible base or residue.
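The row-per-position layout of a profile can be illustrated with a deliberately simplified sketch in which each score is just the fraction of aligned sequences carrying that symbol. This is an assumption for illustration only; the Gribskov method weights positions with a comparison matrix and also stores gap penalties, none of which is reproduced here. The function names are hypothetical.

```python
def make_profile(alignment, alphabet="ACGT"):
    """One row per alignment position, one score per symbol (simple frequency)."""
    n = len(alignment)
    return [
        {sym: column.count(sym) / n for sym in alphabet}
        for column in zip(*alignment)
    ]


def score_against_profile(profile, seq):
    """Sum the profile scores of seq laid over the profile without gaps."""
    return sum(row.get(sym, 0.0) for row, sym in zip(profile, seq))


profile = make_profile(["ACGT", "ACGA", "ACCT"])
for position, row in enumerate(profile, start=1):
    print(position, row)
print(score_against_profile(profile, "ACGT"))
```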
The profile is the input data for ProfileSearch, which can find
sequences in the database similar to your group of aligned sequences,
and ProfileGap, which can make an optimal alignment between the
aligned sequences and another sequence.
The aligned sequences may be specified to ProfileMake with an
ambiguous file expression or in a list file similar to the input for
Pretty or LineUp. (See Chapter 2, Using Sequences in the User's
Guide for more information.)
PROFILESEARCH
There is an essay on profile analysis in the Multiple Sequence
Analysis chapter of the Program Manual.
Using the method of Gribskov, et al. (In Methods in Enzymology, 183;
146-159 (1989)), ProfileSearch accepts a profile from ProfileMake and
uses it to search a database (or any set of sequences you specify)
for sequences that are similar to the aligned probe sequences used to
create the profile. The algorithm calculates the score (quality) of
the optimal alignment between the profile and each sequence in the
database and creates a list of all of the sequences in the database
with an alignment score above some threshold. The results of
ProfileSearch are corrected for compositional effects of the sequence
and for systematic effects of the sequence length on the score. The
output list can be displayed as optimal alignments with
ProfileSegments.
The gap creation and gap extension penalties specified for
ProfileSearch are maximum values. The actual position-specific gap
penalties at any position are determined by multiplying the gap
creation penalty by the percent value in the second to the last
column of the profile, and the gap extension penalty by the percent
value in the last column of the profile.
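The position-specific penalty arithmetic described above amounts to a simple scaling. In the sketch below the maximum penalties and the percent columns are made-up example numbers, and the function name is hypothetical.

```python
def position_gap_penalties(max_gap, max_len, gap_percents, len_percents):
    """Scale the maximum penalties by the per-position percent columns.

    Returns one (gap creation, gap extension) pair per profile position.
    """
    return [
        (max_gap * g / 100, max_len * l / 100)
        for g, l in zip(gap_percents, len_percents)
    ]


# Illustrative values only: two profile positions.
print(position_gap_penalties(4.0, 0.5, [100, 50], [100, 20]))
```

A position with a percent value of 100 is charged the full maximum penalty, while lower values make gaps cheaper at that position.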
ProfileSearch is compute-intensive, so you will probably want to
run it in the batch queue (see the CONSIDERATIONS topic below).
PROFILESEGMENTS
There is an essay on profile analysis in the Multiple Sequence
Analysis chapter of the Program Manual.
ProfileSearch and ProfileSegments use the method of Gribskov, et al.
(Proc. Natl. Acad. Sci. USA 84; 4355-4358 (1987)). ProfileSearch
compares a profile to a set of sequences and lists the sequences with
similarity to the profile. ProfileSegments then makes an optimal
alignment to display the best segment of similarity in each sequence
in the list. ProfileSegments uses the alignment procedure of Smith
and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) to
search for and align the segments. The scoring matrix values, gap
creation penalties, and gap extension penalties used to find the best
region of similarity between the profile and the sequence are all
present in the input file itself and need not be set.
CODONPREFERENCE
CodonPreference finds regions of each reading frame in a DNA sequence
that show either strong codon preference or unusual compositional
bias in the third (wobble) position of each codon. CodonPreference
is useful for locating protein coding regions, determining their
reading frames, estimating the level of expression of a gene, and
locating DNA sequencing errors.
The Preference Curves
The codon preference statistic for each reading frame shows the
similarity of the codon usage in a window of that reading frame
to a previously calculated codon usage table. The statistic
used for the comparison is described below. The codon usage
table is a file of the kind generated by the CodonFrequency
program. A window of a given size is moved along the sequence
in increments of one codon (three bases) and the statistic is
recalculated at every position to make a continuous function.
The statistic for the correct reading frame of a real gene may
rise significantly above the background if a suitable codon
frequency table is used and if the codon usage of the sequence
is strongly biased. Suppress the codon preference curves with
-NOPREFerence.
The Bias Curves
The bias for each reading frame is the fraction of the third
position in each codon that is either G or C. The bias of other
nucleotides may be seen by putting an expression on the command
line like -BIAS=AT. Suppress the bias curves with -NOBIAS.
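The bias statistic can be sketched as the fraction of third codon positions, within a window of codons, that belong to a chosen set of bases. The window is moved one codon at a time. The function name and its parameters are assumptions for this illustration and not CodonPreference's interface.

```python
def third_position_bias(seq, frame, window, bias="GC"):
    """Fraction of third-codon positions in `bias`, per window of codons.

    frame is 0, 1 or 2; the window advances one codon at a time.
    """
    seq = seq.upper()
    thirds = [seq[i + 2] for i in range(frame, len(seq) - 2, 3)]
    flags = [base in bias for base in thirds]
    return [
        sum(flags[i:i + window]) / window
        for i in range(len(flags) - window + 1)
    ]


print(third_position_bias("ATGGCCAAAGGG", frame=0, window=2))
print(third_position_bias("ATGGCCAAAGGG", frame=0, window=2, bias="AT"))
```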
Errors in Sequence Data
A sequencing error that causes a frame shift may make the curves
for the correct reading frame fall at the same time that one of
the incorrect frames rises. See the example below.
The Open Reading Frame Display
Open reading frames are shown as boxes beneath the plot for
their respective translation frames. Potential start codons are
shown as short lines that extend above the top of the box, and
stop codons extend below the bottom of the box. By default,
only the start and stop codons at the ends of open reading
frames are shown in the frame display. This can be altered with
the -ALLFrames command-line option so that all start and stop
codons are shown. Suppress the frame display with -NOFRAMes.
The Rare Codon Display
Rare codon choices in each reading frame are marked below the
open reading frame plot. A codon is considered rare when its
fraction in the codon frequency table falls below the threshold
you set with an expression like -RARe=0.1. Suppress the rare
codon display with -NORARe.
Regions of Known Interest
Regions of known interest can be marked below the x-axis using
the -MARk command line option.
FRAMES
Frames plots open reading frames of a nucleic acid sequence as boxes
bordered by potential start and stop codons. Potential start codons
are shown as short lines that extend above the box and potential stop
codons are shown as short lines that extend below the box. By
default, only the start and stop codons at the ends of open reading
frames are shown in the frame display; if a stop codon has been
passed, no stops are shown again until a start codon is passed; if a
start codon is passed, no start codons are shown again until a stop
codon is passed. You can display all start and stop codons with the
-ALL command-line option. Frames examines all six translation frames
of a sequence.
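The default start/stop bookkeeping (only the first start after a stop, and the first stop after a start, delimit a frame) can be sketched for the three frames of one strand. ATG as the only start codon and the three standard stop codons are assumptions here. The remaining three frames would be obtained by running the same function on the reverse complement. The function name is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def open_reading_frames(seq, min_codons=1):
    """Return (frame, start, end) for ATG...stop ORFs on the given strand.

    Positions are 0-based, end is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i          # first start after the previous stop
            elif codon in STOPS and start is not None:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None       # wait for the next start codon
    return orfs


print(open_reading_frames("CCATGAAATAGCATGCCCTGA"))
```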
If your sequence could have intervening sequences in it, use the
command line option -ALL to get Frames to show all of the possible
start and stop codons in your sequence.
If you know the pattern of codon usage in the system under study,
Frames marks the occurrence of rare codons in each reading frame.
Use the command line option -RARe to have the rare codons marked. If
you want to use a codon frequency file other than ecohigh.cod, you
can specify this file on the command line with an expression like
-RARe=codontablename.cod.
REPEAT
Repeat lets you choose a minimum repeat window and stringency and a
search range and then finds all the repeats of at least that size and
stringency within the search range chosen. The repeats are sorted by
position and displayed in an output file as alignments of those parts
of the sequence that make up the repeats. Repeat tells you the
number of repeats found for your settings of window and stringency
before filing the results. If you feel there are too many repeats,
you may reset the parameters before writing the repeats out to a
file. You can limit the number of repeats shown, or sort the repeats
by quality so that the longest repeats come at the top of the list.
See the ALGORITHM topic below to understand precisely what Repeat
does.
COMPOSITION
Composition measures the composition of one or a group of sequences.
If you specify only one sequence, you can choose a range within the
sequence. Lowercase letters are converted to uppercase and counted
with their uppercase equivalents. If you specify a group of
sequences, Composition displays the name of each sequence as it
finishes the measurement for that sequence.
TERMINATOR
Terminator uses a table of the dinucleotide frequencies for each
position from a set of known terminators to find places in a new
sequence where terminator-like sequences occur. Terminator finds all
discrete examples in the searched sequence where a measurement falls
above some user-defined threshold value. The measurement for each
alignment of the table over the sequence is the sum of the values in
the table for each dinucleotide from the sequence. The method can
also restrict the set of terminator-like sequences shown to those that
fall above some threshold for the presence of a GC-rich dyad symmetry
near the poly-U region.
The method used by Terminator is described in detail in two papers:
Brendel, V. and Trifonov, E. N., Nucl. Acids Res. 12 4411-4427 (1984)
and Brendel, V. and Trifonov, E. N. in CODATA Conference Proceedings,
Jerusalem, 1984. Any use of Terminator that results in publication
should cite these papers.
CODONFREQUENCY
CodonFrequency counts codons and writes their frequencies into codon
frequency tables. It counts the codons from ranges within sequences
or existing codon frequency tables. The output table is a file with
the sum of all the observations for each of the 64 possible codons.
This file is suitable for input to other GCG programs, including
BackTranslate, CodonPreference, and Correspond.
CodonFrequency supports the assembly of fragments from circular
molecules by letting you define a range in the sequence that extends
across the end and into the beginning of a molecule. The terminal
bell rings when a circular range is chosen.
To count codons from sequences, specify ranges until you have
assembled a sequence you want to count. For each range,
CodonFrequency shows you the starting and ending symbols to double
check that you have chosen the range and strand accurately.
After choosing each range, you must decide if you would like to add
another exon to the gene or count the codons in the gene you have
assembled. It is critical that you count the codon frequencies from
multi-exon genes after assembling all the ranges since intervening
sequences often interrupt such genes within a codon, thus destroying
the reading frame.
After CodonFrequency counts all the codons in your gene, you may
specify another gene from the current sequence file or get other
sequence files or codon frequency tables.
You can specify multiple sequences (such as a list file or sequence
specification using an asterisk (*) wildcard) to count codons from
more than one sequence at a time. By default, each sequence in a
multiple sequence specification is treated as a separate gene and
counted separately by CodonFrequency. If you add the -ONEPEPtide
command-line qualifier, then all sequences in a multiple sequence
specification are concatenated together into a single sequence
before counting codons. If you use a list file to specify multiple
sequences, you can add begin, end, and strand sequence attributes to
specify the range and strand for each sequence. For more information
about list files, see "Using List Files (formerly Files of Sequence
Names)" in Chapter 2, Using Sequences in the User's Guide.
After each sequence range is counted or each new codon frequency
table is read, CodonFrequency asks if you want to write the data to a
file. If you choose to do so, the program writes a file with the
number of observations for each codon. In addition, CodonFrequency
normalizes the codon observations to a frequency per thousand and to
a fraction for each codon within its synonymous family.
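
The two normalizations described above can be sketched in a few lines of
Python. This is only an illustration, not the GCG implementation: the
function name codon_table is made up for this example, the standard
genetic code is assumed, and the input is assumed to be an already
assembled, in-frame gene containing only A, C, G, and T (or U).

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_table(gene):
    """Return {codon: (count, per thousand, fraction within synonymous family)}."""
    gene = gene.upper().replace("U", "T")
    usable = len(gene) - len(gene) % 3          # ignore a trailing partial codon
    counts = Counter(gene[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    family = Counter()                          # observations per amino acid
    for codon, n in counts.items():
        family[CODE[codon]] += n
    return {
        codon: (n, 1000.0 * n / total, n / family[CODE[codon]])
        for codon, n in counts.items()
    }

table = codon_table("ATGGCTGCCGCTAAATAA")
for codon, (n, per_thousand, fraction) in sorted(table.items()):
    print(codon, n, round(per_thousand, 1), round(fraction, 2))
```

In this toy gene, GCT is seen twice and GCC once, so GCT accounts for two
thirds of the alanine family.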
MOTIFS
Motifs looks for protein sequence motifs by checking your protein
sequence for every sequence pattern in the PROSITE Dictionary.
Motifs can recognize the patterns with some of the symbols
mismatched, but not with gaps. Currently, Motifs can only search for
patterns in protein sequences.
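
As a rough illustration of what pattern matching against PROSITE involves,
the sketch below converts a PROSITE-style pattern into a Python regular
expression and reports exact matches. It is not the Motifs program: the
function names are hypothetical, only the basic pattern syntax is handled,
and the mismatch option of Motifs is not modeled.

```python
import re

ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)

def prosite_to_regex(pattern):
    """Translate a pattern such as 'N-{P}-[ST]-{P}' into a regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":                      # x = any residue
            core = "."
        elif core.startswith("{"):           # {ABC} = any residue except A, B, C
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:                              # (n) or (n,m) repetition
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)

def find_motif(pattern, protein):
    """Return (1-based position, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(protein)]

print(find_motif("N-{P}-[ST]-{P}", "AANGSA"))
```

The lookahead wrapper is a design choice so that overlapping occurrences
of a pattern are all reported.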
There is a very informative abstract on every motif in the PROSITE
Dictionary. These abstracts are displayed next to any motif found in
your sequence.
The PROSITE Dictionary was written by Dr. Amos Bairoch of the
University of Geneva -- a labor of love.
PEPTIDESORT
PeptideSort cuts a peptide sequence with any or all of the
proteolytic enzymes and reagents listed in the public or local data
file proenzall.dat. The peptides from each digest are sorted by
position, weight, and retention time in a high-pressure liquid
chromatograph at pH 2.1. For each peptide in each sorting, the
following data are displayed: beginning and ending positions,
molecular weight, HPLC retention at pH 2.1, HPLC retention at pH 7.4,
charge, number of aromatic residues, number of acidic residues,
number of basic residues, number of residues containing sulfur,
number of hydrophilic residues, and number of hydrophobic residues.
The content, isoelectric point, and molar extinction coefficient at
280 nm of each peptide are shown with the table of peptides sorted by
position. The content can be displayed in the order of expected
elution from an amino acid analyzer.
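
The digest-and-tabulate idea can be sketched as follows. This is a
simplified illustration, not PeptideSort itself: the cleavage rule is the
commonly quoted trypsin rule (cut after K or R unless the next residue is
P) rather than anything read from proenzall.dat, the residue classes
(including counting H as basic) are assumptions, and weights and
retention times are left out.

```python
import re

def digest(peptide, site=r"(?<=[KR])(?!P)"):
    """Cut a peptide at every zero-width match of `site`; return fragment records."""
    inner = [m.start() for m in re.finditer(site, peptide)
             if 0 < m.start() < len(peptide)]
    cuts = [0] + inner + [len(peptide)]
    records = []
    for begin, end in zip(cuts, cuts[1:]):
        frag = peptide[begin:end]
        records.append({
            "begin": begin + 1,                      # 1-based, inclusive
            "end": end,
            "seq": frag,
            "acidic": sum(frag.count(r) for r in "DE"),
            "basic": sum(frag.count(r) for r in "KRH"),
            "aromatic": sum(frag.count(r) for r in "FWY"),
            "sulfur": sum(frag.count(r) for r in "CM"),
        })
    return records

records = digest("MKRPAGKDEFR")
for rec in sorted(records, key=lambda r: len(r["seq"]), reverse=True):
    print(rec)
```

Sorting the same records with a different key gives the other orderings,
for example by position with key=lambda r: r["begin"].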
PEPTIDEMAP
PeptideMap marks a peptide sequence at every position where a known
proteolytic enzyme or reagent might cut it. You can select one or a
few enzymes or let PeptideMap use the whole list.
PeptideMap is simply the program Map run with the command-line option
-PROGRAMname=PeptideMap. (See the documentation for Map in the
Program Manual for a complete description.)
HELICALWHEEL
HelicalWheel plots a helical wheel representation of a peptide
sequence. Each residue is offset from the preceding one by 100
degrees, the typical angle of rotation for an alpha-helix. The angle
of rotation can be changed using the -ANGle and -BETa command-line
options.
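
The geometry behind the plot is simple enough to sketch. The function
below is a hypothetical helper, not part of the package; it places each
residue on a unit circle, advancing by a fixed angle per residue.

```python
import math

def wheel_positions(peptide, angle=100.0):
    """Return (residue, angle in degrees, x, y) for each residue on a unit circle."""
    positions = []
    for i, residue in enumerate(peptide):
        theta = (i * angle) % 360.0
        rad = math.radians(theta)
        positions.append((residue, theta, math.cos(rad), math.sin(rad)))
    return positions

for residue, theta, x, y in wheel_positions("LKELAEKL"):
    print(residue, theta, round(x, 2), round(y, 2))
```

With a 100 degree offset the pattern repeats every 18 residues, since
18 x 100 = 1800 = 5 x 360 degrees.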
MFOLD
MFold is an adaptation of the mfold package by Zuker and Jaeger that
has been modified to work with the Wisconsin Package(TM). Their
method uses the energy rules developed by Turner and colleagues to
determine optimal and suboptimal secondary structures for an RNA
molecule. (See the ACKNOWLEDGEMENTS topic for references.)
Using energy minimization criteria, any predicted "optimal" secondary
structure for an RNA molecule depends on the model of RNA folding and
the specific folding energies used to calculate that structure.
Different optimal foldings may be calculated if the folding energies
are changed even slightly. Because of uncertainties in the folding
model and the folding energies, the "correct" folding may not be the
"optimal" folding determined by the program. You may therefore want
to view many optimal and suboptimal structures within a few percent
of the minimum energy. You can use the variation among these
structures to determine which regions of the secondary structure you
can predict reliably. For instance, a region of the RNA molecule
containing the same helix in most calculated optimal and suboptimal
secondary structures may be more reliably predicted than other
regions with greater variation.
MFold calculates energy matrices that determine all optimal and
suboptimal secondary structures for an RNA molecule. The program
writes these energy matrices to an output file. A companion program,
PlotFold, reads this output file and displays a representative set of
optimal and suboptimal secondary structures for the RNA molecule,
within any increment of the computed minimum free energy you choose.
You can choose any of several different graphic representations for
displaying the secondary structures in PlotFold.
PLOTFOLD
MFold uses the method of Zuker (see the MFold entry in this manual
for more information) to determine optimal and suboptimal secondary
structures for an RNA molecule. MFold writes an output file
containing energy matrices that determine all optimal and suboptimal
foldings for the RNA molecule. PlotFold reads this output file and
displays representations of the optimal and suboptimal foldings.
PlotFold allows you to choose from among eight different
representations of the optimal and suboptimal foldings determined by
MFold. Each of these options allows you to specify an energy
increment, which is the highest deviation in kcal/mole from the
computed free energy minimum that a structure can have in order to be
plotted. For example, if the predicted optimal folding has a
computed free energy of -114.5 kcal/mole, and you select a
5.7 kcal/mole energy increment, then all plotted structures must have
calculated free energies no greater than -108.8 kcal/mole
(-114.5 + 5.7 = -108.8).
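
The arithmetic of the energy increment can be written out as a short
sketch. The list of energies is invented for illustration, and the
function name is hypothetical.

```python
def within_increment(energies, increment):
    """Return the cutoff and the energies (kcal/mole) no greater than it."""
    cutoff = min(energies) + increment
    return cutoff, [e for e in energies if e <= cutoff]

cutoff, kept = within_increment([-114.5, -112.0, -109.1, -108.0, -101.3], 5.7)
print(round(cutoff, 1), kept)
```

Structures at -108.0 and -101.3 kcal/mole lie above the cutoff and would
not be plotted.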
The first two folding representations, the energy dotplot and the
p-num plot, plot base pairing information from all secondary
structures that have free energies within the energy increment you
specify. You can use these displays to determine which regions of
the secondary structure prediction are well defined (see the example
sessions below).
Energy Dotplot (menu option A)
The energy dotplot indicates all of the base pairs involved in
all optimal and suboptimal secondary structures within the
energy increment you specify. The plot takes the form of a
two-dimensional graph where both axes of the graph represent the
same RNA sequence. Each point drawn in the graph indicates a
base pair between the ribonucleotides whose positions in the
sequence are the coordinates of that point on the graph.
P-Num Plot (menu option B)
The p-num plot graphs the amount of variability in pairing at
each position in the RNA molecule. For each position of the
sequence along the horizontal axis, the height of the plot
indicates how many different pairing partners the program finds
in all predicted foldings within the energy increment you
specify.
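
A minimal sketch of the p-num calculation, assuming each folding is
given as a collection of (i, j) base pairs with 1-based positions. The
data layout and function name are assumptions made for this example;
PlotFold itself works from the MFold energy matrices.

```python
from collections import defaultdict

def p_num(length, foldings):
    """Count the distinct pairing partners of each position over all foldings."""
    partners = defaultdict(set)
    for pairs in foldings:
        for i, j in pairs:
            partners[i].add(j)
            partners[j].add(i)
    return [len(partners[pos]) for pos in range(1, length + 1)]

foldings = [
    [(1, 8), (2, 7)],   # folding 1
    [(1, 8), (2, 6)],   # folding 2: position 2 switches partner
]
print(p_num(8, foldings))
```

Position 2 pairs with 7 in one folding and with 6 in the other, so its
p-num is 2; positions with a value of 0 or 1 are the well-defined ones.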
The remaining six PlotFold options plot a sampling of specific
secondary structures that have calculated free energies within the
energy increment you specify. These remaining options all display
the same information, but in different forms. Rather than attempting
to plot each secondary structure whose computed free energy falls
within the energy increment, PlotFold selects representative foldings
that are sufficiently different from each other and are still within
the specified energy increment. You can specify how different each
folding must be from the others in response to the window size
program prompt. To understand the concept of a window size, we must
first define the idea of a distance between base pairs in different
foldings. The distance between two base pairs, r(i)-r(j) and
r(i')-r(j') is the greater of |i - i'| and |j - j'|. Each listed
folding must have at least a window size number of base pairs that
are greater than window size distance from the base pairs in any
other listed folding.
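
The distance definition and the window size test can be sketched as
follows. The second function is one possible reading of the rule stated
above (count the base pairs of a candidate folding that are farther than
the window size from every base pair of the foldings already listed); it
is not taken from the PlotFold source, and both function names are
hypothetical.

```python
def pair_distance(p, q):
    """Distance between base pairs (i, j) and (i2, j2): max(|i - i2|, |j - j2|)."""
    (i, j), (i2, j2) = p, q
    return max(abs(i - i2), abs(j - j2))

def is_sufficiently_different(candidate, listed, window):
    """True if at least `window` pairs of `candidate` are farther than
    `window` from every base pair in the already listed foldings."""
    others = [q for folding in listed for q in folding]
    novel = sum(
        1 for p in candidate
        if all(pair_distance(p, q) > window for q in others)
    )
    return novel >= window

listed = [[(10, 20), (11, 19)]]
print(pair_distance((3, 20), (5, 19)))
print(is_sufficiently_different([(1, 30), (2, 29), (3, 28)], listed, 2))
print(is_sufficiently_different([(10, 21), (11, 20)], listed, 2))
```

A larger window size therefore yields fewer, more distinct foldings.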
Each of these six PlotFold options also lets you select the number
of structures to plot. Since the number of structures that meet both
the energy increment and window size criteria may be less than your
selection, fewer secondary structures may actually be plotted.
Circles Plot (menu option C)
The circles plot makes a circular Nussinov plot of an RNA
secondary structure. The circular graph represents the sequence
as a segment of a circle. You can set the radius and the
angular width of one base so that plots of different secondary
structures are strictly comparable. Arcs or chords connect
paired bases; hairpin, bulge, interior, and multibranched loops
are easily seen.
Domes Plot (menu option D)
The domes plot represents a folded RNA sequence as a line with
elliptical arcs connecting paired bases. This representation
has the property that the length of the arc is proportional to
the distance (along the primary structure) between the bases;
all loops are easily seen.
Mountains Plot (menu option E)
The mountains plot makes a graph that looks like a mountain
range. Horizontal striations upon a particular peak are bonds
between paired bases, and vertical links between the horizontal
striations represent stems.
Squiggles Plot (menu option F)
The squiggles plot is a representation similar to what you might
draw by hand; that is, bonds formed between bases are drawn as
chords. Bases are shown participating in stems, as well as in
hairpin, bulge, interior, and multibranched loops.
Text Output (menu option G)
The text output representation of the RNA secondary structure is
similar to the squiggles plot, but you don't need a graphics
device to see it. The output is written into a text file that
you can view with any text editor. If you exclude a region of
the RNA molecule from folding in MFold with either the
-CLOSedexcise or -OPENexcise option, you can view the predicted
secondary structures only as text output; do not use any of the
graphic plotting options of PlotFold to display the results.
Connect File Output (menu option H)
The Connect file is a base-by-base text output file of optimal
and suboptimal RNA secondary structures. This file can be used
as input to several programs that produce graphical
representations of RNA secondary structures. In the Wisconsin
Package(TM), the Squiggles, Circles, Domes, and Mountains
programs can read the Connect file output of MFold, but they display
only the optimal secondary structure. Other
publicly available RNA secondary structure display programs may
be able to display all optimal and suboptimal secondary
structures listed in the Connect file.
The examples below demonstrate three different PlotFold options for
the secondary structure representations.
FOLDRNA
FoldRNA finds a secondary structure of minimum free energy for an RNA
molecule based on published values of stacking and loop destabilizing
energies. FoldRNA is the program of Michael Zuker (Methods in
Enzymology, 180, 262-288 (1989)). The energies used by Zuker's program
were first described by Winston Salser (CSHSQB 42; 985) and are now
defined by Turner (Freier et al., Proc. Natl. Acad. Sci. USA 83:
9373-9377 (1986)).
You should be aware of the limitations of energy minimizing
algorithms in predicting real secondary structures. The structure
reported in the output file is only one of a family of structures
that have the same or nearly the same energy. The number of
structures that have similar energies to the optimal structure
reported by FoldRNA may be very large when several hundred bases are
folded or when the secondary structure is not strong.
SQUIGGLES
The program FoldRNA calculates an optimal RNA secondary structure and
writes a base-by-base output file with the sequence name and the file
name extension .connect. The Circles, Domes, DotPlot, Mountains, and
Squiggles programs accept the FoldRNA output file and graph the RNA
secondary structure. Circles, Domes, DotPlot, and Mountains make
abstract representations, but the graphs they draw are easier to
compare if you are looking for secondary structure motifs.
Squiggles creates a representation similar to what you might draw by
hand; that is, bonds formed between bases are drawn as chords. The
entire structure looks like an airport or intersecting railroad
tracks. The spider-like graph plotted by Squiggles represents the
sequence as bases connected by chords. Bases are shown participating
in stems, as well as in hairpin, bulge, interior and bifurcation
loops. These structures are easily seen in Squiggles.
To make your graph more readable, you can label the bases, show
sequence numbers at set intervals, and pivot up to nine stems. You
can also change the character height for the number and base labels.
STEMLOOP
StemLoop searches for inverted repeats in your sequence after you
choose a minimum stem length and minimum and maximum loop sizes. You
must also specify a minimum number of bonds per stem with G-T, A-T,
and G-C scored as 1, 2, and 3 bonds, respectively. The stems found
can be sorted by position, size (stem length), or quality (number of
bonds) and can be either filed or displayed on the screen. StemLoop
tells you the number of stems found for your settings of minimum stem
size, maximum loop size, minimum loop size, and minimum bonds per
stem. If you feel there are too many stems, you may reset the
parameters without reviewing the stems found or view only the best
stems found. To view only the best stems, there must be more than 25
stems found and you must sort them by quality or size. (See the
ALGORITHM topic below to understand precisely what StemLoop does.)
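
To make the parameters concrete, here is a naive inverted-repeat search
using the bond scores given above (G-T = 1, A-T = 2, G-C = 3). It is a
sketch only and not StemLoop's algorithm: it looks for stems of one
fixed length, allows no mismatches, and the function name and return
format are invented for this example.

```python
BONDS = {("G", "C"): 3, ("C", "G"): 3, ("A", "T"): 2, ("T", "A"): 2,
         ("G", "T"): 1, ("T", "G"): 1}

def find_stems(seq, stem_len, min_loop, max_loop, min_bonds):
    """Return (1-based start, stem length, loop size, bonds) for each stem found."""
    seq = seq.upper()
    hits = []
    for start in range(len(seq)):
        for loop in range(min_loop, max_loop + 1):
            right_end = start + 2 * stem_len + loop   # one past the 3' arm
            if right_end > len(seq):
                break
            bonds = 0
            for k in range(stem_len):
                pair = (seq[start + k], seq[right_end - 1 - k])
                if pair not in BONDS:
                    break                             # bases cannot pair
                bonds += BONDS[pair]
            else:
                if bonds >= min_bonds:
                    hits.append((start + 1, stem_len, loop, bonds))
    return hits

hits = find_stems("GGGCAAAAGCCC", stem_len=4, min_loop=3, max_loop=5, min_bonds=10)
print(sorted(hits, key=lambda h: h[3], reverse=True))   # sort by quality
```

The example sequence contains a single four-base-pair stem of G-C pairs
around a four-base loop, giving 12 bonds.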
REFORMAT
Reformat rewrites sequence files to make them usable by the Wisconsin
Sequence Analysis Package(TM) or to alter their appearance. The
following are some of the manipulations that Reformat can perform:
- converting sequence files that were prepared or edited with a text
editor or transferred to your computer from another computer
into GCG format.
- converting between multiple sequence file (MSF) format and
individual sequences in GCG format.
- correcting the sequence type (protein or nucleic acid) of sequence
files that have no type or that were incorrectly typed when they
were created.
- converting nucleic acid sequences between DNA (T, t) and RNA (U,
u) representations.
- converting peptide sequences between one-letter and three-letter
amino acid representations.
- converting sequences to all uppercase or all lowercase characters.
- removing gap characters from sequence files.
Reformat can also be used to rewrite into GCG format MSF files that
you've edited with a text editor.
In order to use Reformat on sequence files, the files must contain a
heading, a dividing line, and a sequence, as described below. You
can use a text editor to make your "foreign" sequence files conform
to this arrangement.
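
Three of the conversions listed above are easy to picture as plain
string operations. The sketch below works on bare sequence strings, not
on GCG files; the function names, the set of gap characters, and the use
of Xaa for unknown residues are assumptions made for illustration.

```python
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

def dna_to_rna(seq):
    """Replace T/t with U/u, preserving case."""
    return seq.translate(str.maketrans("Tt", "Uu"))

def rna_to_dna(seq):
    """Replace U/u with T/t, preserving case."""
    return seq.translate(str.maketrans("Uu", "Tt"))

def to_three_letter(peptide):
    """One-letter to three-letter amino acid codes; unknown symbols become Xaa."""
    return "".join(THREE_LETTER.get(r.upper(), "Xaa") for r in peptide)

def remove_gaps(seq, gap_chars=".-~"):
    """Drop gap characters (the default set is an assumption)."""
    return "".join(ch for ch in seq if ch not in gap_chars)

print(dna_to_rna("ATGt"), to_three_letter("MKV"), remove_gaps("AC..GT"))
```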
FROMEMBL
Use FromEMBL when you want to use sequences in EMBL's distribution
format with the Wisconsin Sequence Analysis Package(TM). Since EMBL
maintains many sequences in one file, FromEMBL must write many output
files, one for each sequence in the EMBL file. Each output file is
named according to the identifier word on the ID line at the
beginning of each sequence entry. All documentation from the EMBL
input files is preserved in the GCG output files. The nucleic acid
ambiguity codes are preserved except that the hyphen (-) symbol in
the EMBL sequences is changed to N in the GCG files.
FROMGENBANK
Use FromGenBank when you want to use sequences in the GenBank flat
file distribution format with the Wisconsin Sequence Analysis
Package(TM). Since GenBank maintains many sequences in one file,
FromGenBank must write many output files, one for each sequence in
the GenBank file. Each output file is named according to the
identifier word on the LOCUS line at the beginning of each sequence
entry. All the documentation from the GenBank input files is
preserved in the GCG output files.
FROMFASTA
Use FromFastA when you want to convert sequences that are in FastA
format into a format suitable for use with programs in the Wisconsin
Sequence Analysis Package(TM). FastA format may maintain many
sequences in one file; in such a case FromFastA writes many output
files, one for each sequence in the FastA file. Each output file is
named according to the first word (following the > character) on the
documentation line just above the sequence data in the FastA file. The
documentation line from the FastA input file(s) is preserved in the
GCG output file(s). FromFastA can convert sets of FastA sequence
files that are specified with a file of filenames or with multiple
file specification syntax.
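
The splitting and naming rule can be sketched as below. The function
only separates the records and derives a name from the first word after
the > character; it does not write GCG-format files, and the name
split_fasta and the fallback name for an empty documentation line are
made up for this example.

```python
def split_fasta(text):
    """Return {name: (documentation line, sequence)} for each FastA record."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            doc = line[1:].strip()
            # The first word of the documentation line names the record.
            name = doc.split()[0] if doc else f"unnamed{len(records) + 1}"
            records[name] = (doc, [])
        elif name is not None:
            records[name][1].append(line)
    return {n: (doc, "".join(parts)) for n, (doc, parts) in records.items()}

example = ">seq1 first test\nACGT\nACGT\n>seq2\nGGCC\n"
for name, (doc, seq) in split_fasta(example).items():
    print(name, "|", doc, "|", seq)
```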
TOSTADEN
Any sequence file in GCG format can be converted with ToStaden into a
format suitable for use in the Staden programs. If the sequence is a
nucleic acid sequence, the compatible ambiguity codes are converted
from the IUB-IUPAC versions to Staden's versions. You can see how
they are converted in Appendix III or in the example below.
TOFASTA
Sequences in GCG format can be converted into a format suitable for
use by programs that require sequences in FastA format. ToFastA
accepts one or more GCG sequences as input and by default creates one
output file containing all the sequences in FastA format. However,
NCBI's BLAST family of programs accepts only one sequence per input
file. Therefore, if you put -BLAst on the command line, ToFastA
writes your output into separate files, naming each output file with
the input sequence's name and the filename extension .tfa.
TRANSLATE
Translate creates a peptide sequence by translating nucleic acid
sequences that you specify. In addition to translating a single
range of a given nucleotide sequence, it can concatenate exons into a
single assembly for translation. The exons can be of any length, can
come from either strand, and can come from more than one sequence
file. Unlike most Wisconsin Package(TM) programs, Translate lets you
specify ranges that extend across the end and into the beginning of a
sequence. The terminal bell rings when a circular range is chosen.
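
The assembly of ranges before translation, including a range that wraps
around the end of a circular molecule, can be sketched as follows. This
is an illustration under stated assumptions, not the Translate program:
the standard genetic code is used, a range with begin greater than end
is taken to mean a circular range, and the function names and the
reverse flag are invented for this example.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def fetch_range(seq, begin, end, reverse=False):
    """1-based inclusive range; begin > end wraps across the end of the sequence."""
    if begin <= end:
        piece = seq[begin - 1:end]
    else:
        piece = seq[begin - 1:] + seq[:end]
    if reverse:
        piece = piece.translate(COMPLEMENT)[::-1]
    return piece

def translate_exons(seq, ranges):
    """Concatenate the ranges first, then translate the assembly in frame."""
    gene = "".join(fetch_range(seq, *r) for r in ranges)
    return "".join(CODE.get(gene[i:i + 3], "X") for i in range(0, len(gene) - 2, 3))

# Two exons interrupted within a codon: ATGGC | intron | TTAA
print(translate_exons("ATGGCGTAAGTTTTTAGTTAA", [(1, 5), (18, 21)]))
# A circular range running from base 10 across the end to base 6
print(translate_exons("GCTTAAGGGATG", [(10, 6)]))
```

Both calls print MA*. Translating each exon separately in the first
example would lose the codon GCT, which is split by the intron.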
Translate can be run either interactively or noninteractively. When
you specify a single sequence to translate and -Default is not on the
command line, it works interactively, prompting you for each segment
to translate. To run Translate noninteractively, either use the
-Default command-line switch or supply a multiple file specification
for the input file by means of a wild card file specification or a
list file. (See the INPUT FILE topic below for more detailed
information.)
Translate supports the IUB-IUPAC character set for the representation
of nucleotide ambiguity. See Appendix III for a list of the IUB
codes and their meanings.
BACKTRANSLATE
BackTranslate uses a peptide sequence and a codon usage file to make
a table of possible backtranslations. The program can write all of
the codons for each amino acid in the peptide sequence, showing the
frequency of each within its synonymous group.
BackTranslate also calculates either the most probable
back-translation or the fully ambiguous depiction of the nucleotide
sequence. In either case, it writes the implied sequence such that
the output file can be used as the input file for other sequence
mapping or comparison programs.
In the table of back-translation options, the codons are written in
order of their preference in the codon frequency table. Below each
block of synonymous codons, there is a number between 0 and 1,000; it
is the product of the probabilities for the most likely codon for the
next four amino acids multiplied by 1,000. The higher the number,
the more likely it is that the next 12 nucleotides contain the
most-preferred codons. All of the codons and their preferences are
included to help you look critically at the alternative
oligonucleotides that you might want to synthesize. For instance, if
for one amino acid the codon preference is not strong, you could
consider making a mixture that contains all of the different
possibilities.
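
The number written below each block of codons can be reproduced with a
short calculation. The sketch below is not the BackTranslate code: the
codon fractions are hypothetical, the four-residue window is assumed to
start at the current amino acid, and the function names are invented.

```python
# Hypothetical fractions of each codon within its synonymous family.
usage = {
    "M": {"ATG": 1.0},
    "W": {"TGG": 1.0},
    "K": {"AAA": 0.8, "AAG": 0.2},
    "F": {"TTC": 0.6, "TTT": 0.4},
    "D": {"GAT": 0.7, "GAC": 0.3},
}

def most_probable(peptide, usage):
    """Back-translate using the most frequent codon for every amino acid."""
    return "".join(max(usage[aa], key=usage[aa].get) for aa in peptide)

def window_scores(peptide, usage, width=4):
    """Product of the best codon fractions over `width` residues, times 1,000."""
    best = [max(usage[aa].values()) for aa in peptide]
    scores = []
    for i in range(len(peptide) - width + 1):
        p = 1.0
        for fraction in best[i:i + width]:
            p *= fraction
        scores.append(round(1000 * p))
    return scores

print(most_probable("MWKFD", usage))
print(window_scores("MWKFD", usage))
```

The window starting at M scores 480 (1.0 x 1.0 x 0.8 x 0.6 x 1,000), so
an oligonucleotide designed there is more likely to match the real gene
than one starting at W, which scores 336.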
PUBLISH
Publish creates a text file that you can customize for publication.
You can choose lines that represent the sequence, its complement, a
decimal scale, a three-letter translation, a one-letter translation,
a numbering line with numbers at every twentieth symbol, a blank
line, or a tagged line. Additional sequences can be shown either
completely or with only the differences marked. A match line between
any two sequence lines marks the matches. The lines can appear in
any order. Publish can number each line starting from any number you
choose. The line is numbered at both ends if you select the line
from the menu by typing its menu letter in uppercase. Each type of
line is described in detail below. The output can be blocked using
two parameters: the block size, and the number of blocks per line.
You must choose the ranges of translation for each translation
line selected. You can translate as many non-overlapping ranges as
you want for each translation line you select. If you have
overlapping ranges of translation, you must select two translation
lines and set the ranges appropriately.
Acknowledgement
A few pieces of this document are taken from the very good manual
written by Hans Mehlin, Karolinska Institute, Stockholm, which is
available at http://kisac.cmb.ki.se/senn/files.html.