INTRODUCTION TO MOLECULAR BIOLOGY

DATABASES AND SEQUENCE ANALYSIS

9-13 March 1998

Tore Samuelsson

Göteborg University

Dept of Medical Biochemistry

The flow of genetic information and bioinformatics

DNA -> RNA -> protein -> conformation

Databases in bioinformatics

- Structure

PDB (Protein Data Bank, Brookhaven Natl Lab, www.pdb.bnl.gov)

X-ray crystallography

NMR

modeling

KLOTHO (small molecules, www.ibc.wustl.edu/klotho/)

- Sequence

DNA

Genbank (www.ncbi.nlm.nih.gov)

- Homology search: /BLAST/

- Entrez : /Entrez/

GSDB (Genome sequence database)

EMBL (European Molecular Biology Laboratory, www.ebi.ac.uk)

- SRS : srs.ebi.ac.uk:5000

DDBJ (DNA Data Bank of Japan)

Protein

Swissprot (www.ebi.ac.uk)

- Genome

GDB (Human Genome Data Base, gdbwww.gdb.org)

Mouse genome database (www.informatics.jax.org)

Yeast genome (genome-ftp.stanford.edu/Saccharomyces)

Bacterial genomes (www.tigr.org)

- Genetic disorders

OMIM (Online Mendelian Inheritance in Man, www.ncbi.nlm.nih.gov)

- Taxonomy (www.ncbi.nlm.nih.gov)

- Prosite (expasy.hcuge.ch)

- Literature

Medline (www.ncbi.nlm.nih.gov)

Structure databases.

The Brookhaven Protein Data Bank

PDB: Number of Entries Deposited and Released by Year
as of 02/11/98

Year	Deposited	Released
1973	10	10
1974	7	2
1975	18	21
1976	47	27
1977	27	38
1978	26	26
1979	32	31
1980	20	30
1981	39	26
1982	59	47
1983	27	50
1984	36	29
1985	27	30
1986	28	29
1987	64	28
1988	129	79
1989	192	89
1990	306	164
1991	512	205
1992	635	226
1993	922	849
1994	1111	1392
1995	1221	1002
1996	1448	1224
1997	1848	1640
1998	207	262

NOTE: Computing the totals for number of entries released will produce a sum greater than what is currently available from PDB. This is due to entries being withdrawn and/or replaced.
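The NOTE above can be checked directly from the table. The following is a minimal Python sketch (the names `PDB_BY_YEAR`, `totals` and `HOLDINGS` are our own and not part of any PDB tool). It sums the yearly figures and compares the released total with the 7163 released coordinate entries given in the holdings list of February 25, 1998, which is dated two weeks later than the table, so the comparison is only approximate.

```python
# Yearly PDB figures copied from the table above: (year, deposited, released).
PDB_BY_YEAR = [
    (1973, 10, 10), (1974, 7, 2), (1975, 18, 21), (1976, 47, 27),
    (1977, 27, 38), (1978, 26, 26), (1979, 32, 31), (1980, 20, 30),
    (1981, 39, 26), (1982, 59, 47), (1983, 27, 50), (1984, 36, 29),
    (1985, 27, 30), (1986, 28, 29), (1987, 64, 28), (1988, 129, 79),
    (1989, 192, 89), (1990, 306, 164), (1991, 512, 205), (1992, 635, 226),
    (1993, 922, 849), (1994, 1111, 1392), (1995, 1221, 1002),
    (1996, 1448, 1224), (1997, 1848, 1640), (1998, 207, 262),
]

# Released atomic coordinate entries from the holdings list (25 Feb 1998).
HOLDINGS = 7163


def totals(rows):
    """Return (total deposited, total released) summed over all years."""
    deposited = sum(d for _, d, _ in rows)
    released = sum(r for _, _, r in rows)
    return deposited, released


deposited, released = totals(PDB_BY_YEAR)
surplus = released - HOLDINGS  # entries withdrawn and/or replaced, roughly
print(deposited, released, surplus)
```

The released total exceeds the holdings figure by a few hundred entries, which is the effect the NOTE describes.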

Number of Entries Deposited (Bar) and Average Time to Release (Line)
Accumulated and Averaged on a Quarterly Basis

Bar Graph - Number of Entries in the Following Categories:

· OnHold - (red) On-hold per depositor request

· Processing - (green) Being processed

· Released - (blue) Released

Line Graph

AVERAGE - Average number of days to release

The data were accumulated and averaged on a quarterly basis. The average turnaround times for entries now being processed are estimated from the average of the last 12 months.

Data for the last quarter are accumulated until the date specified on the graph.


PDB Holdings List

Entries loaded on February 25, 1998

7163 Released Atomic Coordinate Entries

1736 Structure Factor Files
400 NMR Restraint Files

Molecule Type

6339 proteins, peptides, and viruses
284 protein/nucleic acid complexes
528 nucleic acids
12 carbohydrates
0 others

Experimental Technique

5850 diffraction and other
1133 NMR
180 theoretical modeling

Count By Experiment

Diffraction

5261 proteins,peptides, and viruses
346 nucleic acids
233 protein/nucleic acid complexes
10 carbohydrates
0 others

NMR

922 proteins,peptides, and viruses
167 nucleic acids
42 protein/nucleic acid complexes
2 carbohydrates
0 others

Model

156 proteins,peptides, and viruses
15 nucleic acids
9 protein/nucleic acid complexes
0 carbohydrates
0 others
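The three views of the holdings (by molecule type, by experimental technique, and by both) describe the same 7163 entries, so they should agree. A small Python sketch, with category labels shortened by us for readability, that rebuilds the two summary lists from the "Count By Experiment" figures:

```python
# Counts by experiment and molecule type, copied from the holdings list above.
# The dictionary keys are our own shorthand for the categories in the list.
COUNTS = {
    "diffraction": {"protein": 5261, "nucleic acid": 346,
                    "complex": 233, "carbohydrate": 10, "other": 0},
    "nmr": {"protein": 922, "nucleic acid": 167,
            "complex": 42, "carbohydrate": 2, "other": 0},
    "model": {"protein": 156, "nucleic acid": 15,
              "complex": 9, "carbohydrate": 0, "other": 0},
}


def by_technique(counts):
    """Total number of entries for each experimental technique."""
    return {tech: sum(row.values()) for tech, row in counts.items()}


def by_molecule(counts):
    """Total number of entries for each molecule type, over all techniques."""
    result = {}
    for row in counts.values():
        for molecule, n in row.items():
            result[molecule] = result.get(molecule, 0) + n
    return result


print(by_technique(COUNTS))
print(by_molecule(COUNTS))
print(sum(by_technique(COUNTS).values()))
```

Both summaries reproduce the "Experimental Technique" and "Molecule Type" lists, and both add up to the number of released coordinate entries.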

Example of PDB entry

HEADER HORMONE 30-OCT-92 1BPH 1BPH 2

COMPND INSULIN (CUBIC) IN 0.1M SODIUM SALT SOLUTION AT PH9 1BPH 3

SOURCE BOVINE (BOS $TAURUS) PANCREAS 1BPH 4

AUTHOR O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR 1BPH 5

REVDAT 2 31-OCT-93 1BPHA 1 REMARK HET FORMUL 1BPHA 1

REVDAT 1 15-JAN-93 1BPH 0 1BPH 6

JRNL AUTH O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR 1BPH 7

JRNL TITL CONFORMATIONAL CHANGES IN CUBIC INSULIN CRYSTALS 1BPH 8

JRNL TITL 2 IN THE PH RANGE 7-11 1BPH 9

JRNL REF BIOPHYS.J. V. 63 1210 1992 1BPH 10

JRNL REFN ASTM BIOJAU US ISSN 0006-3495 030 1BPH 11

REMARK 1 1BPH 12

REMARK 1 REFERENCE 1 1BPH 13

REMARK 1 AUTH O.GURSKY,Y.LI,J.BADGER,D.L.D.CASPAR 1BPH 14

REMARK 1 TITL MONOVALENT CATION BINDING IN CUBIC INSULIN 1BPH 15

REMARK 1 TITL 2 CRYSTALS 1BPH 16

REMARK 1 REF BIOPHYS.J. V. 61 604 1992 1BPH 17

REMARK 1 REFN ASTM BIOJAU US ISSN 0006-3495 030 1BPH 18

REMARK 1 REFERENCE 2 1BPH 19

REMARK 1 AUTH J.BADGER 1BPH 20

REMARK 1 TITL FLEXIBILITY IN CRYSTALLINE INSULINS 1BPH 21

REMARK 1 REF BIOPHYS.J. V. 61 816 1992 1BPH 22

REMARK 1 REFN ASTM BIOJAU US ISSN 0006-3495 030 1BPH 23

REMARK 1 REFERENCE 3 1BPHA 2

REMARK 1 AUTH J.BADGER,M.R.HARRIS,C.D.REYNOLDS,A.C.EVANS, 1BPH 25

REMARK 1 AUTH 2 E.J.DODSON,G.G.DODSON,A.C.T.NORTH 1BPH 26

REMARK 1 TITL STRUCTURE OF THE PIG INSULIN DIMER IN THE CUBIC 1BPH 27

REMARK 1 TITL 2 CRYSTAL 1BPH 28

REMARK 1 REF ACTA CRYSTALLOGR.,SECT.B V. 47 127 1991 1BPH 29

REMARK 1 REFN ASTM ASBSDK DK ISSN 0108-7681 622 1BPH 30

REMARK 1 REFERENCE 4 1BPHA 3

REMARK 1 AUTH J.BADGER,D.L.D.CASPAR 1BPH 32

REMARK 1 TITL WATER STRUCTURE IN CUBIC INSULIN CRYSTALS 1BPH 33

REMARK 1 REF PROC.NAT.ACAD.SCI.USA V. 88 622 1991 1BPH 34

REMARK 1 REFN ASTM PNASA6 US ISSN 0027-8424 040 1BPH 35

REMARK 1 REFERENCE 5 1BPHA 4

REMARK 1 AUTH E.J.DODSON,G.G.DODSON,A.LEWITOVA,M.SABESAN 1BPH 37

REMARK 1 TITL ZINC-FREE CUBIC PIG INSULIN: CRYSTALLIZATION AND 1BPH 38

REMARK 1 TITL 2 STRUCTURE DETERMINATION 1BPH 39

REMARK 1 REF J.MOL.BIOL. V. 125 387 1978 1BPH 40

REMARK 1 REFN ASTM JMOBAK UK ISSN 0022-2836 070 1BPH 41

REMARK 2 1BPH 42

REMARK 2 RESOLUTION. 2.0 ANGSTROMS. 1BPH 43

REMARK 3 1BPH 44

REMARK 3 REFINEMENT. 1BPH 45

REMARK 3 PROGRAM PROLSQ 1BPH 46

REMARK 3 AUTHORS HENDRICKSON AND KONNERT 1BPH 47

REMARK 3 R VALUE 0.160 1BPH 48

REMARK 3 RMSD BOND DISTANCES 0.014 ANGSTROMS 1BPH 49

REMARK 3 RMSD BOND ANGLE DISTANCES 0.043 ANGSTROMS 1BPH 50

REMARK 4 1BPH 51

REMARK 4 THIS CRYSTAL FORM CONTAINS ONE INSULIN MOLECULE PER 1BPH 52

REMARK 4 ASYMMETRIC UNIT. THE SOLVENT VOLUME IS 64 PERCENT OF THE 1BPH 53

REMARK 4 CRYSTAL VOLUME. THERE ARE MANY ALTERED SIDE CHAIN TORSION 1BPH 54

REMARK 4 ANGLES AND MAIN CHAIN DISPLACEMENTS IN THE CUBIC CRYSTAL 1BPH 55

REMARK 4 STRUCTURE COMPARED TO OTHER INSULIN CRYSTAL FORMS. ABOUT 1BPH 56

REMARK 4 30 PER CENT OF THE AMINO ACID RESIDUES CAN ADOPT MULTIPLE 1BPH 57

REMARK 4 CONFORMATIONS WHICH WERE RELIABLY IDENTIFIED BY COMPARISON 1BPH 58

REMARK 4 OF THE DATA SETS COLLECTED FROM THE CRYSTALS IN THE PH 1BPH 59

REMARK 4 RANGE 7 - 11. THE WEIGHTS OF MANY OF SUCH MULTIPLE PROTEIN 1BPH 60

REMARK 4 AND SOLVENT CONFORMATIONS DEPEND ON SOLVENT IONIC 1BPH 61

REMARK 4 CONDITIONS (PH AND SALT CONCENTRATION). 1BPH 62

REMARK 5 1BPH 63

REMARK 5 THERE ARE FOUR RELATED ENTRIES: 1BPH 64

REMARK 5 1APH 0.1M SODIUM SALT SOLUTION AT PH 7 1BPH 65

REMARK 5 1BPH 0.1M SODIUM SALT SOLUTION AT PH 9 1BPH 66

REMARK 5 1CPH 0.1M SODIUM SALT SOLUTION AT PH 10 1BPH 67

REMARK 5 1DPH 1.0M SODIUM SALT SOLUTION AT PH 11 1BPH 68

REMARK 6 1BPH 69

REMARK 6 IN 1BPH AND 1CPH, THE SIDE CHAIN OF GLU A 4 CAN ADOPT TWO 1BPH 70

REMARK 6 ALTERNATIVE POSITIONS WHICH OVERLAP. THEIR RELATIVE WEIGHT 1BPH 71

REMARK 6 AND THE ATOMIC POSITIONS OF THE SECOND CONFORMER ARE NOT 1BPH 72

REMARK 6 ACCURATELY DETERMINED. 1BPH 73

REMARK 7 1BPH 74

REMARK 7 IN 1APH, 1BPH, AND 1DPH, THE SIDE CHAIN OF GLU B 21 IS 1BPH 75

REMARK 7 DISORDERED. IT HAS BEEN MODELED AS SUPERPOSITION OF TWO 1BPH 76

REMARK 7 CONFORMATIONS BUT ATOMIC POSITIONS FOR THESE CONFORMATIONS 1BPH 77

REMARK 7 ARE PROBABLY NOT VERY ACCURATE. 1BPH 78

REMARK 8 1BPH 79

REMARK 8 THE SIDE CHAIN OF LYS B 29 IS POORLY DEFINED IN THE 1BPH 80

REMARK 8 ELECTRON DENSITY MAPS. IN 1APH AND 1CPH, IT IS INCLUDED 1BPH 81

REMARK 8 WITH PARTIAL OCCUPANCY. IN 1BPH AND 1DPH, ITS COORDINATES 1BPH 82

REMARK 8 HAVE BEEN OMITTED FROM THE ENTRY. 1BPH 83

REMARK 13 1BPHA 5

REMARK 13 CORRECTION. RENUMBER REFERENCES SEQUENTIALLY. INSERT 1BPHA 6

REMARK 13 MISSING HET AND FORMUL RECORDS FOR NA. 31-OCT-93. 1BPHA 7

SEQRES 1 A 21 GLY ILE VAL GLU GLN CYS CYS ALA SER VAL CYS SER LEU 1BPH 106

SEQRES 2 A 21 TYR GLN LEU GLU ASN TYR CYS ASN 1BPH 107

SEQRES 1 B 30 PHE VAL ASN GLN HIS LEU CYS GLY SER HIS LEU VAL GLU 1BPH 108

SEQRES 2 B 30 ALA LEU TYR LEU VAL CYS GLY GLU ARG GLY PHE PHE TYR 1BPH 109

SEQRES 3 B 30 THR PRO LYS ALA 1BPH 110

HET DCE 200 4 1,2-DICHLOROETHANE(ETHYLENE DICHLORIDE) 1BPH 111

HET NA 88 1 SODIUM ION 1BPHA 8

FORMUL 3 DCE C2 H4 CL2 1BPH 112

FORMUL 4 NA NA1 1BPHA 9

FORMUL 5 HOH *55(H2 O1) 1BPHA 10

HELIX 1 A1 GLY A 1 VAL A 10 1 1BPH 114

HELIX 2 A2 SER A 12 GLU A 17 5 NOT IDEAL 1BPH 115

HELIX 3 B1 SER B 9 GLY B 20 1 1BPH 116

TURN 1 1B1 CYS B 19 ARG B 22 1BPH 117

TURN 2 1B2 GLY B 20 GLY B 23 1BPH 118

SSBOND 1 CYS A 6 CYS A 11 1BPH 119

SSBOND 2 CYS A 7 CYS B 7 1BPH 120

SSBOND 3 CYS A 20 CYS B 19 1BPH 121

CRYST1 78.900 78.900 78.900 90.00 90.00 90.00 I 21 3 24 1BPH 122

ORIGX1 1.000000 0.000000 0.000000 0.00000 1BPH 123

ORIGX2 0.000000 1.000000 0.000000 0.00000 1BPH 124

ORIGX3 0.000000 0.000000 1.000000 0.00000 1BPH 125

SCALE1 0.012674 0.000000 0.000000 0.00000 1BPH 126

SCALE2 0.000000 0.012674 0.000000 0.00000 1BPH 127

SCALE3 0.000000 0.000000 0.012674 0.00000 1BPH 128

ATOM 1 N GLY A 1 13.994 47.196 31.798 1.00 35.87 1BPH 129

ATOM 2 CA GLY A 1 14.277 46.226 30.708 1.00 38.67 1BPH 130

ATOM 3 C GLY A 1 15.574 45.507 31.085 1.00 31.18 1BPH 131

ATOM 4 O GLY A 1 16.078 45.660 32.217 1.00 22.60 1BPH 132

ATOM 5 N ILE A 2 16.088 44.766 30.126 1.00 28.39 1BPH 133

ATOM 6 CA ILE A 2 17.342 44.034 30.404 1.00 23.76 1BPH 134

ATOM 7 C ILE A 2 18.526 44.939 30.686 1.00 25.29 1BPH 135

ATOM 8 O ILE A 2 19.425 44.457 31.392 1.00 18.74 1BPH 136

ATOM 9 CB ILE A 2 17.571 43.072 29.158 1.00 27.36 1BPH 137

ATOM 10 CG1 ILE A 2 18.638 42.049 29.605 1.00 18.03 1BPH 138

ATOM 11 CG2 ILE A 2 17.859 43.936 27.903 1.00 25.54 1BPH 139

ATOM 12 CD1 ILE A 2 18.914 40.930 28.590 1.00 17.07 1BPH 140

ATOM 13 N VAL A 3 18.619 46.195 30.192 1.00 24.42 1BPH 141

ATOM 14 CA VAL A 3 19.774 47.080 30.436 1.00 30.26 1BPH 142

ATOM 15 C VAL A 3 19.952 47.453 31.895 1.00 19.08 1BPH 143

ATOM 16 O VAL A 3 21.018 47.421 32.561 1.00 28.15 1BPH 144

ATOM 17 CB VAL A 3 19.719 48.274 29.462 1.00 33.87 1BPH 145

ATOM 18 CG1 VAL A 3 20.847 49.225 29.754 1.00 30.40 1BPH 146

ATOM 19 CG2 VAL A 3 19.868 47.724 28.044 1.00 24.51 1BPH 147

ATOM 127 N GLU A 17 17.257 34.367 30.913 1.00 17.57 1BPH 255

ATOM 128 CA GLU A 17 16.353 33.393 30.338 1.00 13.26 1BPH 256

ATOM 129 C GLU A 17 14.968 33.889 30.001 1.00 22.70 1BPH 257

ATOM 130 O GLU A 17 14.234 33.275 29.212 1.00 25.00 1BPH 258

ATOM 131 CB GLU A 17 16.183 32.146 31.209 1.00 17.01 1BPH 259

ATOM 132 CG GLU A 17 17.252 31.160 30.695 1.00 14.38 1BPH 260

ATOM 133 CD GLU A 17 16.968 29.843 31.385 1.00 24.91 1BPH 261

ATOM 134 OE1 GLU A 17 16.230 29.713 32.350 1.00 25.72 1BPH 262

ATOM 135 OE2 GLU A 17 17.675 28.984 30.830 1.00 22.42 1BPH 263

ATOM 136 N ASN A 18 14.618 35.021 30.563 1.00 22.30 1BPH 264

ATOM 137 CA ASN A 18 13.371 35.753 30.369 1.00 29.65 1BPH 265

ATOM 138 C ASN A 18 13.330 36.318 28.943 1.00 23.17 1BPH 266

ATOM 139 O ASN A 18 12.197 36.611 28.486 1.00 30.58 1BPH 267

ATOM 172 N PHE B 1 28.961 32.694 34.302 1.00 38.09 1BPH 300

ATOM 173 CA PHE B 1 29.545 33.933 33.691 1.00 44.75 1BPH 301

ATOM 174 C PHE B 1 28.483 35.030 33.562 1.00 18.46 1BPH 302

ATOM 175 O PHE B 1 28.656 36.170 33.083 1.00 29.15 1BPH 303

ATOM 176 CB PHE B 1 30.190 33.486 32.346 1.00 36.50 1BPH 304

ATOM 177 CG PHE B 1 29.191 32.986 31.322 1.00 29.77 1BPH 305

ATOM 178 CD1 PHE B 1 28.691 31.688 31.351 1.00 22.29 1BPH 306

ATOM 179 CD2 PHE B 1 28.736 33.844 30.327 1.00 30.11 1BPH 307

ATOM 180 CE1 PHE B 1 27.758 31.234 30.415 1.00 30.11 1BPH 308

ATOM 181 CE2 PHE B 1 27.822 33.423 29.377 1.00 29.49 1BPH 309

ATOM 182 CZ PHE B 1 27.329 32.125 29.428 1.00 27.29 1BPH 310

ATOM 183 N VAL B 2 27.235 34.671 33.935 1.00 25.09 1BPH 311

ATOM 184 CA VAL B 2 26.085 35.571 33.793 1.00 23.88 1BPH 312

ATOM 185 C VAL B 2 25.902 36.506 34.969 1.00 24.42 1BPH 313

ATOM 186 O VAL B 2 25.269 37.560 34.801 1.00 19.63 1BPH 314

ATOM 187 CB VAL B 2 24.846 34.751 33.391 1.00 28.89 1BPH 315

ATOM 413 N PRO B 28 16.809 47.082 24.129 1.00 39.30 1BPH 541

ATOM 414 CA PRO B 28 17.550 47.958 25.065 1.00 50.32 1BPH 542

ATOM 415 C PRO B 28 16.747 49.100 25.692 1.00 51.41 1BPH 543

ATOM 416 O PRO B 28 16.922 49.526 26.848 1.00 52.87 1BPH 544

ATOM 417 CB PRO B 28 18.744 48.435 24.231 1.00 33.07 1BPH 545

ATOM 418 CG PRO B 28 18.261 48.353 22.779 1.00 28.91 1BPH 546

ATOM 419 CD PRO B 28 17.355 47.133 22.751 1.00 30.72 1BPH 547

ATOM 420 N LYS B 29 15.830 49.593 24.905 1.00 58.03 1BPH 548

ATOM 421 CA ALYS B 29 14.935 50.708 25.214 0.50 56.38 1BPH 549

ATOM 422 CA BLYS B 29 15.106 50.841 24.970 0.50 57.81 1BPH 550

ATOM 423 C ALYS B 29 13.602 50.396 25.876 0.50 73.09 1BPH 551

ATOM 424 C BLYS B 29 13.915 50.201 25.692 0.50 66.40 1BPH 552

ATOM 425 O ALYS B 29 13.044 51.332 26.517 0.50 80.92 1BPH 553

ATOM 426 O BLYS B 29 12.908 49.842 25.053 0.50 53.34 1BPH 554

ATOM 427 CB ALYS B 29 14.689 51.541 23.932 0.50 58.98 1BPH 555

ATOM 428 CB BLYS B 29 14.658 51.386 23.598 0.50 45.66 1BPH 556

ATOM 429 N AALA B 30 13.056 49.194 25.782 0.50 74.55 1BPH 557

ATOM 430 N BALA B 30 14.075 50.102 27.005 0.50 71.75 1BPH 558

ATOM 431 CA AALA B 30 11.762 48.878 26.416 0.50 75.29 1BPH 559

ATOM 432 CA BALA B 30 13.075 49.536 27.915 0.50 73.80 1BPH 560

ATOM 433 C AALA B 30 11.853 47.818 27.515 0.50 68.10 1BPH 561

ATOM 434 C BALA B 30 12.867 50.426 29.144 0.50 73.94 1BPH 562

ATOM 435 O AALA B 30 10.774 47.235 27.799 0.50 65.90 1BPH 563

ATOM 436 O BALA B 30 12.394 49.828 30.144 0.50 69.68 1BPH 564

ATOM 437 CB AALA B 30 10.728 48.457 25.375 0.50 76.93 1BPH 565

ATOM 438 CB BALA B 30 13.512 48.144 28.366 0.50 73.70 1BPH 566

ATOM 439 OXTAALA B 30 12.952 47.610 28.048 0.50 63.45 1BPH 567

ATOM 440 OXTBALA B 30 13.182 51.623 29.061 0.50 76.41 1BPH 568

TER 441 ALA B 30 1BPH 569

HETATM 442 CL1 DCE 200 26.950 41.213 19.536 0.50 34.85 1BPH 570

HETATM 443 C1 DCE 200 28.222 40.003 20.178 0.50 24.42 1BPH 571

HETATM 444 C2 DCE 200 28.307 38.776 19.363 0.50 24.99 1BPH 572

HETATM 445 CL2 DCE 200 26.941 37.681 19.833 0.50 33.75 1BPH 573

HETATM 446 NA NA 88 20.339 43.145 38.263 0.50 13.22 1BPH 574

HETATM 447 O HOH 1 26.102 28.408 28.110 0.33 28.57 1BPH 575

HETATM 448 O HOH 2 26.719 28.525 28.242 0.66 30.29 1BPH 576

HETATM 449 O HOH 3 19.213 33.037 38.295 1.00 42.10 1BPH 577

HETATM 450 O HOH 4 21.104 32.216 20.645 1.00 26.61 1BPH 578

HETATM 451 O HOH 5 21.954 33.637 38.117 1.00 22.77 1BPH 579

HETATM 498 O HOH 52 19.217 52.503 35.050 1.00 68.12 1BPH 626

HETATM 499 O HOH 53 15.376 24.434 25.540 1.00 82.81 1BPH 627

HETATM 500 O HOH 54 21.768 55.234 32.076 1.00 85.97 1BPH 628

HETATM 501 O HOH 55 22.667 52.737 33.359 1.00 81.22 1BPH 629

CONECT 48 47 78 1BPH 630

CONECT 54 53 235 1BPH 631

CONECT 78 48 77 1BPH 632

CONECT 161 160 331 1BPH 633

CONECT 235 54 234 1BPH 634

CONECT 331 161 330 1BPH 635

CONECT 442 443 1BPH 636

CONECT 443 442 444 1BPH 637

CONECT 444 443 445 1BPH 638

CONECT 445 444 1BPH 639

MASTER 97 0 2 3 0 2 0 6 499 2 10 5 1BPHA 11

END 1BPH
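The ATOM records above carry the atom name, residue, chain, residue number, the x, y, z coordinates in Ångström, the occupancy and the temperature factor. Real PDB files use a fixed-column layout; in this handout the columns have been collapsed to single spaces, so the sketch below simply splits on whitespace. That is an assumption that only works for the simple lines (it does not handle records with an alternate-location letter fused to the atom name, such as `OXTAALA`). The function names are our own.

```python
import math

# Two ATOM records copied from the 1BPH example above (whitespace-collapsed).
LINES = [
    "ATOM 1 N GLY A 1 13.994 47.196 31.798 1.00 35.87 1BPH 129",
    "ATOM 2 CA GLY A 1 14.277 46.226 30.708 1.00 38.67 1BPH 130",
]


def parse_atom(line):
    """Parse one whitespace-separated ATOM line into a dict.

    Assumes exactly the 13 fields seen in the simple lines of the handout
    and returns None for anything else.
    """
    f = line.split()
    if len(f) != 13 or f[0] != "ATOM":
        return None
    return {
        "serial": int(f[1]),
        "name": f[2],
        "resname": f[3],
        "chain": f[4],
        "resseq": int(f[5]),
        "xyz": (float(f[6]), float(f[7]), float(f[8])),
        "occupancy": float(f[9]),
        "bfactor": float(f[10]),
    }


def distance(a, b):
    """Euclidean distance in Angstrom between two parsed atoms."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a["xyz"], b["xyz"])))


n_atom, ca_atom = (parse_atom(line) for line in LINES)
print(n_atom["resname"], n_atom["chain"], n_atom["resseq"])
print(round(distance(n_atom, ca_atom), 2))  # N-CA bond length of Gly A1
```

The distance comes out close to 1.5 Å, as expected for a covalent N-CA bond.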

3D viewers

Rasmol www.umass.edu/microbio/rasmol/

Weblab www.msi.com

Kinemage www.cryst.bbk.ac.uk/PPS/vsns-pps/technology/kinemage.html

Chime www.umass.edu/microbio/rasmol/

----

Sequence databases

- DNA Sequence

Genbank (www.ncbi.nlm.nih.gov)

GSDB (Genome sequence database)

EMBL (European Molecular Biology Laboratory)(www.ebi.ac.uk)

DDBJ (DNA Data Bank of Japan)

Current statistics of EMBL: (www.ebi.ac.uk)

The EMBL Nucleotide Sequence Database

The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications and directly submitted by researchers and sequencing groups. Data collection is done in collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ). The database doubles in size approximately every 12 months and currently (December 1997) contains over 1,281 million bases in 1,917,868 sequence entries.

The complete database is available every three months via subscription on CD-ROM. Computer network services additionally provide access to the very latest data as well as sequence similarity searches.
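A doubling time of about 12 months means exponential growth. The sketch below shows what that would imply if the doubling time stayed constant, which is purely an assumption made for illustration and not a forecast from EMBL. The function name and the rounded starting value are ours.

```python
def projected_bases(start_bases, months, doubling_months=12.0):
    """Database size after `months`, assuming a constant doubling time."""
    return start_bases * 2 ** (months / doubling_months)


START = 1_281_000_000  # "over 1,281 million bases" in December 1997

for months in (12, 24, 36):
    size = projected_bases(START, months)
    print(months, "months:", round(size / 1e6), "million bases")
```

Under this assumption the database would be eight times its December 1997 size after three years.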

Documentation

· Release Notes

· Release 53, December 1997.

· Release 52, October 1997.

· Release 51, June 1997.

· Release 50, March 1997.

· Release 49, December 1996.

· Release 48, September 1996.

· Older releases

· User Manual, December 1997.

· DDBJ/EMBL/GenBank Feature Table Definition, v2.0, December 15, 1997.

Access

· Subscription to full releases on CD-ROM.

· FTP archive

· Full release (Release 53, December 1997).

· Updates, daily, weekly and cumulative.

· Entry retrieval by database accession number.

· E-mail access to data and sequence search services.

· EMBnet nodes.

Last modified: Thu January 08 1998 17:42

Growth of EMBL database

EMBL and Genbank formats

EMBL format

ID LISOD standard; DNA; PRO; 756 BP.

AC X64011; S78972;

NI g44010

DT 28-APR-1992 (Rel. 31, Created)

DT 30-JUN-1993 (Rel. 36, Last updated, Version 6)

DE L.ivanovii sod gene for superoxide dismutase

XX

KW sod gene; superoxide dismutase.

OS Listeria ivanovii

OC Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae;

OC Listeria.

RN [1]

RA Haas A., Goebel W.;

RT "Cloning of a superoxide dismutase gene from Listeria ivanovii by

RT functional complementation in Escherichia coli and

RT characterization of the gene product.";

RL Mol. Gen. Genet. 231:313-322(1992).

RN [2]

RP 1-756

RA Kreft J.;

RT ;

RL Submitted (21-APR-1992) on tape to the EMBL Data Library by:

RL J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg,

RL Biozentrum Am Hubland, 8700 Wuerzburg, FRG

DR SWISS-PROT; P28763; SODM_LISIV.

FH Key Location/Qualifiers

FT source 1..756

FT /organism="Listeria ivanovii"

FT /strain="ATCC 19119"

FT RBS 95..100

FT /gene="sod"

FT CDS 109..717

FT /gene="sod"

FT /EC_number="1.15.1.1"

FT /product="superoxide dismutase"

FT /db_xref="PID:g44011"

FT /db_xref="SWISS-PROT:P28763"

FT /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAV

FT SGHAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGN

FT LKAAIESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGK

FT TPVLGLDVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK*"

FT terminator 723..746

FT /gene="sod"

SQ Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;

CGTTATTTAA GGTGTTACAT AGTTCTATGG AAATAGGGTC TATACCTTTC GCCTTACAAT 60

GTAATTTCTT TTCACATAAA TAATAAACAA TCCGAGGAGG AATTTTTAAT GACTTACGAA 120

TTACCAAAAT TACCTTATAC TTATGATGCT TTGGAGCCGA ATTTTGATAA AGAAACAATG 180

GAAATTCACT ATACAAAGCA CCACAATATT TATGTAACAA AACTAAATGA AGCAGTCTCA 240

GGACACGCAG AACTTGCAAG TAAACCTGGG GAAGAATTAG TTGCTAATCT AGATAGCGTT 300

CCTGAAGAAA TTCGTGGCGC AGTACGTAAC CACGGTGGTG GACATGCTAA CCATACTTTA 360

TTCTGGTCTA GTCTTAGCCC AAATGGTGGT GGTGCTCCAA CTGGTAACTT AAAAGCAGCA 420

ATCGAAAGCG AATTCGGCAC ATTTGATGAA TTCAAAGAAA AATTCAATGC GGCAGCTGCG 480

GCTCGTTTTG GTTCAGGATG GGCATGGCTA GTAGTGAACA ATGGTAAACT AGAAATTGTT 540

TCCACTGCTA ACCAAGATTC TCCACTTAGC GAAGGTAAAA CTCCAGTTCT TGGCTTAGAT 600

GTTTGGGAAC ATGCTTATTA TCTTAAATTC CAAAACCGTC GTCCTGAATA CATTGACACA 660

TTTTGGAATG TAATTAACTG GGATGAACGA AATAAACGCT TTGACGCAGC AAAATAATTA 720

TCGAAAGGCT CACTTAGGTG GGTCTTTTTA TTTCTA 756
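The SQ line of the entry states the sequence length and the base composition, so the two can be checked against each other. A minimal Python sketch, assuming the single-space layout shown in this handout (the function name `parse_sq` is our own):

```python
import re

# SQ line copied from the EMBL entry above.
SQ_LINE = "SQ Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;"


def parse_sq(line):
    """Return (declared length, {symbol: count}) from an EMBL SQ line."""
    length = int(re.search(r"(\d+)\s+BP;", line).group(1))
    counts = {
        symbol: int(n)
        for n, symbol in re.findall(r"(\d+)\s+(A|C|G|T|other);", line)
    }
    return length, counts


length, counts = parse_sq(SQ_LINE)
gc_fraction = (counts["G"] + counts["C"]) / length

print(length, sum(counts.values()))   # both should be the same number
print(round(gc_fraction, 3))          # G+C fraction of the sequence
```

For this entry the counts add up to the declared 756 bp, and the G+C fraction is a little under 40%.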

GenBank Format

LOCUS LISOD 756 bp DNA BCT 30-JUN-1993

DEFINITION L.ivanovii sod gene for superoxide dismutase.

ACCESSION X64011 S78972

NID g44010

KEYWORDS sod gene; superoxide dismutase.

SOURCE Listeria ivanovii.

ORGANISM Listeria ivanovii

Eubacteria; Firmicutes; Low G+C gram-positive bacteria;

Bacillaceae; Listeria.

REFERENCE 1 (bases 1 to 756)

AUTHORS Haas,A. and Goebel,W.

TITLE Cloning of a superoxide dismutase gene from Listeria ivanovii by

functional complementation in Escherichia coli and characterization

of the gene product

JOURNAL Mol. Gen. Genet. 231 (2), 313-322 (1992)

MEDLINE 92140371

REFERENCE 2 (bases 1 to 756)

AUTHORS Kreft,J.

TITLE Direct Submission

JOURNAL Submitted (21-APR-1992) J. Kreft, Institut f. Mikrobiologie,

Universitaet Wuerzburg, Biozentrum Am Hubland, 8700 Wuerzburg, FRG

FEATURES Location/Qualifiers

source 1..756

/organism="Listeria ivanovii"

/strain="ATCC 19119"

/db_xref="taxon:1638"

RBS 95..100

/gene="sod"

gene 95..746

/gene="sod"

CDS 109..717

/gene="sod"

/EC_number="1.15.1.1"

/codon_start=1

/product="superoxide dismutase"

/db_xref="PID:g44011"

/db_xref="SWISS-PROT:P28763"

/transl_table=11

/translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVS

GHAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLK

AAIESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPV

LGLDVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"

terminator 723..746

/gene="sod"

BASE COUNT 247 a 136 c 151 g 222 t

ORIGIN

1 cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat

61 gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa

121 ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg

181 gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca

241 ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt

301 cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta

361 ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca

421 atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg

481 gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt

541 tccactgcta accaagattc tccacttagc gaaggtaaaa ctccagttct tggcttagat

601 gtttgggaac atgcttatta tcttaaattc caaaaccgtc gtcctgaata cattgacaca

661 ttttggaatg taattaactg ggatgaacga aataaacgct ttgacgcagc aaaataatta

721 tcgaaaggct cacttaggtg ggtcttttta tttcta
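The feature table says that the coding sequence starts at base 109, and the /translation qualifier gives the protein. The link between the two can be illustrated with a short Python sketch that reads two numbered ORIGIN lines, cuts at position 109 and translates. It uses the standard genetic code; we assume this gives the same result as /transl_table=11 for this stretch. All function names are our own.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# ORIGIN lines starting at bases 61 and 121, copied from the entry above.
ORIGIN_LINES = [
    "61 gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa",
    "121 ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg",
]


def read_origin(lines):
    """Return (position of the first base, sequence) from ORIGIN lines."""
    first = int(lines[0].split()[0])
    seq = "".join("".join(line.split()[1:]) for line in lines)
    return first, seq.upper()


def translate(dna):
    """Translate the complete codons of `dna`, reading left to right."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


first, seq = read_origin(ORIGIN_LINES)
cds_start = 109  # from the feature "CDS 109..717"
protein_start = translate(seq[cds_start - first:])
print(protein_start)
```

The result, MTYELPKLPYTYDALEPNFDKETM, matches the beginning of the /translation qualifier in the entry.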

FORMAT OF THE EMBL DATABASE

The nucleotide sequence database is composed of sequence entries. Each entry corresponds to a single contiguous sequence as contributed or reported in the literature. In many cases, entries have been assembled from several papers reporting overlapping sequence regions. Conversely, a single paper often provides data for several entries, as when homologous sequences from different organisms are compared.

· 3.1 Classes of Data

· 3.2 Database Divisions

· 3.3 Structure of an Entry

· 3.4 Line Structure

· 3.4.1 The ID Line

· 3.4.2 The AC Line

· 3.4.3 The NI Line

· 3.4.4 The DT Line

· 3.4.5 The DE Line

· 3.4.6 The KW Line

· 3.4.7 The OS Line

· 3.4.8 The OC Line

· 3.4.9 The OG Line

· 3.4.10 The Reference (RN, RC, RP, RX, RA, RT, RL) Lines

· 3.4.10.1 The RN Line

· 3.4.10.2 The RC Line

· 3.4.10.3 The RP Line

· 3.4.10.4 The RX Line

· 3.4.10.5 The RA Line

· 3.4.10.6 The RT Line

· 3.4.10.7 The RL Line

· 3.4.11 The DR Line

· 3.4.12 The FH Line

· 3.4.13 The FT Line

· 3.4.14 The SQ Line

· 3.4.15 The Sequence Data Line

· 3.4.16 The CC Line

· 3.4.17 The XX Line

· 3.4.18 The // Line
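Because every line of an EMBL entry begins with a two-letter line code (ID, AC, DE, OS, OC and so on), an entry can be taken apart with very little code. The sketch below groups lines by their code. It is only an illustration using a few lines from the LISOD entry shown earlier, with the single-space layout of this handout, and the helper name is our own.

```python
from collections import defaultdict

# A few lines copied from the EMBL entry LISOD shown earlier.
ENTRY_LINES = [
    "ID LISOD standard; DNA; PRO; 756 BP.",
    "AC X64011; S78972;",
    "DE L.ivanovii sod gene for superoxide dismutase",
    "OS Listeria ivanovii",
    "OC Eubacteria; Firmicutes; Low G+C gram-positive bacteria; Bacillaceae;",
    "OC Listeria.",
]


def group_by_line_code(lines):
    """Map each two-letter line code to the list of its line contents."""
    grouped = defaultdict(list)
    for line in lines:
        code, _, rest = line.partition(" ")
        grouped[code].append(rest.strip())
    return dict(grouped)


fields = group_by_line_code(ENTRY_LINES)
accessions = [a.strip() for a in fields["AC"][0].split(";") if a.strip()]
taxonomy = " ".join(fields["OC"])

print(accessions)
print(fields["OS"][0])
print(taxonomy)
```

Line types that continue over several lines, such as OC here, end up as a list that can be joined again.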

3.2.4 Feature key examples

Key Description

conflict Separate determinations of the "same" sequence differ

rep_origin Origin of replication

protein_bind Protein binding site on DNA

CDS Protein-coding sequence

misc_RNA Generic label for an undefined RNA

insertion_seq Insertion element

D-loop Mitochondrial or other D-loop structure

3.3.4 Qualifier examples

Key Location/Qualifiers

CDS 86..742

/product="hypoxanthine phosphoribosyltransferase"

/label=hprt

/note="hprt catalyzes vital steps in the

reutilization pathway for purine biosynthesis

and its deficiency leads to forms of ""gouty"" arthritis"

rep_origin 234..243

/direction=left

CDS 109..564

/usedin=X10009:catalase

3.5.3 Location examples

The following is a list of common location descriptors with their meanings:

Location Description

467 Points to a single base in the presented sequence

340..565 Points to a continuous range of bases bounded by and including the starting and ending bases

<345..500 Indicates that the exact lower boundary point of a feature is unknown. The location begins at some base previous to the first base specified (which need not be contained in the presented sequence) and continues to and includes the ending base

<1..888 The feature starts before the first sequenced base and continues to and includes base 888

(102.110) Indicates that the exact location is unknown but that it is one of the bases between bases 102 and 110, inclusive

(23.45)..600 Specifies that the starting point is one of the bases between bases 23 and 45, inclusive, and the end point is base 600

(122.133)..(204.221) The feature starts at a base between 122 and 133, inclusive, and ends at a base between 204 and 221, inclusive

123^124 Points to a site between bases 123 and 124

145^177 Points to a site between two adjacent bases anywhere between bases 145 and 177

complement(34..(122.126)) Start at one of the bases complementary to those between 122 and 126 on the presented strand and finish at the base complementary to base 34 (the feature is on the strand complementary to the presented strand)

join("acct",449..670) Concatenate the four bases 'acct' to the 5' end of the sequence from bases 449 to 670, inclusive

J00193:hladr Points to a feature whose location is described in another entry: the feature labelled 'hladr' in the entry (in this database) with primary accession number 'J00193'

J00194:(100..202) Points to bases 100 to 202, inclusive, in the entry (in this database) with primary accession number 'J00194'
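To show how such descriptors can be read by a program, here is a deliberately small Python sketch. It covers only three of the forms listed above (a single base, a range, and a range with an unknown lower boundary), optionally wrapped in complement(...). Fuzzy positions, join(...) and references to other entries are not handled and return None. The function name and the returned dictionary layout are our own, and `complement(34..126)` in the usage lines is a made-up example.

```python
import re

# Single base "467", range "340..565", or open lower boundary "<345..500".
_SIMPLE = re.compile(r"^(<?)(\d+)(?:\.\.(\d+))?$")


def parse_location(text):
    """Parse a simple location descriptor into a dict, or return None."""
    text = text.strip()
    strand = "+"
    if text.startswith("complement(") and text.endswith(")"):
        strand = "-"
        text = text[len("complement("):-1]
    match = _SIMPLE.match(text)
    if match is None:
        return None  # a form that this sketch does not cover
    start = int(match.group(2))
    end = int(match.group(3)) if match.group(3) else start
    return {
        "start": start,
        "end": end,
        "strand": strand,
        "start_known": match.group(1) != "<",
    }


for descriptor in ("467", "340..565", "<345..500",
                   "complement(34..126)", "(102.110)"):
    print(descriptor, "->", parse_location(descriptor))
```

Returning None for unsupported forms keeps the sketch honest about its limits: a full parser would need to handle nesting, which a single regular expression cannot do.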

EMBL divisions

1 RELEASE 53

The EMBL Nucleotide Sequence Database was frozen to make Release 53 on the 16th December 1997. The release contains 1,917,868 sequence entries comprising 1,281,391,651 nucleotides. This represents an increase of about 8% over Release 52. A breakdown of Release 53 by division is shown below:

Division Entries Nucleotides

----------------- ------- -----------

Bacteriophage 1388 2188305

ESTs 1343796 496603984

Fungi 18137 44602064

GSSs 100154 49099107

HTG 1868 102763872

Human 74384 139022655

Invertebrates 28126 107524431

Organelles 24715 22870076

Other Mammals 14429 13785092

Other Vertebrates 13145 14653255

Plants 22136 37736590

Patent 91221 29511807

Prokaryotes 42666 102750354

Rodents 37043 46489741

STSs 51172 17685717

Synthetic 2424 5377292

Unclassified 2380 2387088

Viruses 48684 46340221

----------------- -------- ------------

Total 1917868 1281391651

----------------- -------- ------------
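The table makes it easy to see how unevenly the data are spread over the divisions. The following Python sketch (names are our own) checks that the columns add up to the stated totals and computes the share of the EST division:

```python
# (entries, nucleotides) per division, copied from the Release 53 table above.
DIVISIONS = {
    "Bacteriophage": (1388, 2188305),
    "ESTs": (1343796, 496603984),
    "Fungi": (18137, 44602064),
    "GSSs": (100154, 49099107),
    "HTG": (1868, 102763872),
    "Human": (74384, 139022655),
    "Invertebrates": (28126, 107524431),
    "Organelles": (24715, 22870076),
    "Other Mammals": (14429, 13785092),
    "Other Vertebrates": (13145, 14653255),
    "Plants": (22136, 37736590),
    "Patent": (91221, 29511807),
    "Prokaryotes": (42666, 102750354),
    "Rodents": (37043, 46489741),
    "STSs": (51172, 17685717),
    "Synthetic": (2424, 5377292),
    "Unclassified": (2380, 2387088),
    "Viruses": (48684, 46340221),
}
STATED_TOTAL = (1917868, 1281391651)  # the "Total" row of the table


def column_sums(divisions):
    """Return (total entries, total nucleotides) over all divisions."""
    entries = sum(e for e, _ in divisions.values())
    nucleotides = sum(n for _, n in divisions.values())
    return entries, nucleotides


def share(divisions, name):
    """Fraction of all entries and of all nucleotides held by one division."""
    total_entries, total_nucleotides = column_sums(divisions)
    entries, nucleotides = divisions[name]
    return entries / total_entries, nucleotides / total_nucleotides


print(column_sums(DIVISIONS) == STATED_TOTAL)
est_entries, est_nucleotides = share(DIVISIONS, "ESTs")
print("ESTs: {:.1%} of entries, {:.1%} of nucleotides".format(
    est_entries, est_nucleotides))
```

ESTs account for roughly 70% of the entries but less than 40% of the nucleotides, because each EST is short.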

hum Human Sequences (Mammals, Vertebrates)

rod Rodent Sequences (Mammals, Vertebrates)

mam Other Mammal Sequences (Mammals, Vertebrates)

vrt Other Vertebrate Sequences (Vertebrates)

STS (Sequence Tagged Sites)

Sequence Tagged Sites (STSs) are short DNA segments that occur at a single location in the genome, which makes them useful tags for mapping.

EST (Expressed Sequence Tag)

Expressed Sequence Tags (ESTs) are cDNA sequences that have been reverse-transcribed from mRNA; their function is not necessarily known. They are used in the discovery of new genes, the mapping of genomes, and the identification of coding regions in genomic sequences.

Genome Survey Sequence (GSS)

The Genome Survey Sequence (GSS) division includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.

dbGSS release 030598 - Summary by Organism - March 5, 1998

Number of public entries: 88,416

Homo sapiens (human) 64,533

Arabidopsis thaliana 22,511

Trypanosoma brucei brucei 455

Trypanosoma brucei rhodesiense 324

Cryptosporidium parvum 200

Mus musculus 163

Rhodobacter sphaeroides 151

Enterococcus faecalis 41

Helicobacter pylori 21

Brugia malayi 11

Leishmania major 6

High-Throughput Genomic Sequences

The High Throughput Genomic (HTG) Sequences division was created to accommodate a growing need to make 'unfinished' genomic sequence data rapidly available to the scientific community. It was set up in a coordinated effort between the three International Nucleotide Sequence databases: DDBJ, EMBL, and GenBank. The HTG division contains 'unfinished' DNA sequences generated by the high-throughput sequencing centers. Sequence data in this division are available for BLAST homology searches against either the "htgs" database or the "month" database, which includes all new submissions for the prior month. The HTG division of GenBank was described in an article by Ouellette and Boguski, Genome Research 7(10) (1997).

Location of HTG records:

Unfinished HTG sequences containing contigs greater than 2 kb are assigned an accession number and deposited in the HTG division. A typical HTG record might consist of all the first pass sequence data generated from a single cosmid, BAC, YAC, or P1 clone which together comprise more than 2 kb and contain one or more gaps. A single accession number is assigned to this collection of sequences and each record includes a clear indication of the status (phase 1 or 2) plus a prominent warning that the sequence data is "unfinished" and may contain errors. The accession number does not change as sequence records are updated; only the most recent version of a HTG record remains in GenBank. 'Finished' HTG sequences (phase 3) retain the same accession number, but are moved into the relevant primary GenBank division. An example of a submission (one accession number) that has progressed through phase 1, phase 2, and phase 3 is available.

TIGR Microbial Database

A listing of microbial genomes that have been published or are in the process of being sequenced.

Published microbial genomes

	Link	Genome	Strain	Domain	Size (Mb)	Institution	Funding	Publication
1		*Haemophilus influenzae* Rd	KW20	B	1.83	TIGR	TIGR	Fleischmann et al., Science 269:496-512 (1995)
2		*Mycoplasma genitalium*	G-37	B	0.58	TIGR	DOE	Fraser et al., Science 270:397-403 (1995)
3		*Methanococcus jannaschii*	DSM 2661	A	1.66	TIGR	DOE	Bult et al., Science 273:1058-1073 (1996)
4		*Synechocystis* sp.	PCC 6803	B	3.57	Kazusa DNA Research Inst.		Kaneko et al., DNA Res. 3:109-136 (1996)
5		*Mycoplasma pneumoniae*	M129	B	0.81	Univ. of Heidelberg		Himmelreich et al., Nucleic Acids Res. 24:4420-4449 (1996)
6		*Saccharomyces cerevisiae*	S288C	E	13	International Consortium	EC, NHGRI, Wellcome Trust, McGill U., RIKEN	Goffeau et al., Nature 387 (Suppl.) 5-105 (1997)
7		*Helicobacter pylori*	26695	B	1.66	TIGR	TIGR	Tomb et al., Nature 388:539-547 (1997)
8		*Escherichia coli*	K-12	B	4.60	University of Wisconsin	NHGRI	Blattner et al., Science 277:1453-1474 (1997)
9		*Methanobacterium thermoautotrophicum*	delta H	A	1.75	Genome Therapeutics & Ohio State Univ.	DOE	Smith et al., J. Bacteriol. 179:7135-7155 (1997)
10		*Bacillus subtilis*	168	B	4.20	International Consortium	EC	Kunst et al., Nature 390:249-256 (1997)
11		*Archaeoglobus fulgidus*	VC-16, DSM4304	A	2.18	TIGR	DOE	Klenk et al., Nature 390:364-370 (1997)
12		*Borrelia burgdorferi*	B31	B	1.44	TIGR	Mathers Foundation	Fraser et al., Nature 390:580-586 (1997)

Microbial genomes in progress

Genome	Strain	Domain	Size (Mb)	Institution	Funding	Anticipated Publication
*Actinobacillus actinomycetemcomitans*		B	2.2	University of Oklahoma	NIDR
*Aquifex aeolicus*	VF5	B	1.50	Diversa		manuscript submitted
*Bartonella henselae*	Houston 1	B	2.00	University of Uppsala	SSF	1999
*Caulobacter crescentus*		B	3.80	TIGR	DOE
*Chlamydia pneumoniae*		B	1.00	TIGR	NIAID
*Chlamydia trachomatis mouse pneumonitis*		B	1.00	TIGR	NIAID
*Chlamydia trachomatis*	serovar D (D/UW-3/Cx)	B	1.05	UC Berkeley & Stanford	NIAID
*Chlorobium tepidum*		B	2.10	TIGR	DOE
*Clostridium acetobutylicum*	ATCC 824	B	4.1	Genome Therapeutics	DOE
*Deinococcus radiodurans*	R1	B	3.00	TIGR	DOE	1998
*Dehalococcoides ethenogenes*		B	?	TIGR	DOE
*Desulfovibrio vulgaris*		B	1.70	TIGR	DOE
*Enterococcus faecalis*		B	3.00	TIGR	NIAID	1999
*Francisella tularensis*	schu 4	B	2.00	European & North American consortium
*Halobacterium sp.*	NRC-1	A	2.50	University of Massachusetts / University of Washington
*Halobacterium salinarium*		A	4.0	Max-Planck-Institute for Biochemistry
*Legionella pneumophila*		B	4.10	TIGR
*Mycobacterium avium*		B	4.70	TIGR	NIAID	2000
*Mycobacterium tuberculosis*	CSU#93 (clinical isolate)	B	4.40	TIGR	NIAID	1998
*Mycobacterium tuberculosis*	H37Rv (lab strain)	B	4.40	Sanger Centre	Wellcome Trust
*Mycoplasma mycoides subsp. mycoides SC*	PG1	B	1.28	The Royal Institute of Technology, Stockholm & The National Veterinary Institute, Uppsala		1999
*Neisseria gonorrhoeae*		B	2.20	University of Oklahoma	NIAID
*Neisseria meningitidis*	MC58	B	2.30	TIGR	TIGR
*Neisseria meningitidis*	serogroup A strain Z2491	B	2.30	Sanger Centre	Wellcome Trust
*Plasmodium falciparum* Chr1 (isolate 3D7)		E	0.8	Sanger Centre	Wellcome Trust
*Plasmodium falciparum* Chr2 (isolate 3D7)		E	1.00	TIGR / NMRI	NIAID	1998
*Plasmodium falciparum* Chr3 (isolate 3D7)		E	1.20	Sanger Centre	Wellcome Trust
*Plasmodium falciparum* Chr4 (isolate 3D7)		E	1.5	Sanger Centre	Wellcome Trust
*Plasmodium falciparum* Chr9 (isolate 3D7)		E	1.80	TIGR / NMRI
*Plasmodium falciparum* Chr10 (isolate 3D7)		E	2.10	TIGR / NMRI
*Plasmodium falciparum* Chr12 (isolate 3D7)		E	2.4	Stanford University	Burroughs Wellcome Fund
*Plasmodium falciparum* Chr14 (isolate 3D7)		E	3.4	TIGR / NMRI	Burroughs Wellcome Fund
*Porphyromonas gingivalis*	W83	B	2.20	TIGR / Forsyth Dental Center	NIDR	1999
*Pseudomonas aeruginosa*	PAO1	B	5.90	University of Washington / PathoGenesis	Cystic Fibrosis Foundation / PathoGenesis
*Pseudomonas putida*		B	5.00	TIGR	DOE
*Pyrobaculum aerophilum*		A	2.22	Caltech / UCLA	ONR / DOE
*Pyrococcus furiosus*		A	2.10	Center of Marine Biotechnology / Univ. Utah	DOE
*Pyrococcus horikoshii (shinkaj)*	OT3	A	1.80	NITE
*Rhodobacter capsulatus*	SB1003	B	1.80	University of Chicago		1998
*Rickettsia prowazekii*	Madrid E	B	1.10	University of Uppsala	SSF / NFR	1998
*Salmonella typhimurium*		B	4.50	TIGR
*Shewanella putrefaciens*	MR-1	B	4.50	TIGR	DOE
*Staphylococcus aureus*		B	2.80	TIGR	NIAID / MGRI
*Streptococcus pneumoniae*	type 4	B	2.20	TIGR	TIGR / NIAID / MGRI	1998
*Streptococcus pyogenes*		B	1.98	University of Oklahoma	NIAID
*Streptomyces coelicolor*	A3(2)	B	8.0	Sanger Centre / John Innes Centre	BBSRC
*Sulfolobus solfataricus*		A	3.05	Canadian & European Consortium
*Thermotoga maritima*	MSB8	B	1.80	TIGR	DOE	1998
*Thiobacillus ferrooxidans*		B	2.90	TIGR	DOE
*Treponema denticola*		B	3.00	TIGR / Univ. Texas
*Treponema pallidum*	Nichols	B	1.14	TIGR / Univ. Texas	NIAID	manuscript in preparation
*Thermoplasma acidophilum*		A	1.7	Max-Planck-Institute for Biochemistry
*Ureaplasma urealyticum*	serovar 3	B	0.75	U. Alabama / PE-ABI	PE-ABI / NIH / UAB	1998
*Vibrio cholerae*	serotype O1, Biotype El Tor, strain N16961	B	2.50	TIGR	NIAID	1998
*Xylella fastidiosa*	8.1.b clone 9.a.5.c	B	2.00	Brazilian Consortium	FAPESP	2000

KEY

Domain

A: Archaea

B: Eubacteria

E: Eukaryote

Map of Mycoplasma genitalium genome.

The flow of genetic information

DNA -> RNA -> protein -> conformation

Translation products of DNA - Amino acids in three letter code

ValArgIleArgIleSerAsp

TyrGlyPheGlyPheArgMet

ThrAspSerAspPheGlyCys

5' GUACGGAUUCGGAUUUCGGAUGC 3'

3' CAUGCCUAAGCCUAAAGCCUACG 5'

TyrProAsnProAsnArgIle

ValSerGluSerLysProHis

ArgIleArgIleGluSerAla

Amino acids in one letter code

V R I R I S D

Y G F G F R M

T D S D F G C

5' GUACGGAUUCGGAUUUCGGAUGC 3'

3' CAUGCCUAAGCCUAAAGCCUACG 5'

Y P N P N R I

V S E S K P H

R I R I E S A
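
The six reading frames shown above can be reproduced with a short script. This is a minimal sketch, assuming the standard genetic code and an RNA alphabet (A, C, G, U); the function names are illustrative and not part of any course software. Note that the three frames of the lower strand are printed above underneath the strand itself, so they read right to left; the script returns them in the conventional 5' to 3' direction.

```python
from itertools import product

# Standard genetic code: codons enumerated in U, C, A, G order,
# paired with one-letter amino acid codes ('*' marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the complementary strand, written 5' to 3'."""
    return rna.translate(COMPLEMENT)[::-1]


def translate(rna, offset=0):
    """Translate complete codons only, starting at the given offset (0, 1 or 2)."""
    codons = (rna[i:i + 3] for i in range(offset, len(rna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frames(rna):
    """Three frames of the given strand followed by three frames of its reverse complement."""
    rc = reverse_complement(rna)
    return [translate(rna, k) for k in range(3)] + [translate(rc, k) for k in range(3)]


for frame in six_frames("GUACGGAUUCGGAUUUCGGAUGC"):
    print(frame)
```

Reversing each of the last three results gives the strings printed under the lower strand above.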

Three- and one-letter codes of the amino acids.

Alanine Ala A

Arginine Arg R

Asparagine Asn N

Aspartate Asp D

Cysteine Cys C

Glutamate Glu E

Glutamine Gln Q

Glycine Gly G

Histidine His H

Isoleucine Ile I

Leucine Leu L

Lysine Lys K

Methionine Met M

Phenylalanine Phe F

Proline Pro P

Serine Ser S

Threonine Thr T

Tryptophan Trp W

Tyrosine Tyr Y

Valine Val V

4. THE GENETIC CODE

UUU Phe UCU Ser UAU Tyr UGU Cys

UUC Phe UCC Ser UAC Tyr UGC Cys

UUA Leu UCA Ser UAA Stop UGA Stop

UUG Leu UCG Ser UAG Stop UGG Trp

CUU Leu CCU Pro CAU His CGU Arg

CUC Leu CCC Pro CAC His CGC Arg

CUA Leu CCA Pro CAA Gln CGA Arg

CUG Leu CCG Pro CAG Gln CGG Arg

AUU Ile ACU Thr AAU Asn AGU Ser

AUC Ile ACC Thr AAC Asn AGC Ser

AUA Ile ACA Thr AAA Lys AGA Arg

AUG Met ACG Thr AAG Lys AGG Arg

GUU Val GCU Ala GAU Asp GGU Gly

GUC Val GCC Ala GAC Asp GGC Gly

GUA Val GCA Ala GAA Glu GGA Gly

GUG Val GCG Ala GAG Glu GGG Gly

Table I. The genetic code

Sequence symbols: Nucleotides

Symbol Meaning Complement

A A T

C C G

G G C

T/U T A

M A or C K

R A or G Y

W A or T W

S C or G S

Y C or T R

K G or T M

V A or C or G B

H A or C or T D

D A or G or T H

B C or G or T V

X/N G or A or T or C X

. not G or A or T or C .
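
The symbol table above is enough to complement a sequence that contains ambiguity codes. The following is a minimal sketch using the DNA alphabet of the table (U is treated like T, as in the T/U row); the dictionary and function names are illustrative. The small consistency check confirms that the complement of each ambiguity symbol stands for exactly the complements of the bases it represents.

```python
# Meaning of each symbol, copied from the table above.
MEANING = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}

# Complement column of the table; U is handled like T.
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "U": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S", "Y": "R", "K": "M",
    "V": "B", "H": "D", "D": "H", "B": "V",
    "N": "N", "X": "X", ".": ".",
}


def reverse_complement(seq):
    """Complement every symbol and reverse the result (5' to 3')."""
    return "".join(COMPLEMENT[symbol] for symbol in reversed(seq.upper()))


def table_is_consistent():
    """Check that complementing a symbol equals complementing each base it stands for."""
    for symbol, bases in MEANING.items():
        complemented = {COMPLEMENT[base] for base in bases}
        if complemented != set(MEANING[COMPLEMENT[symbol]]):
            return False
    return True


print(reverse_complement("ATGCRYMK"))
print(table_is_consistent())
```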

Deviations from the standard genetic code

# Ciliate protozoa

UAA = Gln:Q

UAG = Gln:Q

# Yeast mitochondria

UGA = Trp:W

CUU = Thr:T

CUC = Thr:T

CUA = Thr:T

CUG = Thr:T

AUA = Met:M

# Mammalian mitochondria

UGA = Trp:W

AUU = Ile:I

AUC = Ile:I

AUA = Met:M

AGA = * :*

AGG = * :*

# Drosophila mitochondria

UGA = Trp:W

AUU = Ile:I

AUA = Met:M

AGA = Ser:S

AGG = Ser:S

# Mycoplasma

UGA = Trp:W
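
One way to work with these deviations is to treat each of them as a small set of overrides applied on top of the standard code. The sketch below does that for the reassignments listed above; entries that merely restate the standard code (such as AUU = Ile) are left out. The dictionary keys and function names are illustrative assumptions, not identifiers from any existing program.

```python
from itertools import product

# Standard genetic code (codons in U, C, A, G order; '*' = stop).
STANDARD = dict(zip(
    ("".join(codon) for codon in product("UCAG", repeat=3)),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))

# Reassigned codons only, taken from the list above.
DEVIATIONS = {
    "ciliate protozoa": {"UAA": "Q", "UAG": "Q"},
    "yeast mitochondria": {"UGA": "W", "CUU": "T", "CUC": "T",
                           "CUA": "T", "CUG": "T", "AUA": "M"},
    "mammalian mitochondria": {"UGA": "W", "AUA": "M", "AGA": "*", "AGG": "*"},
    "drosophila mitochondria": {"UGA": "W", "AUA": "M", "AGA": "S", "AGG": "S"},
    "mycoplasma": {"UGA": "W"},
}


def genetic_code(name=None):
    """Return the standard table, optionally with one set of deviations applied."""
    table = dict(STANDARD)
    if name is not None:
        table.update(DEVIATIONS[name])
    return table


def translate(rna, table):
    """Translate complete codons in frame 1 using the given table."""
    return "".join(table[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


message = "AUGUGAAUA"
print(translate(message, genetic_code()))
print(translate(message, genetic_code("mammalian mitochondria")))
```

The same message is read as Met-Stop-Ile with the standard code and as Met-Trp-Met in mammalian mitochondria, which is why the correct table must be chosen before translating organelle or Mycoplasma sequences.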

Codon usage for enteric bacterial (highly expressed) genes 7/19/83

AmAcid Codon Number /1000 Fraction ..

Gly GGG 13.00 1.89 0.02

Gly GGA 3.00 0.44 0.00

Gly GGU 365.00 52.99 0.59

Gly GGC 238.00 34.55 0.38

Glu GAG 108.00 15.68 0.22

Glu GAA 394.00 57.20 0.78

Asp GAU 149.00 21.63 0.33

Asp GAC 298.00 43.26 0.67

Val GUG 93.00 13.50 0.16

Val GUA 146.00 21.20 0.26

Val GUU 289.00 41.96 0.51

Val GUC 38.00 5.52 0.07

Ala GCG 161.00 23.37 0.26

Ala GCA 173.00 25.12 0.28

Ala GCU 212.00 30.78 0.35

Ala GCC 62.00 9.00 0.10

Arg AGG 1.00 0.15 0.00

Arg AGA 0.00 0.00 0.00

Ser AGU 9.00 1.31 0.03

Ser AGC 71.00 10.31 0.20

Lys AAG 111.00 16.11 0.26

Lys AAA 320.00 46.46 0.74

Asn AAU 19.00 2.76 0.06

Asn AAC 274.00 39.78 0.94

Met AUG 170.00 24.68 1.00

Ile AUA 1.00 0.15 0.00

Ile AUU 70.00 10.16 0.17

Ile AUC 345.00 50.09 0.83

Thr ACG 25.00 3.63 0.07

Thr ACA 14.00 2.03 0.04

Thr ACU 130.00 18.87 0.35

Thr ACC 206.00 29.91 0.55

Trp UGG 55.00 7.98 1.00

End UGA 0.00 0.00 0.00

Cys UGU 22.00 3.19 0.49

Cys UGC 23.00 3.34 0.51

End UAG 0.00 0.00 0.00

End UAA 0.00 0.00 0.00

Tyr UAU 51.00 7.40 0.25

Tyr UAC 157.00 22.79 0.75

Leu UUG 18.00 2.61 0.03

Leu UUA 12.00 1.74 0.02

Phe UUU 51.00 7.40 0.24

Phe UUC 166.00 24.10 0.76

Ser UCG 14.00 2.03 0.04

Ser UCA 7.00 1.02 0.02

Ser UCU 120.00 17.42 0.34

Ser UCC 131.00 19.02 0.37

Arg CGG 1.00 0.15 0.00

Arg CGA 2.00 0.29 0.01

Arg CGU 290.00 42.10 0.74

Arg CGC 96.00 13.94 0.25

Gln CAG 233.00 33.83 0.86

Gln CAA 37.00 5.37 0.14

His CAU 18.00 2.61 0.17

His CAC 85.00 12.34 0.83

Leu CUG 480.00 69.69 0.83

Leu CUA 2.00 0.29 0.00

Leu CUU 25.00 3.63 0.04

Leu CUC 38.00 5.52 0.07

Pro CCG 190.00 27.58 0.77

Pro CCA 36.00 5.23 0.15

Pro CCU 19.00 2.76 0.08

Pro CCC 1.00 0.15 0.00
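
The two derived columns of the table can be recomputed from the Number column. Judging from the figures, "/1000" is the count per thousand codons in the whole data set and "Fraction" is the share of a codon among all codons for the same amino acid; this interpretation is inferred from the numbers and is an assumption, as is the function name. The counts in the table sum to 6888 codons, which is passed in explicitly because only the glycine rows are reproduced here.

```python
from collections import defaultdict


def codon_usage(counts, total_codons=None):
    """Return {codon: (per_thousand, fraction)} from raw counts.

    counts maps codon -> (amino_acid, number).
    total_codons defaults to the sum of the counts supplied.
    """
    if total_codons is None:
        total_codons = sum(number for _, number in counts.values())

    # Total count per amino acid, used for the synonymous-codon fraction.
    per_amino_acid = defaultdict(float)
    for amino_acid, number in counts.values():
        per_amino_acid[amino_acid] += number

    usage = {}
    for codon, (amino_acid, number) in counts.items():
        per_thousand = 1000.0 * number / total_codons
        family_total = per_amino_acid[amino_acid]
        fraction = number / family_total if family_total else 0.0
        usage[codon] = (round(per_thousand, 2), round(fraction, 2))
    return usage


# Glycine rows from the table above.
GLY = {"GGG": ("Gly", 13), "GGA": ("Gly", 3), "GGU": ("Gly", 365), "GGC": ("Gly", 238)}

for codon, values in codon_usage(GLY, total_codons=6888).items():
    print(codon, values)
```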

Swissprot

The SWISS-PROT Protein Sequence Data Bank is a database of protein sequences produced collaboratively by Amos Bairoch (University of Geneva) and the EBI. It contains high-quality annotation, is non-redundant, and is cross-referenced to many other databases.

Release 35.0 of SWISS-PROT contains 69'113 sequence entries, comprising 25'083'768 amino acids abstracted from 59'101 references.

SWISS-PROT is accompanied by TREMBL, a computer-annotated supplement to SWISS-PROT. TREMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database that are not yet integrated into SWISS-PROT. TREMBL can be considered a preliminary section of SWISS-PROT, as all TREMBL entries have been assigned SWISS-PROT accession numbers. TREMBL is split into two main sections: SP-TREMBL and REM-TREMBL. SP-TREMBL (SWISS-PROT TREMBL) contains the entries which should eventually be incorporated into SWISS-PROT.

Release 5 of TREMBL is created from release 53 of the EMBL nucleotide sequence database and contains 166'361 sequence entries, comprising 45'671'684 amino acids.

Swissprot entry

ID PRIO_HUMAN STANDARD; PRT; 253 AA.

AC P04156;

DT 01-NOV-1986 (REL. 03, CREATED)

DT 01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)

DT 01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)

DE MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).

GN PRNP.

OS HOMO SAPIENS (HUMAN).

OC EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;

OC EUTHERIA; PRIMATES.

RN [1]

RP SEQUENCE FROM N.A.

RX MEDLINE; 86300093.

RA KRETZSCHMAR H.A., STOWRING L.E., WESTAWAY D., STUBBLEBINE W.H.,

RA PRUSINER S.B., DEARMOND S.J.;

RL DNA 5:315-324(1986).

RN [2]

RP SEQUENCE OF 8-253 FROM N.A.

RX MEDLINE; 86261778.

RA LIAO Y.-C.J., LEBO R.V., CLAWSON G.A., SMUCKLER E.A.;

RL SCIENCE 233:364-367(1986).

RN [3]

RP VARIANT AMYLOID GSS, SEQUENCE OF 58-85 AND 111-150.

RX MEDLINE; 91160504.

RA TAGLIAVINI F., PRELLI F., GHISO J., BUGIANI O., SERBAN D.,

RA PRUSINER S.B., FARLOW M.R., GHETTI B., FRANGIONE B.;

RL EMBO J. 10:513-519(1991).

RN [4]

RP REVIEW ON VARIANTS.

RX MEDLINE; 93372867.

RA PALMER M.S., COLLINGE J.;

RL HUM. MUTAT. 2:168-173(1993).

RN [5]

RP REVIEW ON VARIANTS.

RX MEDLINE; 94029646.

RA PRUSINER S.B.;

RL ARCH. NEUROL. 50:1129-1153(1993).

RN [6]

RP VARIANT GSS LEU-102.

RX MEDLINE; 89159432.

RA HSIAO K., BAKER H.F., CROW T.J., POULTER M., OWEN F.,

RA TERWILLIGER J.D., WESTAWAY D., OTT J., PURSINER S.B.;

RL NATURE 338:342-345(1989).

RN [7]

RP VARIANTS LEU-102; VAL-117 AND VAL-129.

RX MEDLINE; 89392018.

RA DOH-URA K., TATEISHI J., SASAKI H., KITAMOTO T., SAKAKI Y.;

RL BIOCHEM. BIOPHYS. RES. COMMUN. 163:974-979(1989).

RN [8]

RP VARIANT FFI ASN-178.

RX MEDLINE; 92195483.

RA MEDORI R., MONTAGNA P., TRITSCHLER H.J., LEBLANC A., CORTELLI P.,

RA TINUPER P., LUGARESI E., GAMBETTI P.;

RL NEUROLOGY 42:669-670(1992).

RN [9]

RP VARIANT CJD ASN-178.

RX MEDLINE; 91124933.

RA GOLDFARB L.G., HALTIA M., BROWN P., NIETO A., KOVANEN J.,

RA MCCOMBIE W.R., TRAPP S., GAJDUSEK D.C.;

RL LANCET 337:425-425(1991).

RN [10]

RP VARIANT CJD LYS-200.

RX MEDLINE; 90355709.

RA GOLDFARB L., MITROVA E., BROWN P., TOH B.K., GAJDUSEK D.C.;

RL LANCET 336:514-515(1990).

RN [11]

RP VARIANT GSS ARG-217.

RX MEDLINE; 93250977.

RA HSIAO K., DLOUHY S.R., FARLOW M.R., CASS C., DA COSTA M.,

RA CONNEALLY P.M., HODES M.E., GHETTI B., PRUSINER S.B.;

RL NAT. GENET. 1:68-71(1992).

RN [12]

RP VARIANTS CJD ILE-180 AND ARG-223.

RX MEDLINE; 93213314.

RA KITAMOTO T., OHTA M., DOH-URA K., HITOSHI S., TERAO Y., TATEISHI J.;

RL BIOCHEM. BIOPHYS. RES. COMMUN. 191:709-714(1993).

RN [13]

RP VARIANT CJD ILE-210.

RX MEDLINE; 94071412.

RA POCCHIARI M., SALVATORE M., CUTRUZZOLA F., GENUARDI M.,

RA ALLCATELLI C.T., MASULLO C., MACCHI G., ALEMA G., GALGANI S., XI Y.G.,

RA PETRAROLI R., SILVESTRINI M.C., BRUNORI M.;

RL ANN. NEUROL. 34:802-807(1993).

RN [14]

RP VARIANT GSS LEU-105.

RX MEDLINE; 94077414.

RA YAMADA M., ITOH Y., FUJIGASAKI H., NARUSE S., KANEKO K., KITAMOTO T.,

RA TATEISHI J., OTOMO E., HAYAKAWA M., TANAKA J., MATSUSHITA M.,

RA MIYATAKE T.;

RL NEUROLOGY 43:2723-2724(1993).

RN [15]

RP VARIANT GSS LEU-105.

RX MEDLINE; 95213742.

RA ITOH Y., YAMADA M., HAYAKAWA M., SHOZAWA T., TANAKA J., MATSUSHITA M.,

RA KITAMOTO T., TATEISHI J., OTOMO E.;

RL J. NEUROL. SCI. 127:77-86(1994).

RN [16]

RP VARIANT CJD LYS-200.

RX MEDLINE; 94142912.

RA INOUE I., KITAMOTO T., DOH-URA K., SHII H., GOTO I., TATEISHI J.;

RL NEUROLOGY 44:299-301(1994).

RN [17]

RP VARIANT CJD LYS-200.

RX MEDLINE; 94316708.

RA GABIZON R., ROSENMAN H., MEINER Z., KAHANA I., KAHANA E., SHUGART Y.,

RA OTT J., PRUSINER S.B.;

RL PHILOS. TRANS. R. SOC. LOND., B, BIOL. SCI. 343:385-390(1994).

RN [18]

RP VARIANT GSS LEU-102.

RX MEDLINE; 95303274.

RA YOUNG K., JONES C.K., PICCARDO P., LAZZARINI A., GOLBE L.I.,

RA ZIMMERMAN T.R., DICKSON D.W., MCLACHLAN D.C., ST GEORGE-HYSLOP P.,

RA LENNOX A.;

RL NEUROLOGY 45:1127-1134(1995).

CC -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE

CC HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.

CC -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED

CC "RODS".

CC -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.

CC -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND

CC ANIMALS INFECTED WITH NEURODEGENERATIVE DISEASES KNOWN AS

CC TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION DISEASES, LIKE:

CC CREUTZFELDT-JACOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME

CC (GSS), FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE

CC IN SHEEP AND GOAT; BOVINE SPONGIFORM ENCEPHALOPATHY (BSE) IN

CC CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME); CHRONIC WASTING

CC DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM

CC ENCEPHALOPATHY (FSE) IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY

CC (EUE) IN NYALA AND GREATER KUDU. THE PRION DISEASES ILLUSTRATE

CC THREE MANIFESTATIONS OF CNS DEGENERATION: (1) INFECTIOUS (2)

CC SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME, CWD, BSE, FSE,

CC EUE ARE ALL THOUGHT TO OCCUR AFTER CONSUMPTION OF PRION-INFECTED

CC FOODSTUFFS.

CC -!- DISEASE: CJD OCCURS PRIMARILY AS A SPORADIC DISORDER (1 PER

CC MILLION), WHILE 10-15% ARE FAMILIAL. ACCIDENTAL TRANSMISSION OF

CC CJD TO HUMANS APPEARS TO BE IATROGENIC (CONTAMINATED HUMAN GROWTH

CC HORMONE (HGH), CORNEAL TRANSPLANTATION, ELECTROENCEPHALOGRAPHIC

CC ELECTRODE IMPLANTATION. . .). EPIDEMIOLOGIC STUDIES HAVE FAILED TO

CC IMPLICATE THE INGESTION OF INFECTED ANNIMAL MEAT IN THE

CC PATHOGENESIS OF CJD IN HUMAN. THE TRIAD OF MICROSCOPIC FEATURES

CC THAT CHARACTERIZE THE PRION DISEASES CONSISTS OF (1) SPONGIFORM

CC DEGENERATION OF NEURONS, (2) SEVERE ASTROCYTIC GLIOSIS THAT OFTEN

CC APPEARS TO BE OUT OF PROPORTION TO THE DEGREE OF NERF CELL LOSS,

CC AND (3) AMYLOID PLAQUE FORMATION. CJD IS CHARACTERIZED BY

CC PROGRESSIVE DEMENTIA AND MYOCLONIC SEIZURES, AFFECTING ADULTS IN

CC MID-LIFE. SOME PATIENTS PRESENT SLEEP DISORDERS, ABNORMALITIES OF

CC HIGH CORTICAL FUNCTION, CEREBELLAR AND CORTICOSPINAL DISTURBANCES.

CC THE DISEASE ENDS IN DEATH AFTER A 3-12 MONTHS ILLNESS.

CC -!- DISEASE: GSS IS A HETEROGENEOUS DISORDER AND WAS DEFINED AS A

CC "SPINOCEREBELLAR ATAXIA WITH DEMENTIA AND PLAQUELIKE DEPOSITS".

CC GSS INCIDENCE IS LESS THAN 2 PER 100 MILLION.

CC -!- DISEASE: KURU IS TRANSMITTED DURING RITUALISTIC CANNIBALISM, AMONG

CC NATIVES OF THE NEW GUINEA HIGHLANDS. PATIENTS EXHIBIT VARIOUS

CC MOVEMENT DISORDERS LIKE CEREBELLAR ABNORMALITIES, RIGIDITY OF THE

CC LIMBS, AND CLONUS. EMOTIONNAL LABILITY IS PRESENT, AND DEMENTIA IS

CC CONSPICUOUSLY ABSENT. DEATH USUALLY OCCURS FROM 3 TO 12 MONTH

CC AFTER ONSET.

CC -!- SIMILARITY: TO OTHER PRP.

CC -!- DATABASE: NAME=HotMolecBase; NOTE=PrP entry;

CC WWW="http://bioinformatics.weizmann.ac.il/hotmolecbase/entries/prp.htm".

DR EMBL; M13667; G190470; -.

DR EMBL; M13899; G190468; -.

DR EMBL; D00015; G220016; -.

DR PIR; A05017; A05017.

DR PIR; A24173; A24173.

DR PIR; S14078; S14078.

DR MIM; 176640; -.

DR MIM; 123400; -.

DR MIM; 137440; -.

DR MIM; 245300; -.

DR MIM; 600072; -.

DR PROSITE; PS00291; PRION_1; 1.

DR PROSITE; PS00706; PRION_2; 1.

KW PRION; BRAIN; GLYCOPROTEIN; GPI-ANCHOR; REPEAT; SIGNAL;

KW POLYMORPHISM; DISEASE MUTATION.

FT SIGNAL 1 22

FT CHAIN 23 230 MAJOR PRION PROTEIN.

FT PROPEP 231 253 REMOVED IN MATURE FORM (BY SIMILARITY).

FT LIPID 230 230 GPI-ANCHOR (BY SIMILARITY).

FT CARBOHYD 181 181 PROBABLE.

FT CARBOHYD 197 197 PROBABLE.

FT DISULFID 179 214 BY SIMILARITY.

FT DOMAIN 51 91 5 X 8 AA TANDEM REPEATS OF P-H-G-G-G-W-G-

FT Q.

FT REPEAT 51 59 1.

FT REPEAT 60 67 2.

FT REPEAT 68 75 3.

FT REPEAT 76 83 4.

FT REPEAT 84 91 5.

FT VARIANT 102 102 P -> L (IN GSS).

FT VARIANT 105 105 P -> L (IN GSS).

FT VARIANT 117 117 A -> V (LINKED TO DEVELOPMENT OF

FT DEMENTING GSS).

FT VARIANT 129 129 M -> V (DETERMINES THE DISEASE PHENOTYPE

FT IN PATIENTS WHO HAVE A PRP MUTATION AT

FT CODON 178: PATIENTS WITH MET DEVELOP FFI,

FT THOSE WITH VAL DEVELOP CJD).

FT VARIANT 178 178 D -> N (IN FFI AND CJD).

FT VARIANT 180 180 V -> I (IN CJD).

FT VARIANT 198 198 F -> S (IN A ATYPICAL FORM OF GSS WITH

FT NEUROFIBRILLARY TANGLES).

FT VARIANT 200 200 E -> K (IN CJD).

FT VARIANT 210 210 V -> I (IN CJD).

FT VARIANT 217 217 Q -> R (IN GSS WITH NEUROFIBRILLARY

FT TANGLES).

FT VARIANT 232 232 M -> R (IN CJD).

FT CONFLICT 118 118 MISSING (IN REF. 2).

SQ SEQUENCE 253 AA; 27661 MW; FD5373AD CRC32;

MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP

HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA

VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV

NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV

ILLISFLIFL IVG
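
Because every line of an entry starts with a two-letter line code (ID, AC, DE, FT, SQ, ...), the format is easy to read with a program. The following is a minimal sketch, not a complete parser: it groups lines by their code and joins the sequence lines that follow the SQ line. The function name is illustrative, only a few lines of the entry above are reproduced, and the closing "//" terminator is assumed from the flat-file format since it is not shown in the printout.

```python
def parse_entry(text):
    """Return (fields, sequence) for one flat-file entry.

    fields maps each two-letter line code to the list of its line contents.
    """
    fields = {}
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line.startswith("//"):      # end of entry
            break
        if in_sequence:
            # Sequence lines are blocks of 10 residues separated by spaces.
            sequence_parts.append(line.replace(" ", ""))
            continue
        code, _, value = line.partition(" ")
        fields.setdefault(code, []).append(value.strip())
        if code == "SQ":
            in_sequence = True
    return fields, "".join(sequence_parts)


ENTRY = """\
ID PRIO_HUMAN STANDARD; PRT; 253 AA.
AC P04156;
GN PRNP.
OS HOMO SAPIENS (HUMAN).
SQ SEQUENCE 253 AA; 27661 MW; FD5373AD CRC32;
MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP
HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA
VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV
NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV
ILLISFLIFL IVG
//
"""

fields, sequence = parse_entry(ENTRY)
declared_length = int(fields["ID"][0].split()[-2])
print(fields["AC"][0], declared_length, len(sequence))
```

Comparing the length declared on the ID line with the length of the parsed sequence is a cheap sanity check when entries have been copied or reformatted.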

PROSITE entries.

Example I

ID ATP_GTP_A; PATTERN.

AC PS00017;

DT APR-1990 (CREATED); APR-1990 (DATA UPDATE); NOV-1990 (INFO UPDATE).

DE ATP/GTP-binding site motif A (P-loop).

PA [AG]-x(4)-G-K-[ST].

CC /TAXO-RANGE=ABEPV;

3D 1EFM; 1ETU; 1Q21; 2Q21; 4Q21; 5Q21; 6Q21;

DO PDOC00017;

Example II

ID ZINC_FINGER_C2H2; PATTERN.

AC PS00028;

DT APR-1990 (CREATED); JUN-1994 (DATA UPDATE); NOV-1997 (INFO UPDATE).

DE Zinc finger, C2H2 type, domain.

PA C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.

NR /RELEASE=35,69113;

NR /TOTAL=1932(412); /POSITIVE=1891(372); /UNKNOWN=6(6); /FALSE_POS=35(34);

NR /FALSE_NEG=3; /PARTIAL=1;

CC /TAXO-RANGE=??E?V; /MAX-REPEAT=37;

CC /SITE=1,zinc; /SITE=3,zinc; /SITE=7,zinc; /SITE=9,zinc;

DR P21192, ACE2_YEAST, T; P07248, ADR1_YEAST, T; P39413, AEF1_DROME, T;

DR Q00900, AGIE_RAT , T; P41696, AZF1_YEAST, T; Q01954, BASO_HUMAN, T;

DR P41182, BCL6_HUMAN, T; P41183, BCL6_MOUSE, T; P55201, BR14_HUMAN, T;

DR Q01295, BRC1_DROME, T; Q01296, BRC2_DROME, T; Q01293, BRC3_DROME, T;

DR P10069, BRLA_EMENI, T; Q01713, BTEB_RAT , T; Q01522, CF23_DROME, T;

DR P20385, CF2_DROME , T; P19538, CID_DROME , T; Q05620, CREA_ASPNG, T;

DR Q01981, CREA_EMENI, T; Q08705, CTCF_CHICK, T; P49711, CTCF_HUMAN, T;

DR P36197, DEFI_CHICK, T; P23792, DISC_DROME, T; P26632, EGR1_BRARE, T;

DR P18146, EGR1_HUMAN, T; P08046, EGR1_MOUSE, T; P08154, EGR1_RAT , T;

DR Q05159, EGR2_BRARE, T; P26633, EGR2_CRILO, T; P26634, EGR2_DUSTH, T;

DR P11161, EGR2_HUMAN, T; P08152, EGR2_MOUSE, T; P26635, EGR2_POERE, T;

DR P51774, EGR2_RAT , T; Q08427, EGR2_XENLA, T; Q06889, EGR3_HUMAN, T;

DR P43300, EGR3_MOUSE, T; P43301, EGR3_RAT , T; Q05215, EGR4_HUMAN, T;

... (I edited out a lot of very interesting information here ...)

DR P49782, S3AE_BACSU, F; P24804, TA29_TOBAC, F; P36810, VE6_HPV32 , F;

DR P15024, VL1_REOVD , F; P03527, VSI3_REOVD, F; P30211, VSI3_REOVJ, F;

DR P07939, VSI3_REOVL, F; Q93098, WN8B_HUMAN, F; P51028, WNT8_BRARE, F;

DR P28026, WNT8_XENLA, F; P20201, Y15K_SSV1 , F; P20198, Y5K6_SSV1 , F;

DR P43558, YFE4_YEAST, F; P37127, YFFG_ECOLI, F; P38890, YH07_YEAST, F;

DR Q09441, YP83_CAEEL, F;

3D 1ARD; 1ARE; 1ARF; 1PAA; 1ZAA; 1AAY; 2GLI; 1SP1; 1SP2; 1NCS; 1ZFD; 1TF3;

3D 2DRP; 1ZNF; 3ZNF; 4ZNF; 1BBO; 5ZNF; 7ZNF;

DO PDOC00028;
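
The PA lines of the two examples use a simple pattern language: elements are separated by "-", "x" stands for any residue, square brackets list the allowed residues, and a number or range in parentheses repeats the preceding element. Such a pattern maps directly onto a regular expression. The sketch below handles only the constructs that occur in the examples above; other parts of the PROSITE syntax are not covered, and the function name and the test peptide are illustrative assumptions.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE PA pattern into a Python regular expression string."""
    pattern = pattern.strip().rstrip(".")          # drop the terminating period
    parts = []
    for element in pattern.split("-"):
        # Separate the residue part from an optional repeat such as (4) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":                            # x = any residue
            core = "."
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST].")
zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
print(p_loop)
print(zinc_finger)

# Illustrative peptide containing a P-loop-like stretch.
hit = re.search(p_loop, "MTEYKLVVVGAGGVGKSALT")
print(hit.group(0), hit.start() + 1)
```

The short patterns match very often by chance, which is why the NR lines of a PROSITE entry report false positives alongside true hits.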

PROSITE documentation

{PDOC00004}

{PS00004; CAMP_PHOSPHO_SITE}

{BEGIN}

****************************************************************

* cAMP- and cGMP-dependent protein kinase phosphorylation site *

****************************************************************

There has been a number of studies relative to the specificity of cAMP- and

cGMP-dependent protein kinases [1,2,3]. Both types of kinases appear to share

a preference for the phosphorylation of serine or threonine residues found

close to at least two consecutive N-terminal basic residues. It is important

to note that there are quite a number of exceptions to this rule.

-Consensus pattern: [RK](2)-x-[ST]

[S or T is the phosphorylation site]

-Last update: June 1988 / First entry.

[ 1] Fremisco J.R., Glass D.B., Krebs E.G.

J. Biol. Chem. 255:4240-4245(1980).

[ 2] Glass D.B., Smith S.B.

J. Biol. Chem. 258:14797-14803(1983).

[ 3] Glass D.B., El-Maghrabi M.R., Pilkis S.J.

J. Biol. Chem. 261:2987-2993(1986).

{END}

Enzyme

www.bis.med.jhmi.edu/Dan/proteins/ec-enzyme.html

The current release has 3650 entries and was indexed 01-Mar-1998.

Description

The ENZYME data bank contains the following data for each type of

characterized enzyme for which an EC number has been provided: EC

number, Recommended name, Alternative names, Catalytic activity,

Cofactors, Pointers to the SWISS-PROT entries that correspond to the

enzyme, Pointers to disease(s) associated with a deficiency of the enzyme.

Literature

Bairoch A. (1993) The ENZYME data bank. Nucleic Acids Res, Jul

1;21(13):3155-6.

Taxonomy database

www3.ncbi.nlm.nih.gov/Taxonomy/tax.html

This is the top level of the taxonomy database maintained by

NCBI/GenBank. You can explore any of the taxa listed below by clicking it.

Archaea

Eubacteria

Eukaryotae

Viroids

Viruses

Other

Unclassified

This is a searchable index. You can enter the name of superspecific taxa (e.g., Porifera) or

the name of a particular organism (e.g., Thalarctos maritimus for the polar bear or polar

bear itself).

Query:

Use query string as: Complete match, Regular expression, or Set of tokens

The "Set of tokens" option returns longer names that include the search terms, e.g., hybrid taxa.

See what happens if you query "Bos taurus" using the "Complete match" option versus the "Set of

tokens" option.
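
The difference between the three query modes can be sketched with a few lines of code. This is only an illustration of the idea, assuming case-insensitive matching; it says nothing about how the NCBI server is actually implemented, and the short name list is a hypothetical stand-in for the taxonomy database.

```python
import re

# Hypothetical stand-in for the list of names in the taxonomy database.
NAMES = [
    "Bos taurus",
    "Bos indicus",
    "Bos taurus x Bos indicus",
    "Thalarctos maritimus",
]


def complete_match(query, names):
    """The whole name must equal the query."""
    return [name for name in names if name.lower() == query.lower()]


def regular_expression(query, names):
    """The query is a regular expression searched for in each name."""
    compiled = re.compile(query, re.IGNORECASE)
    return [name for name in names if compiled.search(name)]


def set_of_tokens(query, names):
    """Every word of the query must occur among the words of the name."""
    tokens = set(query.lower().split())
    return [name for name in names if tokens <= set(name.lower().split())]


print(complete_match("Bos taurus", NAMES))
print(set_of_tokens("Bos taurus", NAMES))
print(regular_expression("^Bos", NAMES))
```

With these names, "Complete match" returns only the species itself, whereas "Set of tokens" also returns the longer hybrid name.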

These are direct links to some of the organisms most commonly used in

molecular research projects:

Arabidopsis thaliana

Caenorhabditis elegans

Danio rerio (zebrafish)

Drosophila

Escherichia coli

Hepatitis C virus

Homo sapiens

Mus musculus

Mycoplasma

Oryza sativa

Plasmodium falciparum

Pneumocystis carinii

Rattus

Saccharomyces cerevisiae

Schizosaccharomyces pombe

Xenopus laevis

Molecular biology databases – text searches and retrieval of data.

With Internet tools such as

- Entrez

- Sequence Retrieval System (SRS)

you can search the annotation section of molecular biology databases using one or more words and by selecting specific fields in the database. For instance, to find all insulin proteins in a protein database, you can search the Entrez protein database, using "insulin" as the search word and selecting "All fields" or "Protein Name".
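
A fielded text search of this kind can also be written down as a query string. The sketch below only builds a search URL and does not contact any server. It assumes the NCBI E-utilities ESearch service, which is not described in this handout, and the helper name and default values are illustrative.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI E-utilities ESearch service (not part of the handout).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_search_url(database, word, field=None, retmax=20):
    """Build a search URL; a field name in square brackets restricts the search."""
    term = f"{word}[{field}]" if field else word
    query = urlencode({"db": database, "term": term, "retmax": retmax})
    return f"{ESEARCH}?{query}"


print(entrez_search_url("protein", "insulin", field="Protein Name"))
print(entrez_search_url("protein", "insulin"))
```

Leaving out the field corresponds to searching all fields, which usually returns many more entries than a search restricted to the protein name.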

Tutorials for Entrez and SRS

http://www3.ncbi.nlm.nih.gov/Entrez/entrezhelp.html

http://srs.ebi.ac.uk:5000/srs5/man/srsman.html
Fields that may be specified in the nucleotide, protein and Pubmed databases of Entrez:

SRS servers

WEHI, Melbourne, Australia

Belgian EMBnet Node (BEN), Brussels, Belgium

IBMM-DBM, Université Libre, Brussels, Belgium

DBBM-IOC, Fiocruz, Rio de Janeiro, Brazil

The Genome Mine, Base4 Bioinformatics, Canada

CBI EMBnet Node, University of Beijing, China

CSC, Otaniemi, Espoo, Finland

INFOBIOGEN, Villejuif, France

Institut Pasteur, Paris, France

DKFZ, Heidelberg, Germany

EMBL, Heidelberg, Germany www.embl-heidelberg.de:80/srs5/

GBF, Braunschweig, Germany

INCBI EMBnet Node, Dublin, Ireland

Weizmann Institute BCD, Rehovot, Israel

IVR, Kyoto University, Japan

Biotek EMBnet Node, Oslo, Norway

BIC, National University Hospital, Singapore

Biomedical Centre (BMC), Uppsala, Sweden

ExPASy, Geneva, Switzerland

CAOS/CAMM Center, Nijmegen, The Netherlands

Adlib, CAB International, Wallingford, UK

EMBL-EBI, Hinxton, Cambridge, UK srs.ebi.ac.uk:5000/srs5/

HGMP-RC, Hinxton, Cambridge, UK

MBDC Oxford, Oxford University, UK

SEQNET EMBnet Node, Daresbury, UK

Sanger Centre, Hinxton, Cambridge, UK

IUBio, Indiana University, USA

SRSWWW

(http://srs.ebi.ac.uk:5000/srs5/man/srsman.html)

Introduction

SRSWWW is a World Wide Web interface to the Sequence Retrieval System (SRS). SRSWWW has more users than other SRS interfaces because web browsers are in widespread use and because it is easy to handle and user friendly. It supports HTML3 and earlier versions of HTML. Since HTTP is stateless, a user ID is created to keep the state for the whole session.

An SRS session is started by clicking the 'Start' button on the SRS home page. The next page to appear is the 'Select Libraries Page', also referred to as the 'Top Page'.

Most SRSWWW pages have six header buttons that link to other SRSWWW pages. One header button is 'Top Page', which links to the start page of SRSWWW. The other buttons point to the 'Query Form' page, the 'Query Manager' page, the 'View Manager' page, the 'Databanks' page and the 'Help' page.

SRSWWW	The Top Page
The 'Top Page' is used to select one or more databanks to search. Databanks are selected by clicking their check boxes. Each databank name is linked to an information page (the 'Information about ...' page), which describes the indices of that databank. The data fields of the databank can be browsed directly (see the 'Browse Index' page). To change databanks for another search, return to the 'Top Page'.
Continue & Reset Button	The 'Reset' button deselects all selections made so far. An individual databank can be deselected by clicking its check box again. Clicking the 'Continue' button completes the 'Top Page' and brings up the 'Query Form' page.

Grouping the Databanks	On the 'Top Page', all available databanks are collected in groups defined by SRS.
User defined Databases	User-defined databanks can also be included. For example, special databanks can be created with the search engines FASTA or BLAST. Such a databank must first be created by running a search (for instance with BLAST) before it can be selected. An existing sequence can be inserted into the 'Enter query sequence' field by cut and paste.

SRSWWW	The Query Form Page
The 'Query Form' page is used to define the databank query. The selected databanks are listed at the top of the page. Which fields are listed depends on the 'Show only fields that selected databanks have in common' check box on the 'Top Page' and on the searchable fields of the databanks.
Do Query & Reset Button	The 'Reset' button deselects all selections made so far. An individual selection can be deselected by clicking its check box again. Clicking the 'Do Query' button starts the query and brings up the 'Query Result' page.
FIELD NAME, QUERY, INCLUDE IN LIST & RETRIEVE SUBENTRIES columns	The 'Field Name' selector lists all data fields of the selected databank; each is linked to an information page. Clicking "Info" after selecting a 'Field Name' opens the 'Browse Index' page. The 'Query' column is the input field for the query. The operators "&" (= AND), "!" (= NOT) and "|" (= OR) allow more than one search expression in each 'Query' input field. The 'Include fields in output' column is a selection list for including the corresponding field in the query result.
Additional Menus	Some fields (e.g., SeqLength) have additional menus with "greater than", "greater than or equal to", "less than" and "less than or equal to" symbols for selecting a range (e.g., a SeqLength between 100 and 500).
Combine Searches with	If more than one data field is to be searched, the 'Query' input fields can be combined with the boolean operators 'AND', 'OR' and 'NOT'. Optionally, a wildcard is appended to every search string to increase the chance of finding a match.
Chunk Size & Use view	Additional options are 'Entry list in chunks of', which defines the number of entries shown per page, and 'Use view', which selects a predefined view for displaying the query result (see the 'Create a new view' page for more information on 'Views'). A new view can be defined before doing a query. Only the 'Views' applicable to the selected databanks are listed in the 'Use view' box, and one of them can be selected. The default is 'Names Only' (i.e. without a special 'View').

SRSWWW	Browse Index Page
From the 'Query Form' page, the 'Browse Index' page is opened by clicking on a 'Field Name' (e.g., the 'ID' field).
List Values Button	Submits the search string typed in the input box, which is pre-filled with a wildcard (the wildcard can be kept in the string or deleted). The matches (referred to as "values") of the search string in the indicated 'Field Name' are listed on the 'Values in the index of the ... field' page. 'Values' can be selected in this list, and the 'Make Query' button immediately starts a query with the selected 'Value(s)'.

SRSWWW	The Query Result Page
The 'Query Result' page shows the result of a submitted query. It shows the query string, the number of "hits", and the entries found, each with a check box and any included fields (if selected on the 'Query Form' page or in the view definition). The entries found are listed by databank and entry ID. Clicking on an entry opens the 'Single Entry' page. Entries can also be selected with their check boxes. The 'Reset' button deselects all selections made so far. The 'Link', 'Receive' and 'View' buttons perform further operations on the selected entries.
Link Button	Checks whether all or the selected result entries are linked to other databanks (see the 'Link Page' for more information).
Save Button	With the 'Receive' button the selected entries can be downloaded.
Launch View Button	Launches the desired application for further analysis. With this button the result entries can be viewed using predefined 'View' definitions, which are selected from the 'View' box.
Chunk Hypertext Links	The last line contains hypertext links to any additional chunk sets. These links are present only if the number of entries found is larger than the defined chunk size. Chunk sets are numbered from 1 to "number of hits" divided by "entry list chunk size" (selected on the 'Query Form' page). The user can jump to any other chunk set via its hypertext link. The current chunk set is shown within braces. The left angle bracket hyperlink moves to the first chunk set, whereas the right angle bracket hyperlink moves to the last chunk set.
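
The chunk arithmetic can be made explicit with a small sketch. It assumes that the division of "number of hits" by "entry list chunk size" is rounded up, so that a last, partly filled chunk set is also counted; the function name is illustrative.

```python
import math


def chunk_sets(number_of_hits, chunk_size):
    """Return the (first, last) entry number of every chunk set, counting from 1."""
    number_of_sets = math.ceil(number_of_hits / chunk_size)
    return [
        (k * chunk_size + 1, min((k + 1) * chunk_size, number_of_hits))
        for k in range(number_of_sets)
    ]


# 95 hits shown 30 per page give four chunk sets, the last one with 5 entries.
print(chunk_sets(95, 30))
```
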

SRSWWW	Single Entry Page
Every entry listed on the 'Query Result' page has a link to the complete entry in the databank. Clicking on an entry brings up the 'Single Entry' page, which shows the complete entry. The complete entry can be saved or printed (refer to the documentation of your web browser).

SRSWWW	Link Page
From the 'Query Result' page, pressing the 'Link' button opens the 'Link' page. The 'Link' page is used to find links between the entries found (the set) and the selected databanks. It lists all the databanks, like the 'Top Page'. The user should select databanks other than those selected for the query and choose one of the three options of the 'Find all entries' field. Clicking the 'Continue' button then shows the 'Query Result' page. A link can also be made on the 'Query Manager' page. (For more information about links, refer to the SRS manual.)
Find all entries:
in the selected DATABANKS which are linked to the current set	the result is the entries from the selected databanks that are linked to the set.
in the current SET which are linked to all selected databanks	the result is the entries from the set that have links to the selected databanks.
in the current set which are not linked to any of the selected databanks	the result is the entries from the set that contain NO links to the selected databanks.
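
The three 'Find all entries' options can be pictured as set operations on cross-references. The sketch below is an illustration only and not how SRS is implemented. The links of PRIO_HUMAN are taken from the DR lines of the SWISS-PROT entry shown earlier; the second entry and all function and variable names are hypothetical.

```python
def link_options(current_set, links, selected):
    """Return the results of the three options as three sets.

    current_set: entry names in the current result set
    links:       entry name -> set of (databank, identifier) cross-references
    selected:    names of the databanks ticked on the 'Link' page
    """
    def targets(entry):
        # Cross-references of one entry that point into a selected databank.
        return {ref for ref in links.get(entry, set()) if ref[0] in selected}

    # 1. Entries in the selected databanks that are linked to the current set.
    in_databanks = set().union(*(targets(entry) for entry in current_set))
    # 2. Entries in the current set that are linked to all selected databanks.
    linked_to_all = {
        entry for entry in current_set
        if selected <= {databank for databank, _ in targets(entry)}
    }
    # 3. Entries in the current set that are not linked to any selected databank.
    not_linked = {entry for entry in current_set if not targets(entry)}
    return in_databanks, linked_to_all, not_linked


LINKS = {
    "PRIO_HUMAN": {("EMBL", "M13667"), ("PIR", "A05017"),
                   ("MIM", "176640"), ("PROSITE", "PS00291")},
    "HYPOTHETICAL_ENTRY": set(),          # made-up entry without cross-references
}

print(link_options({"PRIO_HUMAN", "HYPOTHETICAL_ENTRY"}, LINKS, {"MIM", "PROSITE"}))
```
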

SRSWWW	Query Manager Page
The 'Query Manager' has two functions. The first is to store the queries done so far, which are listed in a table. Every query is listed with a check box, the query name in the form 'Qn' (e.g., Q1, Q2, ...), a type (e.g., 'query', 'link'), the total number of entries found, the library name(s), the number of entries for each library, the query expression (in SRS query syntax) and a comment. The user may add a comment to the query. Users working in HTML3 mode get a table with all these descriptors; in versions below HTML3 the same information is available in a different style and order. The second function of the 'Query Manager' is to make further queries and links. There are three functions to control the queries and three functions to do further queries and links.
Control Functions
Save	With the 'Save' function the selected queries can be downloaded.
Delete	The 'Delete' function deletes selected queries.
View	The 'View' function can be used to inspect a single query again or with a different 'View' (choose a 'View' from the 'Using the view' field).
Query/Link Functions
Link	The 'Link' function is exactly the same as in the 'Query Result' page, see detailed information in the chapter 'The Query Result Page'.
Combine	With this function it is possible to combine one or more queries with the logical operators AND, OR or NOT.
Expression	With this function the user can directly enter a query like "(Q1 & Q2) ! (Q3 | Q4) PDB" (i.e. query 1 AND query 2 but NOT query 3 OR query 4 linked to the PDB database). In contrast to the 'Combine' function it allows more complex queries to be built. For more information on using logical operators and link operators see the SRS manual.
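The logic behind the 'Combine' and 'Expression' functions can be pictured with ordinary set operations. The sketch below is Python, not SRS syntax, and the entry names (E1, E2, ..) are invented for illustration. It only assumes that '&', '|' and '!' mean AND, OR and BUT NOT as described above; the link to another databank is left out.

```python
# Hypothetical query results: each stored query is a set of entry names.
q1 = {"E1", "E2", "E3", "E4"}
q2 = {"E1", "E3", "E5"}
q3 = {"E3"}
q4 = {"E4", "E6"}

# (Q1 & Q2) ! (Q3 | Q4): entries found by both Q1 and Q2,
# but by neither Q3 nor Q4.
result = (q1 & q2) - (q3 | q4)
print(sorted(result))
```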
Result Display Options
The 'Query Manager' offers two options to manipulate the 'Query Result' page:
Entry list in chunks of... & Using the view...	The options 'Entry list in chunks of ' and 'Using the view' are already described in the chapter about the 'Query Form' page.

The queries done so far will be lost after closing the SRS session. However, it is sometimes quite useful (especially in the case of complex query runs) to store a query set in a file (referred to as "history") so that it can be loaded and used again in a new SRS session. This is done by activating one or more checkboxes on the 'Save in histories queries of type' line at the bottom of the page.

SRSWWW	View Manager Page

The extraction of information from the databank(s) is done with a query created in the 'Query Form' page. In the 'View Manager' you can define how to look at the query results. You can use predefined 'Views' or create your own 'Views'. You can apply them to newly created queries or to already existing queries. A set of 'Views' can be defined before any query has been performed, but a 'View' can of course only be used after a query has been created.

The 'View Manager' can do more than provide different displays of query sets as described above. It works independently of the 'Query Form' routine. Thus, you can start complex queries within the 'View Manager' to investigate different relations between databanks and/or query sets.

The following options are offered in the 'View Manager' page: 'Create a new view', 'Edit selected view' or 'Delete selected view'. The 'View Manager' lists all the 'Views' defined so far in a table with View Name, Format, Root Libs, Root Fields, Leaf Libs and Leaf Fields. After doing a query on the selected database(s) the user gets a set of entries as a result. 'Views' can help to recognize whether these entries may have e.g., links to other databases. 'Views' can be configured to see the fields of interest from the set and from the linked database entries. There exist already predefined 'Views' which can be used directly or modified for your own purpose.

independent View Manager

Click on the link 'independent View Editor' with the middle or right mouse button and choose 'Open in New Window' in the context menu that appears. A new browser window is opened. In this way one can define several 'Views' at a time, or edit a 'View' in one window and use it at once in the 'Query Result' page in the second window.

SRSWWW	Create A New View Page
In the 'Create A New View Page' the available databanks are listed in two columns. The databanks in the first column are called root databanks. The databanks in the second column are called leaf databanks. The root databanks are the databanks used for a particular query search. The leaf databanks are the databanks to which the result set of that query search will be linked. For example, the user can define a view with SwissProt as root databank and Prosite as leaf databank. In the 'View Select' page the fields to be displayed from both databanks have to be selected. Choose this as the current view and make a query on SwissProt (which is the root databank). The result will be SwissProt entries and, for entries with a link to Prosite, the Prosite entries.
Name of new view	In the 'Name of view' field you can assign a name to the new 'View'. If this field is left blank, a 'View' name is created automatically from the first root databank name followed by a view number (e.g., Swissprot_view1, Prosite_view5 etc.).
Display view as table	Selecting the 'Display view as table' checkbox displays the result as a table instead of a list.
Print short field names in header	With this checkbox it is possible to place the short field names as the header of the table columns.
Show only fields common to selected root databanks	This checkbox is selected by default. So only common fields of the root databanks are listed.
Continue & Reset Button	The 'Reset' button can be used to deselect all selections done so far. Each field and checkbox can individually be deselected by clicking again. Clicking on the 'Continue' button finishes the 'Create new view' page and brings up the 'View Field Select' page.

SRSWWW	Edit Existing View Page
In this page a predefined 'View' can be edited. The 'Edit existing view' page offers the same editing options as the 'Create a new view' page.

SRSWWW	View Field Select Page
Lists the fields of all selected root and leaf databanks. Fields can be selected by their checkboxes and are displayed in the query. The display format of the sequence field can be selected by the format menu selection (e.g., FASTA, PIR, EMBL, Protein Chart etc.). The leaf databank selection has additional options:
Use query instead link	A query string can be inserted to search for special relations between the root and leaf databank.
Use view to display entries	The result of the 'View' can be shown by another 'View'.
Display only number of linked entries	Displays only the number of links which exist, not the links themselves.
To submit the view one of the four options ('Top page', 'Query form', 'Query manager' or 'View manager') must be clicked.

Databanks Page

The 'Databanks' page lists all available databases in a table form. Every databank is listed as a hypertext link and shown with the release version number, the number of entries, the indexing date, the group (e.g., HomolSearch, Sequence etc.) and an availability flag. This flag indicates whether the respective databank is available (ok) or not for searches.

Help Page

Most of the information about the databanks and their field descriptions is available through hypertext links. Help is provided wherever possible in a context-sensitive way. In addition to providing information, the data fields can be browsed directly (see the 'Browse Index' page).

Principles of sequence comparison

A common problem in molecular biology is to compare one sequence to another sequence or to a set of sequences. Previously we have searched databases for text information and retrieved entries. Sequence comparison, however, is a less trivial problem and in order to understand it a few theoretical considerations are necessary.

Here we will consider four kinds of sequence comparison:

- Homology

- Pattern

- Multiple sequence alignment

- Profile

1. Homology

FASTA and BLAST are programs frequently used to search sequence databases for homology to a search sequence. Programs of this kind answer questions like: Is my sequence similar to anything in the database? Did I finally clone the gene that I’ve been trying to clone for the last three years? I seem to have identified a new protein; exactly what is the relationship of this protein to proteins that have been described previously?

One principle to be kept in mind is that sequences are often searched as proteins although they are determined as DNA sequences. If you have a DNA sequence you may find it natural to search a DNA sequence database with this sequence. However, it is often more sensible to use a protein sequence as search sequence. That is, if you think your DNA sequence encodes a protein you should first try to identify all the possible translation products of the DNA. Then use the protein sequences to search a protein database (or a DNA database where all possible reading frames are examined, a "translated DNA database"). Therefore, a common principle is that sequences are determined as DNA but compared as proteins. The reasons for this are mainly:

- The genetic code is degenerate. Consider for instance the two codons UCA and AGU. Both encode serine but are completely unrelated with respect to their sequence. Therefore, a sequence homology may be significant at the protein sequence level, but not detectable at the DNA level.

- Not all amino acid substitutions occur with the same frequency, as further described below (substitution matrices). This is a useful principle when comparing proteins and there is no equivalent principle for nucleic acid sequences.

It is important to search nucleotide databases with proteins as query sequences because protein databases are not updated quickly enough. Furthermore, DNA databases contain potential protein coding regions that have not yet been identified. The only drawback of searching DNA databases with a protein sequence as query is that it is more time-consuming, as all six reading frames have to be examined. TFASTA and tblastn are programs that do this sort of analysis. An even more time-consuming type of search is one where a DNA sequence is used as query: all six possible reading frames of the query are used to search the six reading frames of a DNA database (tblastx is one example of such a program, see below).
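What "examining all six reading frames" means can be sketched in a few lines of Python. This is only an illustration, not the code of TFASTA or tblastn; the function names are made up here, the standard genetic code is assumed, and codons containing anything other than A, C, G, T/U are translated as "X".

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: the 64 codons in TCAG order paired with their
# one-letter amino acid codes; "*" marks a stop codon.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    dna = dna.upper().replace("U", "T")
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; an incomplete last codon is dropped."""
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Translate three frames on each strand; keys are '+1'..'+3', '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# The two serine codons mentioned above share no base at any position.
print(translate("UCA"), translate("AGU"))
print(six_frame_translation("ATGGCC"))
```

A search program of the tblastn type would compare the protein query with each of the six translations of every database sequence, which is why such a search takes longer.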

To understand how programs like FASTA and BLAST work let’s compare two related amino acid sequences:

seq 1 A L Q L I W G T S I R D K W P G D L

seq 2 A L Q M I W A G G T S I R D P G D L

We can represent them graphically like this:

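SEQ 2

A * . . . . . . . . . . . . . . . . .
L . * . * . . . . . . . . . . . . . *
Q . . * . . . . . . . . . . . . . . .
M . . . . . . . . . . . . . . . . . .
I . . . . * . . . . * . . . . . . . .
W . . . . . * . . . . . . . * . . . .
A * . . . . . . . . . . . . . . . . .
G . . . . . . * . . . . . . . . * . .
G . . . . . . * . . . . . . . . * . .
T . . . . . . . * . . . . . . . . . .
S . . . . . . . . * . . . . . . . . .
I . . . . * . . . . * . . . . . . . .
R . . . . . . . . . . * . . . . . . .
D . . . . . . . . . . . * . . . . * .
P . . . . . . . . . . . . . . * . . .
G . . . . . . * . . . . . . . . * . .
D . . . . . . . . . . . * . . . . * .
L . * . * . . . . . . . . . . . . . *
  A L Q L I W G T S I R D K W P G D L   SEQ 1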

All positions of amino acid identity are labeled by a "*" (the GCG program COMPARE in combination with DOTPLOT produces such a plot). Visual inspection reveals a number of diagonals. Many computer programs designed to compare sequences identify such diagonals and try to combine them into a more extensive alignment. In our example:

seq 1 A L Q L I W - - G T S I R D K W P G D L
      | | |   | |     | | | | | |     | | | |
seq 2 A L Q M I W A G G T S I R D - - P G D L

Thus, for a maximal alignment we have created gaps in each sequence.
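The dot plot and its diagonals are easy to reproduce. The following Python sketch is not the GCG COMPARE/DOTPLOT code; the function names and the minimum run length of 3 are choices made here for illustration. It marks identities between the two example sequences and lists the uninterrupted diagonal runs.

```python
SEQ1 = "ALQLIWGTSIRDKWPGDL"
SEQ2 = "ALQMIWAGGTSIRDPGDL"


def dot_matrix(seq_x, seq_y):
    """One row per residue of seq_y; True where the residues are identical."""
    return [[a == b for a in seq_x] for b in seq_y]


def diagonal_runs(seq_x, seq_y, min_length=3):
    """Return (offset, start in seq_x, matching segment) for each diagonal run.

    offset is the position in seq_y minus the position in seq_x.
    """
    runs = []
    for offset in range(-(len(seq_x) - 1), len(seq_y)):
        i = max(0, -offset)
        j = i + offset
        start, length = 0, 0
        while i < len(seq_x) and j < len(seq_y):
            if seq_x[i] == seq_y[j]:
                if length == 0:
                    start = i
                length += 1
            else:
                if length >= min_length:
                    runs.append((offset, start, seq_x[start:start + length]))
                length = 0
            i += 1
            j += 1
        if length >= min_length:
            runs.append((offset, start, seq_x[start:start + length]))
    return runs


for residue, row in zip(SEQ2, dot_matrix(SEQ1, SEQ2)):
    print(residue, " ".join("*" if hit else "." for hit in row))
print(" ", " ".join(SEQ1))
print(diagonal_runs(SEQ1, SEQ2))
```

For the example sequences the runs are ALQ and PGDL on the main diagonal and GTSIRD on a diagonal shifted by two positions. Joining them is what forces the gaps in the alignment above.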

A program like FASTA searches a database for sequences homologous to a query sequence using the principle outlined above. FASTA uses parameters such as:

ktup or wordsize

- Length of initial peptide match, default is 2, i.e. the program starts identifying a diagonal by extending a dipeptide match. You may prefer to use ktup=1 if you want a more sensitive search. (This type of search takes a longer time, though!)
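The role of ktup can be illustrated with a small Python sketch. It is a simplification and not the FASTA implementation: it only counts, for each diagonal, how many words of length ktup are shared between two sequences. The function name and the use of a dictionary as a word index are assumptions made here.

```python
from collections import Counter, defaultdict


def diagonal_word_hits(query, subject, ktup=2):
    """Count shared words of length ktup per diagonal.

    A diagonal is identified by (position in subject) - (position in query).
    """
    # Index every ktup-long word of the query by its start position.
    word_positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        word_positions[query[i:i + ktup]].append(i)

    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in word_positions.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return hits


seq1 = "ALQLIWGTSIRDKWPGDL"
seq2 = "ALQMIWAGGTSIRDPGDL"
print(diagonal_word_hits(seq1, seq2, ktup=2))
print(diagonal_word_hits(seq1, seq2, ktup=1))
```

With ktup=2 only the two diagonals that carry real similarity receive hits. With ktup=1 every single identity counts, so many more diagonals have to be considered, which is why such a search is more sensitive but slower.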

Gap penalty

Insertions and deletions are relatively rare in evolution, and a database search program is slow when you instruct it to examine every possible diagonal. For these reasons a penalty is assigned to gaps. The following two parameters are frequently used in homology programs (such as the GCG programs GAP, BESTFIT and FASTA):

- Gap weight (Number of gaps)

- Gap length weight (Total length of gaps)
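How the two gap parameters enter an alignment score can be shown with a short Python sketch. The numbers used (2 for a match, -1 for a mismatch, gap weight 4, gap length weight 1) are arbitrary values chosen for illustration, not the defaults of any GCG program, and the function name is made up.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1,
                    gap_weight=4, gap_length_weight=1):
    """Score two aligned rows of equal length in which '-' marks a gap.

    Every gap costs gap_weight once, plus gap_length_weight per gap position.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    open_gap = None  # which row the currently open gap is in, if any
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            if gap_row != open_gap:
                score -= gap_weight      # a new gap is opened
                open_gap = gap_row
            score -= gap_length_weight   # every gap position adds to the length
        else:
            open_gap = None
            score += match if a == b else mismatch
    return score


gapped = score_alignment("ALQLIW--GTSIRDKWPGDL", "ALQMIWAGGTSIRD--PGDL")
ungapped = score_alignment("ALQLIWGTSIRDKWPGDL", "ALQMIWAGGTSIRDPGDL")
print(gapped, ungapped)
```

With these values the gapped alignment from the example scores higher than the ungapped comparison, so the gaps are accepted. With a much larger gap weight the ungapped comparison would win.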

Substitution matrices

What is taken into account in the figure above is only whether there is an exact match or not. However, we also have to take into consideration whether some matches or mismatches are better than others. To decide on this we can assign a score to each amino acid pair, with a high score for matches or mismatches that occur frequently in proteins and a low score for mismatches that seldom occur. To do this we make use of a table (substitution matrix) that looks something like this:

Substitution matrix (PAM250)

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0   0   0   0  -8
R       6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1   2  -4  -2  -1   0  -1  -8
N           2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2   2   1   0  -8
D               4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2   3   3  -1  -8
C                  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2  -4  -5  -3  -8
Q                       4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2   1   3  -1  -8
E                           4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2   3   3  -1  -8
G                               5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1   0   0  -1  -8
H                                   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2   1   2  -1  -8
I                                       5   2  -2   2   1  -2  -1   0  -5  -1   4  -2  -2  -1  -8
L                                           6  -3   4   2  -3  -3  -2  -2  -1   2  -3  -3  -1  -8
K                                               5   0  -5  -1   0   0  -3  -4  -2   1   0  -1  -8
M                                                   6   0  -2  -2  -1  -4  -2   2  -2  -2  -1  -8
F                                                       9  -5  -3  -3   0   7  -1  -4  -5  -2  -8
P                                                           6   1   0  -6  -5  -1  -1   0  -1  -8
S                                                               2   1  -2  -3  -1   0   0   0  -8
T                                                                   3  -5  -3   0   0  -1   0  -8
W                                                                      17   0  -6  -5  -6  -4  -8
Y                                                                          10  -2  -3  -4  -2  -8
V                                                                               4  -2  -2  -1  -8
B                                                                                   3   2  -1  -8
Z                                                                                       3  -1  -8
X                                                                                          -1  -8
*                                                                                               1

The table was constructed from alignments of proteins, such as cytochrome c from a large number of organisms. For each amino acid position in such an alignment a probability is calculated for all the possible amino acid replacements, for instance the probability that serine replaces alanine. In nature such a replacement is the result of one or more mutations at the DNA level. The table of probabilities was normalized so that on average 1 mutation is accepted for every 100 amino acids (1 accepted point mutation per 100 residues = PAM1). The PAM1 table was calculated by comparing closely related sequences. To estimate more distant relationships it was extrapolated to the PAM250 table. Essentials: Positive values reflect conservative changes; negative values reflect changes that are unlikely to occur. W and C are amino acids that are particularly conserved.

PAM tables have been widely used. Recently however, BLOSUM (Blocks substitution matrices) tables have become more popular as they often seem to give a somewhat better result. BLOSUM tables were generated by comparing short conserved regions (BLOCKS) of proteins. BLOSUM matrices seem to favour hydrophilic matches and aromatic mismatches.
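The use of a substitution matrix amounts to a table lookup for every aligned pair. The Python sketch below uses only a handful of values copied from the PAM250 table above; the dictionary layout and the function names are choices made here for illustration.

```python
# A few PAM250 scores copied from the table above. Only one order of each
# pair is stored because the matrix is symmetric.
PAM250_SUBSET = {
    ("A", "A"): 2, ("L", "L"): 6, ("Q", "Q"): 4, ("I", "I"): 5,
    ("W", "W"): 17, ("C", "C"): 12, ("L", "M"): 4, ("A", "G"): 1,
    ("C", "W"): -8,
}


def pair_score(a, b, matrix=PAM250_SUBSET):
    """Look up the score of an amino acid pair in either order."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def segment_score(seg_a, seg_b, matrix=PAM250_SUBSET):
    """Sum the pair scores of two ungapped segments of equal length."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seg_a, seg_b))


# First six residues of the two example sequences.
print(segment_score("ALQLIW", "ALQMIW"))
```

Counting identities only would treat the L/M pair as a plain mismatch. With PAM250 it contributes a positive score because leucine and methionine replace each other frequently, whereas a C/W pair would be penalized heavily.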

Low complexity masking

Filter out short repeats

For these parameters see BLAST below.

BLAST

Speed is increased compared to FASTA :

- Wordsize = 3-4

- No gaps are allowed.

BLAST is therefore less sensitive than FASTA and less suitable for comparing nucleotide sequences.

BLAST programs

blastp compares an amino acid query sequence against a protein sequence database;

blastn compares a nucleotide query sequence against a nucleotide sequence database;

blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database;

tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands);

tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

Some BLAST parameters:

EXPECT

The statistical significance threshold for reporting matches against database sequences; the default value is 10, such that 10 matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Fractional values are acceptable. (See parameter E in the BLAST Manual).

CUTOFF

Cutoff score for reporting high-scoring segment pairs. The default value is calculated from the EXPECT value (see above). HSPs are reported for a database sequence only if the statistical significance ascribed to them is at least as high as would be ascribed to a lone HSP having a score equal to the CUTOFF value. Higher CUTOFF values are more stringent, leading to fewer chance matches being reported. (See parameter S in the BLAST Manual). Typically, significance thresholds can be more intuitively managed using EXPECT.
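The connection between EXPECT and CUTOFF can be sketched with the relation usually quoted for Karlin-Altschul statistics, E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the score. In the Python sketch below the values of lambda and K are placeholders chosen for illustration (the real values depend on the scoring matrix and the sequence composition), and the function names are made up. The query and database sizes are taken from the example output further down.

```python
import math


def expect_value(score, query_len, db_len, lam, k):
    """Number of chance HSPs expected to reach at least this score."""
    return k * query_len * db_len * math.exp(-lam * score)


def cutoff_score(expect, query_len, db_len, lam, k):
    """Score at which the expected number of chance HSPs equals expect."""
    return math.log(k * query_len * db_len / expect) / lam


def p_value(expect):
    """Probability of finding at least one chance HSP, given E."""
    return 1.0 - math.exp(-expect)


LAMBDA, K = 0.3, 0.1           # placeholder statistical parameters
M, N = 232, 13_464_008         # query length and database size

s = cutoff_score(10.0, M, N, LAMBDA, K)
print(round(s, 1), expect_value(s, M, N, LAMBDA, K), p_value(0.001))
```

The sketch shows why the two parameters are interchangeable: a lower EXPECT corresponds to a higher CUTOFF, and for small values of E the probability P is nearly equal to E.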

MATRIX

Specify an alternate scoring matrix for BLASTP, BLASTX, TBLASTN and TBLASTX. The default matrix is BLOSUM62 (Henikoff & Henikoff, 1992). The valid alternative choices include: PAM40, PAM120, PAM250 and IDENTITY. No alternate scoring matrices are available for BLASTN; specifying the MATRIX directive in BLASTN requests returns an error response.

STRAND

Restrict a TBLASTN search to just the top or bottom strand of the database sequences; or restrict a BLASTN, BLASTX or TBLASTX search to just reading frames on the top or bottom strand of the query sequence.

FILTER

Mask off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton & Federhen (Computers and Chemistry, 1993), or segments consisting of short-periodicity internal repeats, as determined by the XNU program of Claverie & States (Computers and Chemistry, 1993), or, for BLASTN, by the DUST program of Tatusov and Lipman (in preparation). Filtering can eliminate statistically significant but biologically uninteresting reports from the blast output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.

Low complexity sequence found by a filter program is substituted using the letter "N" in nucleotide sequence (e.g., "NNNNNNNNNNNNN") and the letter "X" in protein sequences (e.g., "XXXXXXXXX"). Users may turn off filtering by using the "Filter" option on the "Advanced options for the BLAST server" page.

Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs.

It is not unusual for nothing at all to be masked by SEG, XNU, or both, when applied to sequences in SWISS-PROT, so filtering should not be expected to always yield an effect. Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect.
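What "low compositional complexity" means can be illustrated with a much simpler measure than the one SEG uses. The Python sketch below is not the SEG, XNU or DUST algorithm: it slides a window along a protein sequence, computes the Shannon entropy of the residue composition in each window and replaces the residues of low-entropy windows with "X". The window length, the threshold and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Entropy (in bits) of the residue composition of a segment."""
    total = len(segment)
    return -sum(count / total * math.log2(count / total)
                for count in Counter(segment).values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Mask every window whose composition entropy is below the threshold."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


# Invented test sequence with a proline-rich stretch in the middle.
example = "ACDEFGHIKLMN" + "P" * 12 + "QRSTVWYACDEF"
print(mask_low_complexity(example))
```

Residues next to the proline stretch are masked as well, because windows dominated by the repeat also fall below the threshold. A sequence with a varied composition is returned unchanged, which corresponds to the remark above that filtering often has no effect.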

Example of BLAST output:

BLASTP 1.4.6MP [13-Jun-94] [Build 13:58:36 Sep 22 1994]

Reference: Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers,

and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol.

215:403-10.

Query = pir|A01243|DXCH 232 Gene X protein - Chicken (fragment)

(232 letters)

Database: SWISS-PROT Release 29.0

38,303 sequences; 13,464,008 total letters.

Searching..................................................done

Observed Numbers of Database Sequences Satisfying

Various EXPECTation Thresholds (E parameter values)

Histogram units: = 31 Sequences : less than 31 sequences

EXPECTation Threshold

(E parameter)

V Observed Counts--

10000 4863 1861 |============================================================

6310 3002 782 |=========================

3980 2220 812 |==========================

2510 1408 303 |=========

1580 1105 393 |============

1000 712 179 |=====

631 533 161 |=====

398 372 80 |==

251 292 73 |==

158 219 50 |=

100 169 32 |=

63.1 137 18 |:

39.8 119 9 |:

25.1 110 6 |:

15.8 104 9 |:

Expect = 10.0, Observed = 95 <<<<<<<<<<<<<<<<

10.0 95 4 |:

6.31 91 3 |:

3.98 88 1 |:

2.51 87 3 |:

1.58 84 0 |

1.00 84 2 |:

Smallest

Sum

High Probability

Sequences producing High-scoring Segment Pairs: Score P(N) N

sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED) (... 1191 7.7e-160 1

sp|P01014|OVAY_CHICK GENE Y PROTEIN (OVALBUMIN-RELATED). 949 7.0e-127 1

sp|P01012|OVAL_CHICK OVALBUMIN (PLAKALBUMIN). 645 3.4e-100 2

sp|P19104|OVAL_COTJA OVALBUMIN. 626 1.2e-96 2

sp|P05619|ILEU_HORSE LEUKOCYTE ELASTASE INHIBITOR (LEI). 216 3.7e-71 3

sp|P80229|ILEU_PIG LEUKOCYTE ELASTASE INHIBITOR (LEI) (... 325 4.0e-71 2

sp|P29508|SCCA_HUMAN SQUAMOUS CELL CARCINOMA ANTIGEN (SCC... 439 3.5e-70 2

sp|P30740|ILEU_HUMAN LEUKOCYTE ELASTASE INHIBITOR (LEI) (... 211 1.3e-66 3

sp|P05120|PAI2_HUMAN PLASMINOGEN ACTIVATOR INHIBITOR-2, P... 176 1.8e-65 4

sp|P35237|PTI_HUMAN PLACENTAL THROMBIN INHIBITOR. 473 1.3e-61 1

sp|P29524|PAI2_RAT PLASMINOGEN ACTIVATOR INHIBITOR-2, T... 183 9.4e-61 4

sp|P12388|PAI2_MOUSE PLASMINOGEN ACTIVATOR INHIBITOR-2, M... 179 1.8e-60 4

sp|P36952|MASP_HUMAN MASPIN PRECURSOR. 198 2.6e-58 4

sp|P32261|ANT3_MOUSE ANTITHROMBIN-III PRECURSOR (ATIII). 142 4.0e-48 5

sp|P01008|ANT3_HUMAN ANTITHROMBIN-III PRECURSOR (ATIII). 122 7.5e-48 5

WARNING: Descriptions of 80 database sequences were not reported due to the

limiting value of parameter V = 15.

... alignments with the top 8 database sequences deleted ...

sp|P05120|PAI2_HUMAN PLASMINOGEN ACTIVATOR INHIBITOR-2, PLACENTAL (PAI-2)

(MONOCYTE ARG- SERPIN).

Length = 415

Score = 176 (80.2 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65

Identities = 38/89 (42%), Positives = 50/89 (56%)

Query: 1 QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNN 60

+I +LL S D DT +VLVNA+YFKG WKT F + PF V + PVQMM +

Sbjct: 180 KIPNLLPEGSVDGDTRMVLVNAVYFKGKWKTPFEKKLNGLYPFRVNSAQRTPVQMMYLRE 239

Query: 61 SFNVATLPAEKMKILELPFASGDLSMLVL 89

N+ + K +ILELP+A L+L

Sbjct: 240 KLNIGYIEDLKAQILELPYAGDVSMFLLL 268

Score = 165 (75.2 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65

Identities = 33/78 (42%), Positives = 47/78 (60%)

Query: 155 ANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFL 214

AN +G+S L +S+ H A ++++E+G E A TG + + QF ADHPFLFL

Sbjct: 338 ANFSGMSERNDLFLSEVFHQAMVDVNEEGTEAAAGTGGVMTGRTGHGGPQFVADHPFLFL 397

Query: 215 IKHNPTNTIVYFGRYWSP 232

I H T I++FGR+ SP

Sbjct: 398 IMHKITKCILFFGRFCSP 415

Score = 144 (65.6 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65

Identities = 26/62 (41%), Positives = 41/62 (66%)

Query: 90 LPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTD 149

+ D + LE +E I ++KL +WT+ + M + V+VY+PQ K+EE Y L S+L ++GM D

Sbjct: 272 IADVSTGLELLESEITYDKLNKWTSKDKMAEDEVEVYIPQFKLEEHYELRSILRSMGMED 331

Query: 150 LF 151

Sbjct: 332 AF 333

Score = 61 (27.8 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65

Identities = 10/17 (58%), Positives = 16/17 (94%)

Query: 81 SGDLSMLVLLPDEVSDL 97

+GD+SM +LLPDE++D+

Sbjct: 259 AGDVSMFLLLPDEIADV 275

WARNING: HSPs involving 86 database sequences were not reported due to the

limiting value of parameter B = 9.

FASTA

Two major parts of the program :

- Search in the database; the 100-200 database sequences that best match the query are saved.

- These sequences are carefully examined and the best possible alignment of each with the query sequence is determined. This step yields an optimized score, "opt". (FASTA optimizes only in the final step. However, it is possible to look for the optimal alignment of a query with every sequence in the database. This is referred to as a Smith and Waterman search.)
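The idea of an optimal local alignment can be sketched with a minimal Smith-Waterman calculation in Python. This is a textbook version with a simple scoring scheme (3 for a match, -3 for a mismatch, -2 per gap position) chosen here for illustration. It uses neither a substitution matrix nor separate gap and gap length weights, and it is not the code used by FASTA for its "opt" score.

```python
def smith_waterman_score(seq_a, seq_b, match=3, mismatch=-3, gap=-2):
    """Best local alignment score of two sequences by dynamic programming."""
    previous = [0] * (len(seq_b) + 1)
    best = 0
    for a in seq_a:
        current = [0]
        for j, b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            # A cell never drops below zero: this makes the alignment local.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(smith_waterman_score("TGTTACGG", "GGTTGACTA"))
```

Every cell of the table is filled in, so the running time grows with the product of the two sequence lengths. This is the reason why a full Smith and Waterman search of a database is much slower than FASTA or BLAST.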

If you use FASTA you will most likely find it slower than BLAST. An advantage with GCG FASTA is that you can search a subset of the EMBL or Swissprot databases and in this way reduce the time considerably.

TFASTA is a variation of the FASTA program that searches a nucleotide database with a protein query. Because all six reading frames have to be examined, TFASTA takes about six times longer to run than FASTA.

FASTA output

Histogram Key:

Each histogram symbol represents 128 search set sequences

Each inset symbol represents 1 search set sequences

Score Init1 Initn

(‑) (+)

< 2 5 5:=

4 0 0:

6 3 3:=

8 22 22:=

10 72 72:=

12 341 341:===

14 394 394:====

16 1304 1304:===========

18 1525 1525:============

20 4001 4001:================================

22 5729 5729:=============================================

24 6806 6806:======================================================

26 7626 7626:============================================================

28 6416 6416:===================================================

30 4848 4592:====================================‑‑

32 3710 3281:==========================‑‑‑

34 2457 2081:=================‑‑‑

36 1740 1441:============‑‑

38 1240 947:========‑‑

40 789 638:=====‑‑

42 517 483:====‑

44 292 496:===+

46 190 437:==++

48 124 387:=+++

50 81 316:=++

52 53 249:=+

54 39 182:=+

56 15 153:=+

58 13 119:=

60 8 91:=

62 20 74:=

64 2 36:=

66 4 33:=

68 1 21:=

70 1 15:=

72 0 20:+

74 0 10:+ :++++++++++

76 0 17:+ :+++++++++++++++++

78 1 5:= :=++++

80 0 2:+ :++

82 0 2:+ :++

84 0 3:+ :+++

86 0 4:+ :++++

88 0 1:+ :+

90 0 4:+ :++++

92 0 2:+ :++

94 0 1:+ :+

96 0 0: :

98 0 2:+ :++

100 0 0: :

>100 19 19:= :===================

The best scores are: init1 initn opt..

swissprotold:sr54_mycmy Q01442 mycoplasma mycoides. sign...2044 2044 2044

swnew:sr54_mycge P47294 mycoplasma genitalium. signal re... 697 773 1038

swissprotold:sr54_bacsu P37105 bacillus subtilis. signal... 477 705 1023

swissprotold:sr54_ecoli P07019 escherichia coli. signal ... 343 588 892

swissprotold:sr54_haein P44518 haemophilus influenzae. s... 341 584 888

swissprotold:sr5c_arath P37107 arabidopsis thaliana (mou... 470 518 895

swissprotold:sr54_yeast P20424 saccharomyces cerevisiae ... 370 435 593

swissprotold:sr54_canfa P13624 canis familiaris (dog). s... 335 413 641

swissprotold:sr54_mouse P14576 mus musculus (mouse). sig... 329 407 635

swissprotold:sr54_arath P37106 arabidopsis thaliana (mou... 343 379 607

swissprotold:sr54_schpo P21565 schizosaccharomyces pombe... 351 351 649

swissprotold:ftsy_haein P44870 haemophilus influenzae. c... 153 304 404

swnew:ftsy_mycge P47539 mycoplasma genitalium. cell divi... 151 283 321

swissprotold:pila_neigo P14929 neisseria gonorrhoeae. pr... 176 257 388

swissprotold:dock_sulso P27414 sulfolobus solfataricus, ... 254 254 458

swissprotold:srpr_yeast P32916 saccharomyces cerevisiae ... 155 224 223

swissprotold:ftsy_ecoli P10121 escherichia coli. cell di... 121 208 407

swissprotold:srpr_human P08240 homo sapiens (human). sig... 121 155 274

swissprotold:srpr_canfa P06625 canis familiaris (dog). s... 121 121 267

swissprotold:rest_human P30622 homo sapiens (human). res... 63 98 79

swissprotold:srmb_ecoli P21507 escherichia coli. atp‑dep... 61 98 97

swissprotold:gyrb_mycpn P22447 mycoplasma pneumoniae. dn... 50 93 60

swissprotold:flhf_bacsu Q01960 bacillus subtilis. flagel... 77 92 168

swissprotold:hs7c_caeel P27420 caenorhabditis elegans. h... 48 92 53

swissprotold:gr78_rat P06761 rattus norvegicus (rat). 78... 49 90 52

swissprotold:gr78_mesau P07823 mesocricetus auratus (gol... 49 90 52

swissprotold:gr78_human P11021 homo sapiens (human). 78 ... 49 90 52

swissprotold:gr78_mouse P20029 mus musculus (mouse). 78 ... 49 90 52

swissprotold:trj2_ecoli P05837 escherichia coli. traj pr... 47 88 54

swissprotold:rnha_human Q08211 homo sapiens (human). atp... 49 86 59

swissprotold:rrpl_hrsva P28887 human respiratory syncyti... 49 85 57

swissprotold:ycf1_tobac P12222 nicotiana tabacum (common... 57 85 77

swissprotold:dnak_mycpa Q00488 mycobacterium paratubercu... 53 85 75

swissprotold:vl2_hpv49 P36762 human papillomavirus type ... 37 84 55

swissprotold:dnak_mycle P19993 mycobacterium leprae. dna... 53 83 80

swissprotold:cse1_yeast P33307 saccharomyces cerevisiae ... 66 83 69

swissprotold:msap_plafw P04933 plasmodium falciparum (is... 45 82 77

swissprotold:pr16_yeast P15938 saccharomyces cerevisiae ... 45 81 56

swissprotold:utro_human P46939 homo sapiens (human). utr... 41 80 90

swissprotold:kar3_yeast P17119 saccharomyces cerevisiae ... 43 79 88

ID SR54_MYCMY STANDARD; PRT; 447 AA.

AC Q01442;

DT 01‑JUL‑1993 (REL. 26, CREATED)

DT 01‑JUL‑1993 (REL. 26, LAST SEQUENCE UPDATE)

DT 01‑FEB‑1995 (REL. 31, LAST ANNOTATION UPDATE)

DE SIGNAL RECOGNITION PARTICLE PROTEIN (FIFTY‑FOUR HOMOLOG). . . .

SCORES Init1: 2044 Initn: 2044 Opt: 2044

100.0% identity in 447 aa overlap

10 20 30 40 50 60

sr54_m MGFGDFLSKRMQKSIEKNMKNSTLNEENIKETLKEIRLSLLEADVNIEAAKEIINNVKQK

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

sr54_m MGFGDFLSKRMQKSIEKNMKNSTLNEENIKETLKEIRLSLLEADVNIEAAKEIINNVKQK

10 20 30 40 50 60

70 80 90 100 110 120

sr54_m ALGGYISEGASAHQQMIKIVHEELVNILGKENAPLDINKKPSVVMMVGLQGSGKTTTANK

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

sr54_m ALGGYISEGASAHQQMIKIVHEELVNILGKENAPLDINKKPSVVMMVGLQGSGKTTTANK

70 80 90 100 110 120

130 140 150 160 170 180

sr54_m LAYLLNKKNKKKVLLVGLDIYRPGAIEQLVQLGQKTNTQVFEKGKQDPVKTAEQALEYAK

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

sr54_m LAYLLNKKNKKKVLLVGLDIYRPGAIEQLVQLGQKTNTQVFEKGKQDPVKTAEQALEYAK

130 140 150 160 170 180

190 200 210 220 230 240

sr54_m ENNFDVVILDTAGRLQVDQVLMKELDNLKKKTSPNEILLVVDGMSGQEIINVTNEFNDKL

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

sr54_m ENNFDVVILDTAGRLQVDQVLMKELDNLKKKTSPNEILLVVDGMSGQEIINVTNEFNDKL

190 200 210 220 230 240

.......

sr54_mycmy

swnew:sr54_mycge

ID SR54_MYCGE STANDARD; PRT; 446 AA.

AC P47294;

DT 01‑FEB‑1996 (REL. 33, CREATED)

DT 01‑FEB‑1996 (REL. 33, LAST SEQUENCE UPDATE)

DT 01‑FEB‑1996 (REL. 33, LAST ANNOTATION UPDATE)

DE SIGNAL RECOGNITION PARTICLE PROTEIN (FIFTY‑FOUR HOMOLOG). . . .

SCORES Init1: 697 Initn: 773 Opt: 1038

44.7% identity in 445 aa overlap

10 20 30 40 50 60

sr54_m MGFGDFLSKRMQKSIEKNMKNSTLNEENIKETLKEIRLSLLEADVNIEAAKEIINNVKQK

| ::||: : ::::|:::: |::|:::: :|||||::||:||||: ::|::|:::::|

sr54_m MFKAMLSSIVMRTMQKKINAQTITEKDVELVLKEIRIALLDADVNLLVVKNFIKAIRDK

10 20 30 40 50

70 80 90 100 110 120

sr54_m ALGGYISEGASAHQQMIKIVHEELVNILGKENAPLDINKKPSVVMMVGLQGSGKTTTANK

::| |: |:: :: ::|::::||:|||:: |: |: :|:| :|||||||||||| :|

sr54_m TVGQTIEPGQDLQKSLLKTIKTELINILSQPNQELN‑EKRPLKIMMVGLQGSGKTTTCGK

PATTERN MATCHING

Homology searches are used for finding related sequences and allow you to search with relatively long sequences. Pattern searches are used for finding short sequence patterns in a single sequence, in a group of sequences or in the databases. For instance, the pattern "GDSGGP" is typical of serine proteases. When a sequence or database of sequences is analyzed with respect to this pattern the program only decides whether there is an exact match or not. Consequently, in pattern matching one is not concerned with gaps and substitution matrices as in homology searches.

Patterns may be more complex than the one shown above for serine proteases. The pattern corresponding to motif A of the ATP/GTP-binding site, for instance is:

[AG]-x(4)-G-K-[ST]

which is to be interpreted as:

One amino acid that can be either A or G, followed by any four amino acids, followed by G and K and finally one amino acid which is either S or T.

For instance,

AGRCGGKT

GGLGGGKT

AGRSGGKS

AGAAGGKT

are all examples of sequences that will match the pattern.
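A pattern of this kind maps directly onto a regular expression. The Python sketch below is not the GCG FINDPATTERNS program; the function name is made up, and only the elements used above plus the {..} "any residue except" notation are handled (anchors and other PROSITE features are ignored). Elements are assumed to be separated by "-", so the serine protease pattern is written G-D-S-G-G-P here.

```python
import re

ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        parsed = ELEMENT.fullmatch(element)
        if parsed is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = parsed.groups()
        if residue == "x":
            core = "."                            # any amino acid
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"     # any amino acid except these
        else:
            core = residue                        # one residue or an [..] choice
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
print(regex)
for peptide in ("AGRCGGKT", "GGLGGGKT", "AGRSGGKS", "AGAAGGKT", "AGRCGGKA"):
    print(peptide, bool(re.fullmatch(regex, peptide)))
```

To scan a longer protein sequence for all occurrences, re.finditer(regex, sequence) can be used in place of re.fullmatch.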

(The GCG program FINDPATTERNS uses this type of search. The search pattern is your own or a predefined one such as those in PROSITE. The program MOTIFS specifically searches a protein sequence or set of sequences for the PROSITE motifs. In the GCG program package the PROSITE information is in a file "prosite.patterns". This file may be retrieved by typing at the command line : % fetch prosite.patterns)

Pattern searches are very useful when you encounter a protein and you have no information about its structure or biochemical function. By searching with PROSITE motifs you may, if you are lucky, obtain a clue as to the structure or function of the protein.

Examples of data in PROSITE: