Computers in DNA Technology
An Introduction to the software tools and bioinformatics techniques available to the biotechnology scientist.
By
Elias Eliopoulos
Genetics Laboratory,
Department of Agricultural
Biotechnology,
The Agricultural University
of Athens,
GREECE.
Introduction
1. Computers in Experimental DNA Research.
1.1. DNA Sequencing Software.
1.2. Genome Mapping
1.3. Primer Design Software
1.3.1 Features
1.3.2 Criteria for primer and primer pair formation
2. Information Retrieval and Analysis.
2.1. Making Sense of Sequences
2.2. The Sequence Databases.
2.2.1 The EMBL Nucleotide Sequence Database
2.2.2 A Brief Outline of DDBJ
2.2.3 The GenBank DNA Sequence Database
2.2.4 The Brookhaven Protein Structure Databank
2.2.5 The SWISS-PROT Protein Sequence Databank
2.2.6 Some Meta databases
2.3. Sequence analysis and retrieval programs
2.3.1 Sequence Similarity Searching Software
2.3.2 Multiple Sequence Alignments
2.3.3 Sequence Analysis Utilities
3. The Internet.
3.1. Bioinformatics and the Internet.
3.2. Information Retrieval Through the World Wide Web
3.2.1 Databases.
3.2.2 Biocomputing through the Web.
Scientists
working in the field of gene discovery often find themselves having to move
from their chemistry bench to their personal computer or workstation in order
to design their DNA experiments, execute them, process, analyze and understand
their experimental results, communicate them to the scientific community, and
use the combined knowledge to redefine the goals of their research.
This chapter highlights tools and techniques that enable DNA researchers to
extract and manipulate genomic information through the use of information
technology. It is divided into three parts.
Part 1 deals with the computer as an indispensable experimental tool for
controlling experiments and for collecting and analyzing raw data. In
particular, it covers automatic and semiautomatic DNA sequencing, genome
mapping and primer design.
Part 2 is concerned with information retrieval at the level of DNA and protein
sequences, and also touches on text retrieval and structure analysis. A new
fragment of DNA off the sequencer would be just another string of A, T, C and
G, the four nucleotides, if it could not be correlated with the billions of
already known and characterized fragments, which are produced, analysed and
documented at an increasing pace, now about a million base pairs a day. Making
sense of a nucleotide sequence is only the first step in molecular
biotechnology, where understanding structure and function is the ultimate goal.
Through sequence databases, sequence analysis programs and modelling software
the scientist finds ways of making the most of the experimental results and
enhances the understanding of biological and molecular processes.
Part 3 describes the links between Bioinformatics and the Internet, two of
computer science's boom areas, through network computational tools and services
for making sense of new DNA and protein sequences. The major DNA and protein
databases are described, as well as the bioinformatics centers offering
services for organizing and analyzing biological data.
1.1 DNA
Sequencing Software.
Sequencing, whether of protein or DNA, has always been a cumbersome, repetitive
procedure requiring long experimental times and close attention to the details
of the raw data, be it a spectrometer output or a gel image [1]. From the early
days of conventional protein sequencing the instrument has been microprocessor
controlled, with the detector data fed to a microcomputer that controlled the
instrument through predesigned software modules and analysed the data. In early
DNA sequencing the computer was introduced to perform the image analysis of the
autoradiograph of the sequencing gel (figure 1) [2]. The arrival of automated
sequencers revolutionized the field of bioinformatics by enabling
biotechnologists to catalogue sequence information hundreds of times faster
than was possible with preexisting scanning techniques.
Two major sequencing techniques are in use today: manual sequencing using
radioactive isotopes [3] and film exposure, and automated sequencing using
fluorescent dyes [4].
The manual or semi-automatic approach involves running a gel with radioactive
isotope labels, exposing it to film, and then scanning the film and processing
the information on it. Autoradiography is very difficult to automate, largely
because reaction products have to be run in four different gel lanes, and
lane-to-lane variations in electrophoretic mobility are difficult to interpret.
Advances in computerised image processing have helped to automate the last
step. Once the gel data have been digitised using a hand-held or flatbed
scanner, many programs are available to help interpret them as a string of
bases. The process first involves aligning the gel lanes using a
high-resolution graphical interface on the computer, in order to accommodate
distortions in the gel and other geometrical artifacts (figure 2). Some
programs [5,6] can detect the lanes in groups of four on the electropherograms
automatically and proceed to resolve ambiguities, call bases, align sequences
and assemble contigs. Others [7] just provide tools such as rulers, colours,
background processing or geometrical distortion correction, leaving the actual
choice of base to the user. One invaluable facility that all the programs share
is that the base sequence is automatically recorded and stored in the common
sequence formats (PIR, FASTA) for later use in other programs, removing the
cumbersome and error-prone process of recording the sequence, typing it and
checking it. Assembly managers of this type are claimed to handle 1000
individual sequences and build a contig of up to 50 kilobases in a couple of
minutes.
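As an illustration of what storing a called sequence in a common format
involves, the following minimal sketch writes and reads FASTA records. The
function names, the 60-character line width and the example record are
assumptions made for this illustration; they are not taken from any of the
programs cited above.

```python
def to_fasta(name, sequence, description="", width=60):
    """Format one sequence as a FASTA record with wrapped sequence lines."""
    header = f">{name} {description}".rstrip()
    # Remove any whitespace inside the sequence and normalise to upper case.
    seq = "".join(sequence.split()).upper()
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines) + "\n"


def parse_fasta(text):
    """Return {name: sequence} for every record in a FASTA-formatted string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after the '>' marker.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


if __name__ == "__main__":
    # "clone42" is a hypothetical read name used only for this example.
    print(to_fasta("clone42", "acgt" * 20, "gel read"), end="")
```

Because the record is plain text, the same file can be passed unchanged to
alignment, assembly or database search programs.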
The replacement of conventional autoradiography
techniques with real time fluorescence detection [8] has revolutionised
automated DNA sequencing. The use of multiple wavelength fluorescence detection
allows dideoxysequencing ladders, generated using four distinct fluorescent
dyes, to be analysed in a single electrophoresis lane. This eliminates the lane
to lane electrophoretic variability and at the same time offers fast automatic
detection through the use of laser technology and computer algorithms without
significant human intervention (figure 3).
The automatic interpretation and storage of the DNA sequence is only the first
step in automated sequence processing. Specialised software [9] is used to
clean sequence files of vector sequences and ambiguously called bases prior to
assembly and analysis. Once clean data have been obtained, several utilities
are available to the scientist for desktop processing. These include creating
DNA and RNA sequence files and conceptually translating DNA/RNA codons into
amino acid sequence files; performing pairwise and multiple DNA and amino acid
sequence alignments; comparing matches or mismatches between sequences and
creating ambiguity, unanimity or consensus sequences; searching for patterns
and restriction sites; and editing and exporting facilities [10].
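Two of these utilities, conceptual translation and restriction site searching,
can be sketched in a few lines. The sketch below assumes the standard genetic
code and reads a single frame starting at the first base; the function names
are illustrative and do not correspond to the software in [10].

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate a DNA or RNA sequence in frame 1; '*' marks a stop codon."""
    seq = sequence.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        # Codons containing ambiguous bases are reported as 'X'.
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


def find_sites(sequence, site):
    """Return the 1-based start positions of every occurrence of `site`."""
    seq, site = sequence.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + 1)
        start = seq.find(site, start + 1)  # allow overlapping matches
    return positions


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
    # GAATTC is the recognition sequence of EcoRI.
    print(find_sites("AAGAATTCGGGAATTC", "GAATTC"))
```

A fuller utility would translate all six reading frames and take its
recognition sequences from a restriction enzyme database.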
When many overlapping fragments of DNA have been sequenced, the laborious task
of sequence assembly can also be automated, making it fast and accurate.
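The idea behind automated assembly can be sketched as a greedy merge of the
fragments with the longest exact suffix-prefix overlap. This is a deliberately
simplified illustration: it assumes error-free reads that all lie on the same
strand, and it is not the algorithm of any particular assembly package.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_overlap=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGTAC", "GTACCTGA", "CTGAAATT"]))
```

Real assemblers must in addition tolerate base-calling errors, consider
reverse-complemented reads and weigh base quality when building the consensus.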
1.2. Genome
Mapping.
Moving from DNA sequencing to genome mapping [11], fluorescent dye technology
has helped to automate techniques such as sizing and quantitation using an
internal lane standard. The emission wavelength of the fluorophore-labelled
internal lane standard is recognised automatically, and the known fragment
lengths and quantities of the standard are used by the software to create a
calibration curve. PCR product sizes are determined automatically (figure 4)
from the calibration curve, and since the products run in the same lane of the
electropherogram as the standard there are no electrophoretic mobility
variations. In addition, the simultaneous use of four different fluorescent
dyes allows routine PCR analyses to be multiplexed [12].
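The sizing step can be illustrated with a small sketch in which the calibration
curve is approximated by straight-line interpolation between the two standard
peaks that bracket the unknown peak. The interpolation scheme, the function
name and the example values are assumptions for illustration; commercial
software may fit a different curve.

```python
from bisect import bisect_left


def size_from_standard(migration, standard):
    """Estimate a fragment size (bp) from its migration time.

    `standard` is a list of (migration_time, size_bp) pairs for the peaks of
    the internal lane standard; at least two peaks are required.
    """
    points = sorted(standard)
    times = [time for time, _ in points]
    if not times[0] <= migration <= times[-1]:
        raise ValueError("peak lies outside the range of the size standard")
    # Index of the first standard peak at or after the unknown peak.
    upper = max(1, bisect_left(times, migration))
    (t0, s0), (t1, s1) = points[upper - 1], points[upper]
    return s0 + (s1 - s0) * (migration - t0) / (t1 - t0)


if __name__ == "__main__":
    # Hypothetical standard: migration time in minutes, size in base pairs.
    lane_standard = [(10.0, 100), (20.0, 200), (30.0, 300), (42.0, 400)]
    print(size_from_standard(25.0, lane_standard))
```

Because the standard and the sample share a lane, the same list of standard
peaks calibrates every product detected in that lane.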
Using this technology, DNA fragment analysis and quantitation can be performed
quickly and reliably. Flexible software and a broad linear dynamic range for
fluorescence detection make data editing and interpretation easy [13], even
when there is a large variation in DNA concentration between fragment bands.
Fragment size can be determined precisely even if alleles differ by as little
as one base pair.
Microsatellites, also known as simple sequence tandem repeat polymorphisms, are
fast becoming the markers of choice for linkage mapping of genes [14]. Because
they are numerous, polymorphic and far easier to find than classical
restriction fragment length polymorphisms (RFLPs), they are steadily replacing
the lengthy RFLP analyses in current use. Their variants (alleles) appear as
PCR products of different sizes. These products are fluorescently labelled and
can be analysed precisely and automatically. Microsatellite alleles can be
determined accurately even if they differ by as little as one base pair,
allowing gene mapping to be completed three to four times faster. The
processing software not only interprets the raw detector data but can also
automatically quantify the fluorescence signal by calculating peak height and
area [15] (figure 5).
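The peak height and area calculation can be sketched as follows, assuming that
the peak boundaries have already been located and that a constant baseline is
subtracted; the area is integrated with the trapezoidal rule. The function and
parameter names are illustrative only.

```python
def peak_metrics(times, signal, start, end, baseline=0.0):
    """Return (height, area) of the peak spanning samples start..end inclusive."""
    xs = times[start:end + 1]
    # Subtract the baseline and clip negative values to zero.
    ys = [max(value - baseline, 0.0) for value in signal[start:end + 1]]
    height = max(ys)
    # Trapezoidal rule: sum the area of each strip between adjacent samples.
    area = sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
        for i in range(len(xs) - 1)
    )
    return height, area


if __name__ == "__main__":
    # Hypothetical detector trace: time points and fluorescence readings.
    print(peak_metrics([0, 1, 2, 3, 4], [0, 2, 6, 2, 0], 0, 4))
```

Height is the quicker estimate, while area is less sensitive to peak
broadening; reporting both lets the user choose.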
Changes as slight as single base substitutions can alter the conformation of
single-stranded DNA molecules and result in different electrophoretic
mobilities in a non-denaturing gel, which allows wild-type and mutant DNA, or
different alleles, to be readily distinguished from each other. This PCR-SSCP
analysis, in combination with fluorescent dye technology, is fully automated
through the use of computer-controlled electrophoretic devices and
microcomputer analysis.
Genotyping applications based on tabulating the analysed fragment data extend
further.
For linkage mapping, a table of alleles can be constructed from the
interpretation of peak data.
For AFLP, a comparative analysis table of polymorphic peaks shows the presence
or absence of marker peaks.
In association with a pedigree database, a Mendelian inheritance check can be
performed for disease gene mapping, forensic research or paternity testing.
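A Mendelian inheritance check on an allele table can be sketched as below. The
data layout (a dictionary of genotypes per person and a list of
child-mother-father trios) is a hypothetical stand-in for a pedigree database,
and the allele sizes are invented for the example.

```python
def is_mendelian(child, mother, father):
    """True if the child's two alleles can be drawn one from each parent."""
    first, second = child
    return (first in mother and second in father) or (
        second in mother and first in father
    )


def check_pedigree(genotypes, trios):
    """Return (child, marker) pairs that are inconsistent with inheritance.

    genotypes: {person: {marker: (allele1, allele2)}}, alleles as sizes in bp.
    trios: list of (child, mother, father) names.
    """
    errors = []
    for child, mother, father in trios:
        for marker, alleles in genotypes[child].items():
            if not is_mendelian(
                alleles, genotypes[mother][marker], genotypes[father][marker]
            ):
                errors.append((child, marker))
    return errors


if __name__ == "__main__":
    genotypes = {
        "mother": {"M1": (120, 122), "M2": (200, 204)},
        "father": {"M1": (124, 126), "M2": (202, 202)},
        "child": {"M1": (120, 124), "M2": (206, 202)},
    }
    print(check_pedigree(genotypes, [("child", "mother", "father")]))
```

An inconsistency flagged in this way may reflect a sizing error, a mutation or
a mistaken pedigree, so it is normally reviewed rather than discarded.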
1.3. Primer Design
Software.
Many computer applications [16,17,18] are available to find suitable primers or
to form optimal primer pairs as probes for PCR applications. They differ in
their graphical interface, ease of use and available options. The primers
selected by the software should satisfy as fully as possible the various
requirements for highly specific and efficient amplification, and may be used
in a broad range of PCR applications, as sequencing primers and as
hybridization probes.
For example, a check for repeated sequences at the beginning of the design [18]
is useful in order to avoid non-specific annealing to the DNA. This is
especially necessary for small, known sequences such as plasmids, viruses or
mitochondrial DNA. The annealing of the primers with themselves, with each
other (for primer pairs) and with the DNA should also be tested.
1.3.1 Features.
Several features of primer design are common to the various computer programs:
· Creation of new primer pairs.
An automatic search for and selection of the best primers, and the formation of
optimal primer pairs for PCR, can be performed. Some programs process one or
several nucleotide sequences simultaneously and select primers that are either
universal for all the sequences considered or specific for only one of them.
The “universal” option is useful for the development of PCR-based diagnostic
systems. The “specific” mode yields pairs of primers that specifically amplify
fragments of only one nucleotide sequence out of all those supplied. All
primers are tested for their capability of binding at non-specific places
(false priming) on the target sequence and on all other sequences included in
the test.
· Creation of a suitable partner for a given primer.
With this option the search for specific primers is performed in one or in many
related nucleotide sequences in order to form primer pairs (figure 6). The user
may type in the sequence of the oligonucleotide to be tested; besides the
common symbols A, G, C and T, ambiguous bases may be included.
· Finding unique sequences within a sequence.
This feature is useful for selecting primers that will amplify a certain
fragment within a specific sequence (in specific mode) or the respective
fragments of all supplied sequences (in universal mode). The user locates the
left (5') and right (3') boundaries of the target fragment on the first
sequence, and the best primers are searched for within close zones (100
nucleotides) upstream and downstream of the fragment.
The programs allow primers to be searched for and selected within sites of
interest in the sequences, or primers to be selected and primer pairs formed
that amplify a fragment of interest. This may be necessary when selecting
primers for site-directed mutagenesis or when preparing for cloning or
sequencing.
· Setting temperature limits.
Setting “temperature limits” allows the user to set permissible lower and upper
limits for the annealing temperature of primers, or a permissible difference
between the annealing temperatures of the two primers of a pair. Temperatures
are determined by the nearest-neighbor method, taking into account the
nucleation constant. The annealing temperature is estimated with formulas such
as
TA = 62.3 + 0.41 * %GC - 500 / ntprimer
where %GC is the percentage of G and C bases in the primer and ntprimer is the
primer length in nucleotides.
· Assessing existing primers or primer pairs.
This option allows candidate oligonucleotides to be tested against any sequence
and/or for some intrinsic properties of the oligonucleotide itself. In the
first case, the program reports the annealing temperature of the primer,
allowing for possible partial non-complementarity between the primer and the
target sequence. This is the temperature to be used during the first 4-5 cycles
of the PCR. In the second case, the length, melting temperature, annealing
temperature and molecular weight of the oligonucleotide are displayed, together
with any probable stem-loop (hairpin) structures.
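The simple annealing temperature formula quoted under "Setting temperature
limits" (TA = 62.3 + 0.41 * %GC - 500 / primer length) can be evaluated as in
the sketch below. The function name is illustrative, and the sketch does not
implement the nearest-neighbor calculation that the programs also use.

```python
def annealing_temperature(primer):
    """Estimate TA (degrees C) as 62.3 + 0.41 * %GC - 500 / primer length."""
    seq = primer.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("primer must be a non-empty string of A, C, G and T")
    gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
    return 62.3 + 0.41 * gc_percent - 500.0 / len(seq)


if __name__ == "__main__":
    # A hypothetical 20-mer with 50% GC content.
    print(round(annealing_temperature("ATGCATGCATGCATGCATGC"), 1))
```

For a 20-mer with 50% GC this gives 62.3 + 20.5 - 25 = 57.8 C, which falls
within the 35 C to 73 C limits mentioned later in the chapter.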
Some other features, available but not common to all programs, include:
· Finding repeats within a sequence.
· Handling long sequences of up to 40,000 bp.
· Looking for primers that include restriction sites.
· Designing mispairing primers to create restriction sites.
All searches for suitable primers can be carried out according to different
criteria.
a. Energetic criteria allow the user to set the number of energetically most
favourable sites on the sequences for binding of the primer under test. The
energy values considered are MAX.ENERGY (Gm), the binding energy (strictly, the
Gibbs standard free energy) of the primer on the template at a specific place,
and DELTA (Gm - Gs), the difference between the binding energy of the primer to
the template at the specific site and that at the most favourable non-specific
place (the difference between the energetic maximum and submaximum) (figure 7).
b. Complementarity or homology criteria find sites on the sequences where the
examined oligonucleotide has a level of complementarity or homology,
respectively, equal to or exceeding a threshold set by the user.
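The complementarity criterion can be sketched as a sliding-window scan that
reports every template site whose fraction of complementary bases reaches the
threshold. This is a simplification that counts matching positions only; it
does not compute the binding energies used by the energetic criterion, and the
names and example sequences are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def complementary_sites(template, primer, threshold=0.8):
    """Return (position, fraction) for sites complementary to the primer.

    `template` is the strand the primer anneals to, written 5'->3'.
    Positions are 1-based starts of the site on the template.
    """
    template = template.upper()
    # A perfectly complementary site reads as the primer's reverse complement.
    target = reverse_complement(primer)
    length = len(target)
    hits = []
    for start in range(len(template) - length + 1):
        window = template[start:start + length]
        matches = sum(a == b for a, b in zip(window, target))
        if matches / length >= threshold:
            hits.append((start + 1, matches / length))
    return hits


if __name__ == "__main__":
    print(complementary_sites("TTTATGGATCCTTTATGGTTCC", "GGATCCAT", 0.8))
```

In this example the first site is fully complementary and the second contains
one mismatch, so both pass an 80% threshold; raising the threshold to 0.9
leaves only the specific site.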
1.3.2 Criteria for
primer and primer pair formation.
The program should pay particular attention to evaluating the specificity of
the selected primers. A selected primer must be able to bind and extend
reliably at a specific place on the right sequence (if the working mode is
"specific") or at the respective sites on all the sequences considered (if the
mode is "universal"), and must have the lowest possible probability of
primer-template binding and extension at any other place. The energetic
criterion is used to estimate the reliability of specific binding and the
probability of non-specific interactions. It is based on the assumption that,
if a primer interacts with the template DNA in a random way, the probability
and reliability of primer-template binding are defined primarily by the
thermodynamic stability of the duplex and are accurately described by the Gibbs
standard free energy (dG).
The program should be able to adapt to the sequences examined. Selection
criteria should have stringent default values that can be relaxed automatically
or manually when no primer pairs are predicted within the set upper and lower
limits.
Besides being highly specific, the selected primers must also have some of the
following properties:
a. they should contain no sequence repeat structure;
b. they should contain no large GC-rich sites and no long runs of the same
nucleotide;
c. they should be free of hairpins (the annealing temperature of the primer
must be at least 25 C higher than the melting temperature of the most stable
hairpin that the primer could form);
d. they should not form any dimer structures;
e. their annealing temperature should be within set limits (but not lower than
35 C and not higher than 73 C);
f. particular attention should be paid to the 3'-ends of primers, since their
properties determine whether or not the primer can be extended. For example,
universal primers must have at least three complementary nucleotides at their
3'-termini at all target places on all sequences. On the other hand, during the
search for an energetic submaximum (the place of most favourable non-specific
binding), all sites where the primer has complementarity of even two 3'
nucleotides are considered as sites of possible primer extension.
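Some of the properties listed above can be screened with simple string
operations, as in the following sketch. It covers only long runs of one
nucleotide (property b), a crude 3'-end self-dimer test (property d) and the
annealing temperature limits (property e); the run limit of four and the
four-base 3' window are assumed defaults, and hairpin and repeat analysis are
left out.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of an upper-case DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def longest_run(sequence):
    """Length of the longest stretch of one repeated nucleotide."""
    return max(len(match.group(0)) for match in re.finditer(r"(.)\1*", sequence))


def three_prime_self_dimer(sequence, length=4):
    """True if the last `length` bases can pair with a site on a second copy."""
    return reverse_complement(sequence[-length:]) in sequence


def screen_primer(primer, t_anneal, max_run=4, t_min=35.0, t_max=73.0):
    """Return a list of problems found; an empty list means the primer passes."""
    seq = primer.upper()
    if not seq:
        raise ValueError("primer must not be empty")
    problems = []
    if longest_run(seq) > max_run:
        problems.append("homopolymer run")
    if three_prime_self_dimer(seq):
        problems.append("3'-end self-dimer")
    if not t_min <= t_anneal <= t_max:
        problems.append("annealing temperature out of limits")
    return problems


if __name__ == "__main__":
    # Hypothetical primers with assumed annealing temperatures.
    print(screen_primer("ATGCAGACCAGTTACGGACT", 57.8))
    print(screen_primer("AAAAAAGGATCCGGATCC", 80.0))
```

Returning the list of failed checks, rather than a single yes or no, makes it
easier to relax individual criteria when no primer passes the defaults.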
To form primer pairs, the following is additionally required:
a. the difference between the annealing temperatures of the two primers must be
no more than the value set by the user;
b. the size of the fragment flanked by the primers must be within the limits
set by the user;
c. the primers should not bind to each other (the maximal possible energy of
their binding must not exceed the submaximum of each primer);
d. the 3'-end of one primer must not bind to any site on the other;
e. neither primer may form a hairpin at the annealing temperature of the other
primer in the pair;
f. the lengths of the fragments of different sequences amplified by universal
primers must not differ from each other by more than a value set by the user.
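Requirements a, b and d for a primer pair can be illustrated with the sketch
below. The default temperature difference of 5 C, the product size limits and
the four-base 3' window are assumptions chosen for the example, and the
energy-based requirements (c and e) are not modelled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of an upper-case DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def three_prime_binds(primer, other, length=4):
    """True if the last `length` bases of `primer` can pair with `other`."""
    return reverse_complement(primer[-length:]) in other


def check_pair(forward, reverse, t_forward, t_reverse, product_size,
               max_t_diff=5.0, size_limits=(100, 1000)):
    """Return a list of problems found; an empty list means the pair passes."""
    forward, reverse = forward.upper(), reverse.upper()
    problems = []
    if abs(t_forward - t_reverse) > max_t_diff:
        problems.append("annealing temperatures differ too much")
    low, high = size_limits
    if not low <= product_size <= high:
        problems.append("product size out of limits")
    if three_prime_binds(forward, reverse) or three_prime_binds(reverse, forward):
        problems.append("3'-end of one primer binds the other")
    return problems


if __name__ == "__main__":
    # Hypothetical primers, annealing temperatures and product size.
    print(check_pair("ATGCAGACCAGTTACGGACT", "TTGACCGTAAGCTGATCCGA",
                     57.8, 59.0, 450))
```

The limits are passed as parameters so that, as the text suggests, they can be
relaxed when no acceptable pair is found with the default values.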
2.1 Making Sense
of Sequences.
Over the past few years the growth of information concerning genomic data has
been exponential [19]. From the few hundred protein sequences of the 80's,
databases now hold 70,000 sequences or 650 million nucleotide bases. Complete
genomes of bacteria and yeast have been sequenced, and by the year 2005 the 3
billion bases of the human genome are expected to be deciphered.
For this information to be useful, it is important first to be able to locate
the true genomic information in the determined nucleotide sequences: the actual
genes expressing the proteins, the workhorse molecules of life, which in the
case of the human genome amount to only 3% of the total.
The task is then to determine the function of the expressed protein molecules
by finding related genes and proteins whose function is already known. Since
the functionality of a protein is closely linked to its three-dimensional
structure (the way the protein folds in its proper environment) and to how its
subunits interact with each other and with external ligands, it is necessary to
be able to define the conformation of the protein molecule by comparison with
other proteins of known structure or by predicting its fold from its primary
(amino acid) sequence.
To reach the ultimate goal of fully understanding the role of a protein in a
biological process, one should know the fine details of the interactions of the
molecules concerned and be able to redesign the protein, to attempt to redirect
or adjust its function, or to make useful ligands, inhibitors or antagonists to
its natural ligand in order to benefit from their functionality as drugs or
catalysts.
Computational biology has attacked these issues in many different ways over the
last twenty years, and what follows is just a fraction of the tools available
to biotechnologists to enhance their understanding of biological processes
through genomic information.
2.2 The Sequence Databases.
The sequence databases fall into two categories: primary source databases,
which are built from DNA and RNA sequences collected from the scientific
literature and patent applications or submitted directly by researchers and
sequencing groups (EMBL [20], GENBANK [21], DDBJ, SWISSPROT [22], PDB [23,24]),
and metabases, whose entries are consolidated from the primary source databases
after extensive processing and cross-correlation of entries (TREMBL [22], EST
[30], STS [32] etc.). The three major primary databases, EMBL, GENBANK and
DDBJ, are produced as a collaborative effort between the European Molecular
Biology Laboratory, the NCBI GenBank group (USA) and the DNA Database of Japan
(DDBJ). Each of the three groups collects a portion of the total sequence data
reported worldwide, and all new and updated database entries are exchanged
between the groups on a daily basis. Since detailed analyses of the databases
are given elsewhere [26], a short description of the primary source databases
and some metabases follows, together with some aspects of their search engines.
Table 1 contains the internet addresses of the databases described.
2.2.1 The EMBL
Nucleotide Sequence Database
The
EMBL Nucleotide Sequence Database is composed of DNA and RNA sequence entries.
Each entry corresponds to a single contiguous sequence as contributed or
reported in the literature. In many cases, entries have been assembled from
several papers reporting overlapping sequence regions. Conversely, a single
paper often provides data for several entries, as when homologous sequences
from different organisms are compared. The database currently doubles in size
approximately every 12 months and, as of June 1997, contains over 931
million bases from 1,432,941 sequence entries.
The
nucleotide sequence data are generally present in the database as they have
been published, subject to some conventions which have been adopted for the
database as a whole. The sequences are always listed in the direction 5' to 3',
regardless of the published order. Bases are numbered sequentially beginning
with 1 at the 5' end of the sequence.
The
sequences are presented in the database in a form corresponding to the
biological state of the information in vivo. Thus, cDNA sequences are stored in
the database as RNA sequences, even though they usually appear in the
literature as DNA. For genomic data, the coding strand is stored. Data
containing coding sequences on both strands are stored according to the
prevailing conventions in the literature. The stored data generally correspond
to wildtype sequences before mutation or genetic manipulation.
Sequences
of tRNA molecules are stored as unmodified RNA sequences (equivalent to the
mature transcript before any base modification occurs). This form (colinear
with the genomic sequence) has been adopted to simplify both storage and
analysis of the sequences. Thus, a modified base appears in the sequence as the
corresponding unmodified base. However, each base modification is noted in the
feature table, so that the mature tRNA sequence can be restored automatically
by a simple computer program if this is desirable. The two-letter code used by
Sprinzl and Gauss (published in Nucleic Acids Research in the first issue of
each year) has been adopted for abbreviation of modified bases in the feature
table.
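The "simple computer program" mentioned above could look like the following
sketch, which replaces bases of the stored, unmodified sequence by the
modification codes listed for them. The representation of the feature table as
(position, code) pairs is an assumption, and the codes "x1" and "x2" in the
example are placeholders, not actual Sprinzl and Gauss abbreviations.

```python
def restore_mature_trna(sequence, modifications):
    """Return the mature tRNA as a list of bases and modification codes.

    sequence: the unmodified RNA sequence as stored in the database.
    modifications: iterable of (position, code) pairs, positions 1-based
    and counted from the 5' end, following the database numbering.
    """
    bases = list(sequence.upper())
    for position, code in modifications:
        if not 1 <= position <= len(bases):
            raise IndexError(f"position {position} is outside the sequence")
        # Replace the unmodified base by the code of the modified base.
        bases[position - 1] = code
    return bases


if __name__ == "__main__":
    # Hypothetical sequence fragment and placeholder modification codes.
    print(restore_mature_trna("GCGGAUUUAG", [(4, "x1"), (8, "x2")]))
```

A list is returned rather than a string because a two-letter code would
otherwise shift the positions of all the bases that follow it.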
2.2.2 A Brief Outline
of DDBJ
DDBJ began DNA data bank activities in Japan in 1984 with the endorsement of
the Ministry of Education, Science, Sport, and Culture and of representative
Japanese molecular biologists. DDBJ is the sole DNA data bank in Japan
officially certified to collect DNA sequences, mainly from Japanese
researchers, and to issue accession numbers to data submitters.
DDBJ is designed to operate as one of the International DNA Databases, together
with EMBL in Europe and GENBANK in the USA.
2.2.3 The GenBank DNA Sequence Database
GenBank [21] is the U.S. National Institutes of Health (NIH) database of all
known nucleotide and protein sequences, including supporting bibliographic and
biological information. Since 1992 it has been based at the National Center for
Biotechnology Information (NCBI), a division of the National Library of
Medicine, located on the NIH campus.
The
data is made available at no cost through the Internet, either by downloading
database files or by text and sequence similarity search services.
As of Release 90.0 in August 1995, GenBank contained 353,713,490 nucleotide
bases from 492,483 different sequences. Although human entries predominate,
constituting 54% of the total, more than 15,500 species are represented. After
Homo sapiens, the top species in terms of bases include C. elegans, S.
cerevisiae, Mus musculus and Arabidopsis thaliana. Historically, the database
has been doubling in size every 22 months, but that rate has rapidly
accelerated due to the enormous growth in data from expressed sequence tags, or
ESTs. Over 56% of the sequences in the current release are ESTs.
The
data in GenBank come from two primary sources: 1) authors who submit data
directly to the collaborating databases, and 2) annotators at NCBI who extract
the information from relevant journals. GenBank has a journal scanning
operation to scan the current literature from over 3500 journals and identify
sequences which have been published but were not submitted directly by authors.
This operation has also proven successful in updating publication information
and in identifying sequences that had been submitted confidentially and should
be released.
Each
GenBank entry includes a concise description of the sequence, the scientific
name and taxonomy of the source organism, and a table of features that
identifies coding regions and other sites of biological significance, such as
transcription units, sites of mutations or modifications, and repeats. Protein
translations for coding regions are included in the feature table.
Bibliographic references are included, along with a link to the Medline unique
identifier for all published sequences [27].
· Retrieving GenBank Data with the ENTREZ
System
Entrez
is an integrated database retrieval system [28] which accesses DNA and protein
sequence data, related MEDLINE references, a collection of genome data and 3-D
structures from MMDB [29]. The DNA and protein sequence data are integrated
from a variety of sources, including GenBank, EMBL, DDBJ, dbEST, dbSTS, GSDB,
PIR, SWISS-PROT, Protein Research Foundation (PRF), PDB and patents. The
MEDLINE references are papers indexed under the NLM's Medical Subject Heading
(MeSH), 'genetics'. The genome dataset includes information on the large-scale
organization of completely sequenced chromosomes or genomes, such as physical
and genetic maps and aligned sequences.
The
DNA sequence, protein sequence, MEDLINE, genome and 3-D structure data are
linked to provide easy traversal among the datasets. Entrez provides an entry
point into sequence or bibliographic records by simple Boolean queries. From
the record, hypertext links may be used to navigate through the information
space using a point-and-click interface. Some of the links are simple
cross-references, for example, between a sequence and the abstract of the paper
in which the sequence was reported, or between a protein sequence and its
corresponding DNA sequence. Other links, however, are based on computed
similarities among the sequences or among the textual documents. The
precomputed 'neighbors' allow very rapid access for browsing groups of related
records.
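The principle of a Boolean query over indexed records can be shown with a toy
inverted index. This is only an illustration of the idea; it makes no claim
about how Entrez is implemented, and the record identifiers and texts are
invented.

```python
def build_index(records):
    """Map each lower-cased word to the set of record ids that contain it."""
    index = {}
    for record_id, text in records.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(record_id)
    return index


def search(index, all_of=(), any_of=(), none_of=()):
    """Combine terms with AND (all_of), OR (any_of) and NOT (none_of)."""
    result = set().union(*index.values()) if index else set()
    for term in all_of:
        result = result & index.get(term.lower(), set())
    if any_of:
        alternatives = set().union(
            *(index.get(term.lower(), set()) for term in any_of)
        )
        result = result & alternatives
    for term in none_of:
        result = result - index.get(term.lower(), set())
    return result


if __name__ == "__main__":
    # Hypothetical records keyed by made-up identifiers.
    records = {
        "rec1": "human insulin gene",
        "rec2": "mouse insulin mRNA",
        "rec3": "human globin gene",
    }
    index = build_index(records)
    print(sorted(search(index, all_of=["insulin"], none_of=["mouse"])))
```

Precomputed 'neighbors' follow the same spirit: the expensive similarity
comparison is done once, in advance, so that browsing only needs a lookup.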
The
server/client network version of Entrez operates with a client program on a
user's machine over the Internet to a server located at the NCBI. Client
programs for Macintosh, PC, and Unix computers can be obtained by downloading
from 'ncbi.nlm.nih.gov' in the 'entrez/network' directory. This version
combines the full functionality of the Entrez interface with access to an
extended set of 1.2 million Medline citations and to the MMDB structure
database. The sequences are updated daily and the MEDLINE weekly. The Web
version of Entrez also provides access to the same sequence and MEDLINE
datasets through standard Web browsers such as Netscape and Mosaic. Additional
links are provided to external data sources, such as map information from FlyBase
for Drosophila sequence records and to the full text of publicly available
online journal articles.
2.2.4 The Brookhaven Protein Data Bank (PDB)
of macromolecular three dimensional structures.
The
Protein Data Bank (PDB) [23] is an archive of experimentally determined
three-dimensional structures of biological macromolecules, serving a global
community of researchers, educators, and students. The archives contain atomic
coordinates, bibliographic citations, primary and secondary structure
information, as well as crystallographic structure factors and NMR experimental
data. The PDB Newsletter and CD ROM are published quarterly.
2.2.5 The SWISS-PROT
Protein Sequence Data Bank
The SWISS-PROT Protein Sequence Data Bank [22] is a database of protein
sequences produced collaboratively by Amos Bairoch (University of Geneva) and
the EBI. It contains high-quality annotation and is non-redundant. Release 34.0
of SWISS-PROT contains 59,021 sequence entries, comprising 21,210,389 amino
acids abstracted from 50,052 references.
Each entry corresponds to a single contiguous sequence as contributed to the
bank or reported in the literature. In some cases, entries have been assembled
from several papers that report overlapping sequence regions. Conversely, a
single paper can provide data for several entries, e.g. when related sequences
from different organisms are reported.
In SWISS-PROT, as in most other sequence databases, two classes of data can be
distinguished: the core data and the annotation.
For each sequence entry the core data consist of the citation information
(bibliographical references), the taxonomic data (description of the biological
source of the protein) and the sequence data. Except for initiator N-terminal
methionine residues, which are not included in a sequence when their absence
from the mature sequence has been proven, the sequence data correspond to the
precursor form of a protein before post-translational modifications and
processing.
The annotation consists of the description of the following items:
o Function(s) of the protein
o Post-translational modification(s). For example carbohydrates,
phosphorylation, acetylation, GPI-anchor, etc.
o Domains and sites. For example calcium binding regions, ATP-binding sites,
zinc fingers, homeobox, kringle, etc.
o Secondary structure
o Quaternary structure
o Similarities to other proteins
o Disease(s) associated with deficiency(ies) in the protein
o Sequence conflicts, variants, etc.
A major advantage of SWISS-PROT is its comprehensive cross-referencing to other
databases. Currently 25 different databases are cross-referenced.
Cross-references are provided in the form of pointers to information related to
SWISS-PROT entries and found in data collections other than SWISS-PROT.
Cross-references are made in SWISS-PROT to:
o The X-ray crystallography Protein Data Bank (PDB).
o The Mendelian Inheritance in Man data bank (MIM).
o The PROSITE dictionary of sites and patterns in proteins [25].
o The restriction enzymes database (REBASE).
o The G-protein-coupled receptor database (GCRDb).
o The EcoGene section of the EcoSeq/EcoMap integrated Escherichia coli K12
database.
o The StyGene section of the StySeq/StyMap integrated Salmonella typhimurium
LT2 database.
o The gene-protein database of Escherichia coli K12 (2D-gel spots)
(ECO2DBASE).
o The SubtiList relational database for the Bacillus subtilis 168 genome.
o The LISTA database of yeast (Saccharomyces cerevisiae) genes coding for
proteins.
o The human keratinocyte 2D gel protein database.
o The 2D gel protein database (SWISS-2DPAGE).
o The Yeast Electrophoresis Protein Database (YEPD).
o The Harefield Hospital 2D gel protein databases.
o The Drosophila genome database (FlyBase).
o The Maize genome database (MaizeDB).
o The WormPep database.
o The DictyDb database.
o The Human Retroviruses and AIDS compilation of nucleic and amino acid
sequences (HIV Sequence Database).
o The database of Homology-derived Secondary Structure of Proteins (HSSP).
o The transcription factor database (Transfac).
2.2.6 Some Meta databases
It is
important to provide the users
of biomolecular databases with a degree of
integration between the three
types of sequence-related databases (nucleic
acid sequences, protein
sequences and protein
tertiary structures) as
well as with specialized data collections.
ESTs, or 'expressed sequence tags', are the most rapidly expanding source of new genes. Last year there were 50,214 sequences in the EST Division of GenBank (dbEST) [30], with 45% of these derived from human tissues [31]. Currently dbEST contains 328,905 sequences, with 80% coming from about 50 different human tissues or cell types. The remaining 20% are derived from 43 other organisms, with Arabidopsis thaliana (21,056), Caenorhabditis elegans (12,102) and Oryza sativa (11,015) each represented by more than 10,000 ESTs.
The
STS or ‘sequence tagged sites’ division of GenBank (dbSTS) [32] continues to
expand rapidly. The most notable development is the availability of
approximately 2,000 gene-based STSs derived from ESTs and mapped on human
genomic DNA cloned in yeast artificial chromosome (YAC) libraries by the
Whitehead/MIT genome center by T. Hudson and colleagues. dbSTS contains PCR
primers, reaction conditions and map locations for gene-based markers. Such
data are used to support the visualization of transcript maps in the new
genomes division of GenBank.
The
UniGene collection, now accessible through NCBI's Home Page, contains more than
48,000 clusters of sequences, each representing the transcription product of a
distinct human gene. With current estimates of 80,000 to 100,000 genes in the
human genome, this is close to the 50% mark. The clusters are largely based on
EST sequences, so most of the sequences are not complete and most of the genes
have still not been characterized. But one important use of the UniGene
clusters is to identify novel, nonredundant mapping candidates for generating a
transcript map that identifies all coding sequences in the genome.
TREMBL, a computer-annotated supplement to
SWISS-PROT [22] contains the translations of all coding sequences (CDS) present
in the EMBL Nucleotide Sequence Database not yet integrated into SWISS-PROT. It
is organized in five subsections:
a. Immunoglobulins and T-cell receptors
b. Synthetic sequences
c. Patent application sequences (coding
sequences captured from patent applications).
d. Small fragments (fragments with fewer than
seven amino acids).
e. CDS not coding for real proteins (CDS
translations where strong evidence
exists that these CDS are not coding for real proteins).
Molecular
Modelling Database (MMDB)
NCBI
has begun producing a database of macromolecular 3-D structure information
[29], specifically aimed at molecular modelling research. MMDB is based upon
the data in the Brookhaven Protein Data Bank (PDB). By reorganizing and
validating the PDB data, MMDB provides explicit descriptions of a biopolymer's
spatial structure, its chemical organization, and the linkage between the two.
With MMDB, there is a clear cross-referencing between the three-dimensional
structure and the chemistry of a macromolecule. Explicit linkages have been
made to the sequence entries within the ID database so that, with the
appropriate graphics software, users are able to view the 3-D structure of
proteins identified via text or similarity searching of the sequence database.
2.3. Sequence analysis and retrieval programs.
The
value of sequence databases depends not only on their role as repositories of
information but also on their properties as computable objects for information
retrieval and analysis. The volume of current sequence data is beyond any human
processing capacity even for simple tasks such as looking for closely related
sequences (protein or nucleic acid). Automated tools have been developed to
perform repetitive tasks, such as homology searching, in a fraction of the time
of manual processing and with greater accuracy, and to provide user friendly
graphical interfaces to analyse the long chains of proteins and nucleic acids.
Software aimed at speed and functionality falls into three categories: first, computer programs that perform similarity searches of protein and nucleic acid sequences against the sequence databases; second, DNA design, manipulation and drawing tools with graphical interfaces, including sequence alignment; and finally, software that combines sequence analysis with property definition (immunogenicity, hydrophobicity, restriction mapping, etc.) and structure prediction of proteins and nucleic acids. Frequently, developers combine tools from all three categories to produce software packages that automate many aspects of biological information handling.
2.3.1 Sequence
similarity searching software.
Several issues have to be addressed when performing similarity searches on databases. Programs have to be sensitive, to find all relevant matches; specific, to avoid irrelevant matches occurring by chance; and efficient, to perform the task in a reasonable timeframe (minutes rather than days), while taking into consideration the biological context (rules of assembly and observations) and the properties of the objects of analysis (nucleic acids or proteins).
Identifying sequence similarities requires comparing the query sequence against all possible positions in the target sequence, for every sequence in the database queried. Allowing for insertions or deletions in the sequence makes the combinatorial problem very complex. Combining dynamic programming techniques with hash table or neighbourhood searches has produced algorithms that accelerate the search while significantly improving its discriminatory ability.
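The hash table idea mentioned above can be illustrated with a minimal Python sketch. It is not the implementation of any particular program: the function names, the word length k and the example sequences are assumptions chosen for illustration. Every k-letter word of the target sequence is stored in a hash table, and the words of the query are then looked up to count how many word matches fall on each diagonal (target position minus query position). A diagonal with many hits marks a region worth passing to the slower dynamic programming step.

```python
from collections import defaultdict


def build_word_index(sequence, k):
    """Map every k-letter word in `sequence` to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k=3):
    """Count shared k-letter words on each diagonal (target start - query start)."""
    index = build_word_index(target, k)
    hits = defaultdict(int)
    for q in range(len(query) - k + 1):
        # .get() avoids creating empty entries for words absent from the target
        for t in index.get(query[q:q + k], []):
            hits[t - q] += 1
    return dict(hits)


# Hypothetical example sequences, chosen only to show the lookup
hits = diagonal_hits("GATTACA", "TTGATTACATT", k=3)
best_diagonal = max(hits, key=hits.get)
print(hits, best_diagonal)
```

In this toy example the query occurs in the target with an offset of two positions, so the diagonal with offset 2 collects the most word hits.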
The statistical significance of a similarity score relies heavily on the scoring scheme or mutation data matrix, a numerical evaluation of what is considered similar or dissimilar between amino acids or bases. The protein alphabet, having twenty characters, is more sensitive than the four-character alphabet of the nucleic acids, and nucleotide sequences are further affected by codon degeneracy. This implies that when translated DNA sequences are searched against protein sequences, matches are statistically more significant than in searches of non-translated DNA sequences.
The ultimate purpose of sequence database searching is the identification of macromolecules with related structures and functions. Protein structure and folding are closely related to the amino acid sequence. However, the similarity searching algorithms described here offer only symbolic pattern matching on the sequences, and protein molecules can have similar three-dimensional structures and yet non-homologous sequences, so limits on what counts as significant homology have to be considered. The region between 20 and 30% homology for amino acid sequences is the grey area where pattern matches may or may not correlate with structural and functional similarity.
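As a rough illustration of the grey area described above, the following Python sketch computes the percentage of identical residues in a pairwise alignment and places it relative to the 20-30% band. The function names, the convention of ignoring gapped columns and the wording of the labels are assumptions made for this example; conventions for the denominator differ between programs.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the ungapped columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def interpret(identity):
    """Place an identity value relative to the 20-30% grey area (illustrative labels)."""
    if identity > 30:
        return "above the grey area"
    if identity >= 20:
        return "grey area"
    return "below the grey area"


# Hypothetical aligned fragments; '-' marks a gap
identity = percent_identity("ACDE-FG", "ACDQYFG")
print(round(identity, 1), interpret(identity))
```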
Functional similarity may be the result of conserved patterns of structure that are not always close together in the linear representation of the protein sequence. Modern searching programs include profile matching methodology, where entire motifs can be searched in the databases. For their part, the databases now offer, through relational indexing, sufficient annotation and classification in the biological and motif context to facilitate more accurate scoring beyond symbolic pattern matching.
The two most popular programs for searching databases for sequence similarities are FASTA and BLAST.
FASTA performs a Pearson and Lipman [33] search for similarity between a query amino acid sequence and any group of amino acid sequences, which may either reside on the user's computer or belong to a database such as GENBANK or EMBL (figure 8). An extension of this program, TFASTA, searches the amino acid sequence against any group of nucleotide sequences. TFASTA translates the nucleotide sequences in all six frames before performing the comparison, and therefore also searches all "implied amino acid sequences".
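The six-frame translation step can be sketched in Python as follows. This is an illustration of the idea rather than TFASTA's own code: the function names are hypothetical, the standard genetic code is assumed, and any codon containing a character other than A, C, G or T is reported as 'X'.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' is a stop, 'X' an unrecognised codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical nine-base sequence
frames = six_frame_translations("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six "implied" protein sequences can then be compared with the query in the same way as an ordinary protein database entry.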
BLAST [34] (Basic Local Alignment Search Tool) is the heuristic search algorithm employed by the following programs:
o blastp, for comparison of an amino acid query sequence against a protein sequence database
o blastn, for comparison of a nucleotide query sequence against a nucleotide sequence database
o blastx, for comparison of the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database
o tblastn, for comparison of a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands)
o tblastx, for comparison of the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
These programs ascribe significance to their findings using the statistical methods of Karlin and Altschul with a few enhancements. The BLAST programs were tailored for sequence similarity searching, for example to identify homologs to a query sequence [35,36]. The programs are not generally useful for motif-style searching.
The fundamental unit of BLAST algorithm output is the High-scoring Segment Pair (HSP). An HSP consists of two sequence fragments of arbitrary but equal length whose alignment is locally maximal and for which the alignment score meets or exceeds a threshold or cutoff score.
The approach to similarity searching taken by the BLAST programs is first to look for similar segments (HSPs) between the query sequence and a database sequence, then to evaluate the statistical significance of any matches that were found, and finally to report only those matches that satisfy a user-selectable threshold of significance.
The default scoring matrix used by blastp, blastx, tblastn, and tblastx is the BLOSUM62 matrix [37]. Several PAM (point accepted mutations per 100 residues) amino acid scoring matrices are provided in the BLAST software distribution, including PAM40, PAM120, and PAM250. While BLOSUM62 is a good general-purpose scoring matrix and is the default used by the BLAST programs, if one is restricted to using only PAM scoring matrices, then PAM120 is recommended for general protein similarity searches [38].
Regardless of the scoring scheme employed, two stringent criteria must be met in order to be able to calculate the Karlin-Altschul parameters. First, given the residue composition of the query sequence and the residue composition assumed for the database, the alignment score expected for any randomly selected pair of residues (one from the query sequence and one from the database) must be negative. Second, given the sequence residue compositions and the scoring scheme, a positive score must be possible to achieve.
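The two criteria can be checked directly for a given scoring scheme. The sketch below is a minimal illustration in Python, not part of the BLAST distribution: the function names are hypothetical, and the nucleotide scores (+5 for a match, -4 for a mismatch) and the uniform base composition are toy values assumed for the example.

```python
from itertools import product


def expected_pair_score(query_freqs, db_freqs, score):
    """Expected score of one randomly chosen residue pair (query x database)."""
    return sum(query_freqs[a] * db_freqs[b] * score(a, b)
               for a, b in product(query_freqs, db_freqs))


def positive_score_possible(query_freqs, db_freqs, score):
    """True if at least one residue pair that can occur scores above zero."""
    return any(score(a, b) > 0
               for a, b in product(query_freqs, db_freqs)
               if query_freqs[a] > 0 and db_freqs[b] > 0)


def satisfies_criteria(query_freqs, db_freqs, score):
    """Both prerequisites: negative expected score and an attainable positive score."""
    return (expected_pair_score(query_freqs, db_freqs, score) < 0
            and positive_score_possible(query_freqs, db_freqs, score))


def toy_score(a, b):
    """Toy nucleotide scheme assumed for illustration: +5 match, -4 mismatch."""
    return 5 if a == b else -4


uniform = {base: 0.25 for base in "ACGT"}
expected = expected_pair_score(uniform, uniform, toy_score)
print(expected, satisfies_criteria(uniform, uniform, toy_score))
```

With uniform base frequencies the toy scheme gives an expected score of 4 x (1/16) x 5 + 12 x (1/16) x (-4) = -1.75 per pair, so both criteria are met. A scheme with a much milder mismatch penalty, such as +5/-1, has a positive expected score and fails the first criterion.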
BLITZ is an automatic electronic mail server for the MPsrch program. MPsrch performs sensitive and extremely fast comparisons of protein sequences against the Swiss-Prot protein sequence database using the Smith and Waterman best local similarity algorithm [39]. It runs on the MasPar family of massively parallel computers. A typical search with a query sequence of 400 amino acids against the entire Swiss-Prot release 23 takes approximately 40 seconds. Additional time is required to reconstruct the alignments, depending on the number of alignments requested. The Smith and Waterman algorithm looks for the best "local" match (short matching regions, such as binding sites, in the middle of long sequences). The match is scored using the amino acid similarity weights assigned to the different possible pairs of aligned residues, known as PAM matrices, and a penalty that is subtracted from the alignment score for every residue that has been inserted or deleted in the best local alignment.
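A compact Python sketch of the Smith and Waterman recurrence may help to make the description concrete. It is a serial illustration only and does not reflect how MPsrch is implemented on parallel hardware. A simple match/mismatch function stands in for a PAM matrix, a fixed penalty is charged per inserted or deleted residue, and the function names and example sequences are assumptions made for the example.

```python
def smith_waterman(a, b, score, gap_penalty):
    """Best local alignment of a and b with a fixed penalty per gapped residue."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,                                       # start a new local alignment
                          H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                          H[i - 1][j] - gap_penalty,               # gap in b
                          H[i][j - 1] - gap_penalty)               # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap_penalty:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


def simple_score(x, y):
    """Toy weights standing in for a PAM matrix: +2 identical, -1 different."""
    return 2 if x == y else -1


# Hypothetical sequences sharing a short region inside unrelated flanks
result = smith_waterman("TTTGATTACATTT", "CCGATTACACC", simple_score, gap_penalty=2)
print(result)
```

The shared region GATTACA is recovered from the middle of both sequences with a score of 14 (seven identities at +2 each), while the unrelated flanking residues are left out of the alignment.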
2.3.2 Multiple
sequence alignments (MSA).
Once a sequence related to the query sequence is identified, several other significant matches are likely to be found. The multiple alignment of a group of sequences or a family of proteins provides more information than the alignment of any pair of sequences. MSA is a generalisation of pairwise sequence alignment, but the complexity of the calculation grows exponentially with the number of sequences considered. An approximate algorithm uses a trial multiple sequence alignment as the basis for realigning each of the component sequences iteratively, until each is optimally aligned. Other approaches incorporate chemical and biological knowledge into the multiple alignment, thus greatly improving the reliability of the results.
The most popular software for multiple alignment is CLUSTAL [40], an automated program that also provides generation of phylogenetic trees and profile alignment (figure 9). Other programs allow the user to define, lock and shift different portions of the sequence data interactively, with the use of computer graphics, and thus to guide the alignment process.
Profile similarity searching is closely linked to multiple alignments. A profile, representing a group of aligned sequences, or a sequence motif such as those defined in the PROSITE Dictionary of Protein Sites [25], is used as a probe to search a database for new sequences with similarity to the profile or the motif. Several profile generation and searching programs exist and are usually linked to the major databases or the multiple alignment software.
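The following Python sketch shows the principle of using a profile as a probe. It is deliberately simplified: the profile holds raw residue counts per alignment column, whereas real profile programs use weighted or log-odds scores and allow gaps. The function names, the aligned fragments and the target sequence are hypothetical.

```python
from collections import Counter


def build_profile(aligned_sequences):
    """One Counter of residue counts per column of an ungapped alignment."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must have equal length")
    return [Counter(seq[i] for seq in aligned_sequences) for i in range(length)]


def best_profile_match(profile, sequence):
    """Slide the profile along the sequence; return (best score, start index)."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        # A residue never seen in a column contributes 0 (Counter default)
        score = sum(profile[i][sequence[start + i]] for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Hypothetical aligned motif instances and a hypothetical database sequence
profile = build_profile(["GKST", "GKTT", "GRST"])
match = best_profile_match(profile, "MAGKSTLV")
print(match)
```

The window starting at index 2 (GKST) agrees with the most common residue in every column and therefore scores highest.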
2.3.3. Sequence
analysis utilities.
A
database search or a multiple sequence alignment is only the first step in the sequence analysis process. Several
graphics based programs offer numerous tools for efficient organisation,
manipulation and indexed storing of sequence information. Powerful display
windows allow easy editing and analysis through algorithms simulating molecular
biological processes and techniques.
In analysing nucleic acid sequences, features include drawing and manipulation of plasmid, circular and linear restriction maps (figure 10), restriction enzyme selection and tabling, RNA folding and ORF analysis [16]. Some specialised programs also offer a gel electrophoresis pattern display [41].
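The core of restriction mapping, locating recognition sites in a linear or circular sequence, can be sketched in a few lines of Python. This is an illustration only and not the method of any package cited here: the function name is hypothetical, only three well-known enzymes are included, and positions refer to the first base of the recognition site rather than to the cut position.

```python
# Recognition sequences of three commonly used enzymes
RECOGNITION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_sites(dna, circular=False):
    """Return 1-based start positions of each recognition site in the sequence."""
    dna = dna.upper()
    result = {}
    for enzyme, site in RECOGNITION_SITES.items():
        # For a circular molecule, append the first bases so that sites
        # spanning the origin are also found
        searchable = dna + dna[:len(site) - 1] if circular else dna
        positions = []
        start = searchable.find(site)
        while start != -1:
            positions.append(start + 1)
            start = searchable.find(site, start + 1)
        result[enzyme] = positions
    return result


# Hypothetical sequences
linear_map = find_sites("AAGAATTCGGGGATCCTT")
plasmid_map = find_sites("ATTCAAAAGA", circular=True)
print(linear_map)
print(plasmid_map)
```

In the second example the EcoRI site spans the origin of the circular sequence, so it is found only when the molecule is treated as circular.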
For protein sequences, profile alignment is followed by secondary structure prediction, hydrophobicity and immunogenicity profiles, helical wheels, amino acid content and molecular weight determination, phylogenetic trees and numerous others [42].
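Two of these utilities, the hydrophobicity profile and the amino acid content, are simple enough to sketch in Python. The example below assumes the Kyte-Doolittle hydropathy scale and a sliding window of seven residues; the packages cited in the text may use other scales and window sizes, and the function names and the example sequence are hypothetical.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values (positive = hydrophobic)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=7):
    """Average hydropathy of every window of consecutive residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def composition(protein):
    """Number of occurrences of each amino acid."""
    return Counter(protein.upper())


# Hypothetical sequence fragment
fragment = "MKTLLILAVVAAALA"
profile = hydropathy_profile(fragment, window=7)
counts = composition(fragment)
print([round(value, 2) for value in profile])
print(counts.most_common(3))
```

Plotting the profile values against the window position gives the familiar hydropathy plot, in which sustained positive stretches suggest buried or membrane-spanning regions.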
Editing
facilities incorporate fragment assembly, contig and translation and reverse
translation editors, consensus sequence generation, publication tools and
input/output format handling [43].
At the time of writing, more than twenty commercial and hundreds of public domain computer programs are available for genomic research. The choice is enormous in terms of the graphics tools offered for manipulating, storing and organising genetic data, the efficiency and accuracy of the implemented algorithms, and convenience and ease of use. The most popular packages are those that take care of the details efficiently and allow users to follow their own thought process. Interconnectivity with standard experimental and database formats, and modularity combined with wide functionality, help in this respect. Whole databases describing the available software exist and are perhaps the best starting point for the novice user (Table 2).
3.1. The Internet.
The Internet is a global communication system that, through the use of computer networks, connects millions of users and provides access to enormous amounts of information and services. Its infrastructure is based on fast telecommunication lines transferring digital data in the form of electronic packets labelled with a specific destination. These packets are forwarded individually by adjacent computers on the network, acting as routers, and are reassembled in their original form at their destination. Packet switching allows multiple users to send information across the network both efficiently and simultaneously. Half of the networks on the Internet are commercial and one third are associated with educational and research institutions.
For most scientists the Internet started as a tool for scientific collaboration through facilities such as electronic mail, file transfer (ftp), direct connection to remote computers (telnet) and, more recently, the World Wide Web (WWW). The WWW is a hypertext-based Internet service providing information and resources through specialised software, the browsers, which use globally agreed communication protocols. The browsers offer the graphical user interface (based on the HTML language) and the communication tools (protocols) for a computer (the client) to link to another computer (the server) and exchange information across the network. The information may be in the form of text, numerical data, image, sound, video or anything else that can be represented as a string of numbers.
Since the power of the local computer (the client) can also be used to process the raw information locally, a new approach has more recently been developed based on a Web programming language (Java), in which some of the software (the applets) is downloaded to the client and used to process the downloaded information. This approach resolves the data incompatibility problem and also allows the development of sophisticated display and searching tools.
For
the biotechnologist the parallel growth of bioinformatics and network services
has created a revolution in the way of thinking, extracting and processing the
vast amounts of biological information available. The World Wide Web enables
scientists working on virtually any computer to access remote databases and run
queries and still receive their results in minutes.
The researcher can remain up to date with the literature, by browsing recent journals or using the commercial or public services offered, without leaving the laboratory. Before investing time in designing experiments, the researcher can query the nightly updated databases on line to identify clues about the DNA sequence, the protein structure, the evolutionary pattern (and countless other properties) of interest.
The list of services for the molecular biologist on the network cannot be described in a few pages. The preferable way to explore them is through network browsing, starting from one of the several service providers of biological information such as those given in Table 2. Instead, the network use of the previously described databases will be highlighted here, as well as the use of some biocomputing tools that are found and used on the network.
3.2. Information
retrieval through the World Wide Web.
3.2.1. Databases.
The explosive growth of the DNA databases is causing serious problems for those wishing to use them on microcomputers. The current GENBANK release barely fits on 5 CDs (each of 600 Mbytes), and that is only for the raw data. The release is distributed quarterly, while at its home site (NIH), as with all major databases, it is updated nightly. Searching a database across the network on a large service provider can take just seconds, and queries can often be processed and the results received back within minutes.
Most of the DNA and protein databases described in section 2.2 can be searched on the World Wide Web (WWW) through the home page of the resident site or other mirror sites all over the world by using a general browser, such as Netscape [44], Mosaic [45] or Microsoft Internet Explorer [46]. The web server provides a general hypertext form in which the user can choose the databases or subsets to be used, set the parameters for searching, and import the query sequence or text. The WWW addresses of such service providers are given in Table 3.
Furthermore
through special server/client implementations databases can be searched by a
client resident program combining the full functionality of specific interfaces
with access to an extended and frequently updated group of databases.
The Entrez interface to GENBANK has recently been upgraded to a client/server application for searching the sequence database and extended to include Medline bibliographic citations and the MMDB structure database. Queries are built from simple boolean operations, and hypertext links are followed using a point-and-click graphical interface. Links can be simple cross-references between a sequence and the abstract of the paper in which it was reported, between a protein sequence and its corresponding DNA sequence, or precomputed similarities among sequences or textual documents. The sequences can be displayed graphically in order to visualise complex annotations such as segmented sequences or alternative splicing in coding regions.
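The boolean style of query can be illustrated with a small in-memory index in Python. This sketch does not use or reproduce the Entrez software; the record identifiers, the record texts and the function name are hypothetical, and the boolean operators AND, NOT and OR are mapped onto set intersection, difference and union.

```python
def build_index(records):
    """Map each word to the set of record identifiers whose text contains it."""
    index = {}
    for record_id, text in records.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(record_id)
    return index


# Hypothetical records standing in for database entries
records = {
    "REC1": "human insulin precursor",
    "REC2": "mouse insulin receptor",
    "REC3": "human keratin type I",
}
index = build_index(records)

human_and_insulin = index["human"] & index["insulin"]     # AND
insulin_not_human = index["insulin"] - index["human"]     # NOT
keratin_or_receptor = index["keratin"] | index["receptor"]  # OR
print(human_and_insulin, insulin_not_human, keratin_or_receptor)
```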
With the development of the graphical CN3D interface [47], finding a DNA/protein structure or identifying similarities in Entrez is just like any other search. A query can contain specific fields such as author names or text terms, extend to sequence neighbours (homologues) and then ask whether structural data are available for any of the members of the identified family. The CN3D interface to the MMDB database also provides versatile visualisation of protein and DNA architecture as well as functionally important features (binding sites, ligands, cofactors) without the user having to know how to extract this information from the Cartesian coordinates in which the structure is expressed.
The major service providers (NCBI, EBI, DDBJ) also offer client programs for executing similarity queries (BLAST, FASTA) directly over the Internet. Users on the Internet can use the file transfer protocol (FTP) facilities to download the client applications, the entire database releases or the daily updates.
Users
with access to electronic mail only can search the databases by sending a mail
message containing specific keywords or boolean combinations of text.
Electronic mail services are also provided for sequence similarity searching
where the query sequence is submitted by e-mail and the results are
electronically posted back to the sender.
3.2.2 Biocomputing
through the Web.
Beyond searching the databases through the Internet, one can make use of the dedicated servers in big biotechnology centers (EBI, NCBI, DDBJ etc.) to perform calculations on line other than querying databases or comparing sequences (Table 3). For example, "The Protein Machine" [48] will take a DNA sequence and try to translate it into a protein sequence, while on the "UGCG" centres [46] a full analysis of antigenicity, hydrophobicity or folding can be performed on line and the results sent by electronic mail to be printed on the user's printer.
Another biocomputing tool, Darwin [49], offers a single, integrated environment for practical and experimental computational biochemistry. This includes loading and retrieval of sequences from sequence databases, fast searching for sequence fragments, repetitions and frequent patterns, sequence alignment, and PAM distance and variance estimation based on Dayhoff-like matrices. It can also generate random sequences and mutations, calculate phylogenetic trees and multiple alignments, and plot curves, trees and histograms. It includes numeric functions, matrix and vector arithmetic, general purpose functions for searching and sorting, and input and output functions to store and load results and to communicate with other programs. It is also expandable, since it contains its own programming language that allows automation of repetitive tasks and addition of new functionality.
In addition, most sites offer a very comprehensive catalog and store of biocomputing tools, either commercial software or from the public domain, that can be downloaded and executed on the client. The software is cataloged according to machine or operating system and according to category of biocomputing tool. For example, EBI (EMBL's biocomputing site) divides the software in its BioCatalog into domains that concern DNA, Proteins, Alignments, Genetics, Mapping, Molecular Evolution, Molecular Graphics, Databases etc., while each domain has subdomains for further classification (e.g. Genetics is further divided into contig assembly, genetic and physical mapping, genome mapping, oligomer design and synthesis, and restriction maps). A list of sites that support this service, with their Internet addresses, is given in Table 3.
REFERENCES
1. Sanger F., Nicklen S., Coulson A.R.
(1977) PNAS USA 74,5463.
2. Cherry J.M. (1993) Current Protocols
in Molecular Biology (ed. F.M.Ausubel et al.) pp 7.7.1-7.7.31.
3. Slatko B.E., Albright L.M. and Tabor
S., (1993) Current Protocols in Molecular Biology pp 7.4.1-7.4.27.
4. Prober J. et al. (1987) Science 238,
336.
5. Science (1996), 272, 1509-1552.
6. Bioimage by B.I.Systems Corp. (info@Bioimage.com).
7. Lasergene , DNASTAR Inc.
(www.dnastar.com).
8. Maynard P.E. et al. (1992) Applied
and Theor. Electr. 3, 1-11
9. Sequence Assembler, A computer
program by Applied Biosystems Inc. (1994) (www.perkin-elmer/ab/)
10. Sequence Navigator, A computer
program by Applied Biosystems Inc.
(1994). (www.perkin-elmer/ab/)
11. Lee L.G. (1993) Nucleic Acid
Research 21, 3761-3766.
12. Mcbride L.J. et al. (1989)
Clin.Chem 35: 2196-2201.
13. Genescan, A computer program by Applied Biosystems Inc.
(1994).
14. Fries A. Eggen A. and Stranzinger
G.. (1990) Genomics 8, 403-406.
15. Genotyper, A computer program by
Applied Biosystems Inc. (1994).
16. DNASIS, Hitachi Software
(www.hitsoft.com/gs/dnasis/index.html/)
17. Gene Jockey II, Biosoft
(www.biosoft.com/biosoft/).
18. Primer-Master (1993) A computer
program by Proutski V and Sokur O.
19. Williams N. (1997) Science 275, 301-302.
20. EMBL
21. Benson, D., Boguski, M., Lipman, D.J., and Ostell, J. (1994) Nucleic Acids Research, 22, 3441-3444.
22. Bairoch A., and Apweiler R. (1997) The SWISS-PROT protein sequence data bank and its supplement TREMBL. Nucleic Acids Res. 25:31-36.
23. Bernstein F.C. et al. (1977)
J.Mol.Biol. 112, 535-542.
24. The X-ray crystallography Protein Data Bank (PDB). Abola E.E., Bernstein F.C. and Koetzle T.F. (1988) In "Computational molecular biology. Sources and methods for sequence analysis", Lesk A.M., Ed., pp 69-81, Oxford University Press, Oxford.
25. The PROSITE dictionary of sites and
patterns in proteins Bairoch A., Bucher P. and Hofmann K.; Nucleic Acids
Res. 24:189-196(1996).
26. Bishop et al (1987) Nucleic acid and protein sequence analysis:
A practical approach pp83-146.
27. Boguski, M.S. (1995) Trends in
Biochemical Sciences, 20, 295-296.
28. Schuler, G.D., Epstein, J.A.,
Ohkawa, H and Kans, J.A. (1996) Methods Enzymol. 266, 141-162.
29. Hogue,C.W.V., Ohkawa, H., and
Bryant, S.H. (1996) Trends Biochem.Sci. 21, 226-229.
30. dbEST: sequence and mapping data on "single-pass" cDNA sequences or Expressed Sequence Tags (1993) Nature Genetics 4, 332-333.
31. Boguski, M.S., Tolstoshev, C. M. and Bassett, D. E. (1994) Science, 265, 1993-1994.
32. dbSTS: sequence and mapping data on short genomic landmark sequences or Sequence Tagged Sites (1989) Science 245, 1434-1435.
33. Pearson W.R., Lipman D.J. (1985) PNAS USA , 2444-2448.
34. Altschul, S.F., Boguski, M.S., Gish, W., and Wootton, J.C. (1994) Nature Genetics, 6, 119-129.
35. Bassett, D.E., Boguski, M.S.,
Spencer, F., Reeves, R. Goebl, M., and
Hieter, P. (1995) Trends in Genetics, 11, 372-373.
36. Madden, T.L., Tatusov, R.L. and
Zhang, (1996) Methods in Enzymology, 266,131-141.
37. Henikoff S.and Henikoff J.G. (1991)
Nucleic Acid Res. 19, 6565-6572.
38. Altschul, S.F. et al. (1992) J.Mol.Biol. 215, 403-410.
39. Smith T.F. and Waterman M.S. (1981) J.Mol.Biol. 147, 195-197.
40. Higgins, D.G. and Sharp, P.M.
(1988) CLUSTAL: a package for
performing multiple sequence alignments
on a microcomputer. Gene
73, 237-244.
41. SCAN DNASIS Hitachi Software
(www.hitsoft.com/gs/scandna/index.html/)
42. Genetics Computer Group, Wisconsin USA (www.gcg.com).
43. Vector NTI, InforMax Inc. (www.informaxinc.com/vectornti.html/).
44. Netscape, a commercial Web viewer (http://home.netscape.com/).
45. Mosaic, a Web viewer, product of the US National Center for Supercomputing Applications (NCSA) (www.ncsa.uiuc.edu/SGD/Software/Mosaic/).
46. Internet Explorer, a product of Microsoft (www.microsoft.com).
47. Hogue C.W.V. (1997) Cn3D: a new generation of three-dimensional molecular structure viewer. Trends in Biochem. Sci. 22, 313-315.
48. The Protein Machine (http://www.ebi.ac.uk/contrib/tommaso/translate.html).
49. Darwin, a Computational Biochemistry Research Tool (http://cbrg.inf.ethz.ch/section3_3.html).
Table 1
Internet addresses of biological Databases
NCBI access to GenBank, Entrez and BankIt http://www.ncbi.nlm.nih.gov/
Histocompatibility Complex (MHC) in humans (referred to as the Human Leucocyte Antigens (HLA) system) http://www.ebi.ac.uk/imgt/
The EMBL database http://www.embl-heidelberg.de
EBI European Bioinformatics institute http://www.ebi.ac.uk/
Mitbase - http://www.uq.oz.au/nanoworld/
The Macromolecular Structure DataBase at EBI http://www2.ebi.ac.uk/msd/
EBI Mirror of the Protein Data Bank. http://www2.ebi.ac.uk/pdb/
SWISS-PROT (Amos Bairoch) http://expasy.hcuge.ch/
ReBase Restriction Enzyme Database http://www.neb.com/rebase
EMBL Nucleotide Sequence Database http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/topembl.html
dbEST: A division of GenBank for cDNA
sequence and mapping data http://www.ncbi.nlm.nih.gov/dbEST/index.html
Genome Data Base (GDB) http://gdbwww.gdb.org/gdbhome.html/
TreeBASE: a database of phylogenetic knowledge http://herbaria.harvard.edu/treebase/
Table 2
Internet addresses of Bioinformatics Sites
EBI European Bioinformatics Institute http://www.ebi.ac.uk/
European Mol. Biology Lab. (EMBL) http://www.embl.de/
John Hopkins Bio-Informatics Home Page http://www.gdb.org:80/hopkins.html/
Australian National University Bioinformatics http://life.anu.edu.au:80/
Bioinformatics Resource Index (Univ. of Manchester) http://mbisg2.sbc.man.ac.uk:80/gradschool/bioinf/brass95.html/
National Institutes of Health, Molecular Biology gopher://saavik.niehs.nih.gov/
Cold Spring Harbor Laboratory Information Online http://www.cshl.org:80/
Institute for Molecular Virology, Univ. of Wisconsin http://www.bocklabs.wisc.edu:80/Welcome.html/
Internet for the molecular biologist http://www.apollo.co.uk/a/horizon/#vol3
Biotechnology Software & Internet Journal http://www.orst.edu/~ahernk/bsj.html
Books for Molecular Biology http://www.apollo.co.uk/a/horizon/
Molecular Biology Jump Station http://www.ifrn.bbsrc.ac.uk/gm/lab/docs/molbiol.html/
NCBI access to GenBank, Entrez and BankIt http://www.ncbi.nlm.nih.gov/
Pedro's Biomolecular Research Tools http://www.fmi.ch/biology/research tools.html
Cell and Molecular Biology Online: educational and teaching resources http://www.tiac.net/users/pmgannon/teaching.html
BioSupplyNet: a database of materials for biomedical research http://www.biosupplynet.com/
BIOSCI/Bionet newsgroups http://schmidel.com/bionet.htm
ExPASy: a molecular biology server dedicated to the analysis of protein and nucleic acid sequences http://expasy.hcuge.ch/
A Biologist's Guide to the Internet http://www.csc.fi/molbio/una/unopost.index.html
Table 3
A list of biocomputing services with their internet
addresses.
Bio-wURLd is a searchable, user-maintained collection of URLs related to bioinformatics,
biochemistry, and molecular biology. http://www.ebi.ac.uk/htbin/bwurld.pl
The Protein Machine: a tool for translating from DNA to protein which offers a
selection of translation tables and reading frames and allows you to select DNA
from an EMBL Nucleotide database entry. http://www.ebi.ac.uk/contrib/tommaso/translate.html
The Protein Colourer: a tool for colouring your amino acid sequences. Allows you
to input a sequence or select one from SwissProt. http://www.ebi.ac.uk/htbin/visprot.pl
The Alignment Colour Viewer a tool for
colouring 2 aligned http://www.ebi.ac.uk/htbin/visalign.pl
Darwin, a Computational Biochemistry Research Tool http://cbrg.inf.ethz.ch/section3_3.html
AACompIdent allows the identification of a protein from its amino acid
composition. It searches SWISS-PROT for proteins whose amino acid compositions
are closest to the amino acid composition given. http://expasy.hcuge.ch/ch2d/aacompi.html
AACompSim is a tool which allows the comparison of the amino acid composition of a SWISS-PROT entry with all other
SWISS-PROT entries so as to find the proteins whose amino acid compositions are closest to that of the
selected entry. http://expasy.hcuge.ch/ch2d/aacsim.html
GuessProt is a tool which allows the retrieval of the SWISS-PROT entries closest
to a given pI and Mw. http://expasy.hcuge.ch/www/guess-prot.html
Compute pI/Mw is a tool which allows the computation of the theoretical pI (isoelectric point) and Mw (molecular weight) for a given protein stored in SWISS-PROT, for a list of SWISS-PROT entries, or for a user-entered sequence. http://expasy.hcuge.ch/ch2d/pi_tool.html
ScanProsite is a tool which allows you either to scan a protein sequence - from
SWISS-PROT or provided by the user - for the occurrence of patterns stored in
the PROSITE database, or to scan the SWISS-PROT database - including weekly
releases - for the occurrence of a pattern that can originate from PROSITE or be
provided by the user. http://expasy.hcuge.ch/sprot/scnpsite.html
Translate is a tool which allows the translation of a nucleotide
(DNA/RNA) sequence to a protein sequence. http://expasy.hcuge.ch/www/dna.html
MIPS ALERT: Every day, new protein sequences are added to
the protein and nucleic acid databases. The Alert utility is designed to keep
you abreast of these changes by sending you once per week, via email, the new database entries related to your
field of interest. http://www.mips.biochem.mpg.de
CPROP Maps of Human Chromosomes: CPROP
is an experimental program for doing map construction and integration. It is
based on AI methods of reasoning with constraints. Information about distances
and orders of loci derived from experiments and/or other maps is represented
by constraints, and an inference process propagates these constraints around
the map in an attempt to reduce uncertainty. A version of CPROP was described
in S. Letovsky, M. Berlyn. Genomics 1992 12:3 pp. 435-446. http://gdbdoc.gdb.org/letovsky/cprop/human/maps.html
Bionet news filter: Here you can search all the available
articles in 3 selected bionet newsgroups and retrieve those containing the
regular expression you specified. http://base.icgeb.trieste.it/cgi-bin/nfilter.pl
FU Berlin - List of Amino Acids: 3D molecular model, CAS Registry Number, isoelectric point,
symbols, formula, structures etc. for each amino acid. http://www.chemie.fu-berlin.de/chemistry/bio/amino-acids.html
Atlas of Protein Side-Chain Interactions: This atlas depicts how amino acid side-chains pack
against one another within the known protein structures. This packing, which is governed by the
interactions between the 20 different types of side-chains, determines the structure, function, and
stability of proteins. http://www.biochem.ucl.ac.uk/bsm/sidechains/index.html
The Dictionary of Cell Biology is intended to provide quick access to easily understood
and cross-referenced definitions of terms frequently encountered in reading the modern biology
literature. This server contains the text of the second edition, published in April 1995, together with
enhancements, hypertext links and new entries
which are destined for the third edition. http://www.mblab.gla.ac.uk/~julian/Dict.html
Promega Vector Sequences and Technical Literature: Technical Bulletins and Manuals
supporting Promega's products are being added to our site based on the frequency with which they
are requested by customers. Sequences of all our vectors are available for on-line viewing.
http://www.promega.com/techdoc.html
BioSupplyNet: This new site is an online directory of
biomedical/life science research products and services and allows searching in a
database of more than 15,000 products from 1,400 suppliers. The site is updated
weekly and allows immediate access to suppliers via email for technical
support and ordering information. http://www.biosupplynet.com/bsn/
"Universal"
Genetic Code as HTML table http://golgi.harvard.edu/gencode.html
EBI Bio-Filter allows you to select five bionet newsgroups
and search through them for a keyword
of your choice. http://www.ebi.ac.uk/cgi-bin/nfilter.pl
Webcutter: an excellent on-line tool for restriction mapping of nucleotide
sequences. http://www.medkem.gu.se/cutter
ABIM (Université Aix-Marseille I : France) Biological servers on the Web : Databases (sequences, organisms, collections), On line analysis tools, Guides and Tutorials in Biology
http://www-biol.univ-mrs.fr/english/biology.html
Biological Data Transport is an integrated resource for the life
sciences community. Search GenBank, Medline, Entrez, OMIM, The Biotech
Registry, vendor products and services,
and other great resources simultaneously! We add content and functionality
daily! http://www.data-transport.com
The Biotech BiblioNet is a monthly bibliography of
recently published biotechnology
articles. http://www.azc.com/client/sage/biotech/biotech.htm
PCR Jump Station: The ultimate site for information and
links on the Polymerase Chain Reaction (PCR). http://apollo.co.uk/a/pcr
Science Guide: This Web site consists of a number of different
sections designed to help the scientist
and physician find information on the internet and to sponsor communication between those interested in science. http://www.scienceguide.com
Legends to Figures
Figure 1.
Scanned gel autoradiography image displayed using the program SCAN DNASIS.
Figure 2.
Semiautomatic DNA sequencing of a scanned sequencing gel using the program
SCAN DNASIS. Each column contains the autoradiography image of the DNA
fragments chain-terminated by a specific dideoxynucleotide.
Figure 3. Electropherogram of a DNA fragment
sequenced automatically using fluorescence detection on an ABI Prism Sequence
Analysis Instrument.
Figure 4. Genome mapping using fluorescence
dye technology. Display of different size PCR fragments.
Figure 5. Microsatellite analysis using
fluorescence dye technology.
Figure 6. Primer pairs located using the
program Lasergene.
Figure 7. Display of binding energies of
located primers (from Lasergene).
Figure 8. Program output of a FASTA search of
a protein sequence on the SWISS-PROT database.
Figure 9. Program output of a CLUSTAL protein
sequence multiple alignment.
Figure 10. Display of a circular restriction
map of plasmid M13mp11. Different windows show a general view of the plasmid, the
restriction enzyme sites found and the position of the specific restriction
enzyme sites on the plasmid sequence.