Computers in DNA Technology
    

An Introduction to the software tools and bioinformatics techniques available to the biotechnology scientist.

 

 

 

By

Elias Eliopoulos

Genetics Laboratory,

Department of Agricultural Biotechnology,

The Agricultural University of Athens,

GREECE.



Contents

Introduction

1.            Computers in Experimental DNA Research.

1.1.  DNA Sequencing Software.

1.2. Genome Mapping

1.3. Primer Design Software

1.3.1 Features

1.3.2 Criteria for primer and primer pair formation

2.            Information Retrieval and Analysis.

     2.1. Making Sense of Sequences

     2.2. The Sequence Databases.

      2.2.1 The EMBL Nucleotide Sequence Database

      2.2.2 A Brief Outline of DDBJ

      2.2.3 The GenBank DNA Sequence Database

      2.2.4 The Brookhaven Protein Structure Databank

      2.2.5 The SWISS-PROT Protein Sequence Databank

      2.2.6 Some Meta databases

2.3. Sequence analysis  and retrieval programs

2.3.1 Sequence Similarity Searching Software

2.3.2 Multiple Sequence Alignments

2.3.3 Sequence Analysis Utilities

3.      The Internet.

3.1. Bioinformatics and the Internet.

         3.2. Information Retrieval Through the World Wide Web

3.2.1 Databases.

3.2.2 Biocomputing through the Web.


 

Introduction

Scientists working in the field of gene discovery often find themselves having to move from their chemistry bench to their personal computer or workstation in order to design their DNA experiments, execute them, process, analyze and understand their experimental results, communicate them to the scientific community, and use the combined knowledge to redefine the goals of their research.

This chapter highlights various tools and techniques that enable DNA researchers to extract and manipulate genomic information through the use of information technology. The chapter is divided into three parts.

Part 1 deals with the computer as an indispensable experimental tool for controlling experiments and for collecting and analyzing raw data. In particular, it covers automatic and semi-automatic DNA sequencing, genome mapping and primer design.

Part 2 is concerned with information retrieval at the level of DNA and protein sequences, and also touches on text retrieval and structure analysis. A new fragment of DNA off the sequencer would be just another string of A, T, C and G, the four nucleotides, if it could not be correlated with the billions of already known and characterized fragments that are produced, analysed and documented at an increasing pace, now at a million base pairs a day. Making sense of a nucleotide sequence is only the first step in the field of molecular biotechnology, where understanding structure and function is the ultimate goal. Through sequence databases, sequence analysis programs and modelling software the scientist finds ways of making the most of the experimental results and enhances the understanding of biological and molecular processes.

Part 3 describes the links between Bioinformatics and the Internet, two of computer science’s boom areas, through network computational tools and services for making sense of new DNA and protein sequences. The major databases of DNA and protein data are described, as well as the bioinformatics centers offering services for organizing and analyzing biological data.


 

1. Computers in Experimental DNA Research.

 

1.1       DNA Sequencing Software.

 

Sequencing, whether of protein or DNA, has always been a cumbersome, repetitive procedure requiring long experimental times and great attention to the details of the raw data, whether a spectrometer output or a gel image [1]. From the early days of conventional protein sequencing the instrument has been microprocessor controlled, and the detector data were fed to a microcomputer that both controlled the instrument through predesigned software modules and analysed the data. In early DNA sequencing the computer was introduced to perform the image analysis of the autoradiography picture of the sequencing gel (figure 1) [2]. The arrival of automated sequencers revolutionized the field of bioinformatics by enabling biotechnologists to catalogue sequence information hundreds of times faster than was possible with preexisting scanning techniques.

 

Today two major techniques are used for sequencing: manual sequencing using radioactive isotopes [3] and film exposure, and automated sequencing using fluorescent dyes [4].

 

The manual or semi-automatic approach involves running a gel with radioactive isotope labels, exposing it to a film, and then scanning the film and processing the information from it. Autoradiography is very difficult to automate, largely because the reaction products have to be run in four different gel lanes, and lane-to-lane variations in electrophoretic mobility are difficult to interpret. Advances in computerised image processing have helped to automate the last step. Once the gel data have been digitised using a hand-held or flatbed scanner, many programs are available to help interpret them as a string of bases. The process first involves aligning the gel lanes using a high resolution graphical interface on the computer, in order to accommodate distortions of the gel and other geometrical artifacts (figure 2). Some programs [5,6] can detect the lanes in groups of four on the electropherograms automatically and proceed to resolve ambiguities, call bases, align sequences and assemble contigs. Others [7] just provide the tools (rulers, colours, background processing or geometrical distortion correction), leaving the actual choice of base to the user. One invaluable facility that all the programs share is that the base sequence is automatically recorded and stored in the common sequence formats (PIR, FASTA) for later use in other programs, sparing the user the cumbersome and error-prone process of recording the sequence, typing it and checking it. Assembly managers of this type are claimed to handle 1000 individual sequences and to build a contig of up to 50 kilobases in a couple of minutes.
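
As an illustration of the automatic recording step, the following minimal sketch writes a called sequence in FASTA format: a header line starting with ">" followed by the sequence wrapped to a fixed width. The function name to_fasta and the default line width of 60 are assumptions made for this example and do not come from any of the programs cited.

```python
def to_fasta(name, sequence, width=60):
    """Format one sequence as FASTA text.

    name     -- identifier placed on the '>' header line
    sequence -- the called bases as a plain string
    width    -- residues per line (60 is a common but arbitrary choice)
    """
    lines = [">" + name]
    # Break the sequence into fixed-width lines.
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical read name and bases, for demonstration only.
    print(to_fasta("read1", "ACGTACGTAC", width=4), end="")
```

Storing reads this way lets the same file be passed unchanged to alignment or assembly programs that accept FASTA input.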

 

The replacement of conventional autoradiography techniques with real time fluorescence detection [8] has revolutionised automated DNA sequencing. The use of multiple wavelength fluorescence detection allows dideoxysequencing ladders, generated using four distinct fluorescent dyes, to be analysed in a single electrophoresis lane. This eliminates the lane to lane electrophoretic variability and at the same time offers fast automatic detection through the use of laser technology and computer algorithms without significant human intervention (figure 3).
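
The idea of reading four dyes in a single lane can be sketched as follows: at each peak position the software compares the four fluorescence channels and calls the base whose dye gives the strongest signal. This is only a simplified sketch. The function name call_base, the representation of a peak as a dictionary of intensities and the min_ratio threshold are assumptions for illustration; real instruments also correct for spectral overlap and mobility shifts, which are not modelled here.

```python
def call_base(intensities, min_ratio=2.0):
    """Call one base from the four dye-channel intensities at a peak.

    intensities -- dict mapping 'A', 'C', 'G', 'T' to signal strength
    min_ratio   -- assumed threshold: the strongest channel must exceed
                   the second strongest by this factor, otherwise the
                   call is reported as ambiguous ('N')
    """
    ranked = sorted(intensities.items(), key=lambda item: item[1], reverse=True)
    (best_base, best), (_, second) = ranked[0], ranked[1]
    if best <= 0 or best < min_ratio * second:
        return "N"
    return best_base


def call_sequence(peaks, min_ratio=2.0):
    """Call a whole read from a list of per-peak intensity dicts."""
    return "".join(call_base(peak, min_ratio) for peak in peaks)
```

For example, a peak with intensities A=900, C=50, G=30, T=20 would be called as A, while a peak where two channels are nearly equal would be flagged as N for later inspection.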

 

The automatic interpretation and storage of the DNA sequence is only the first step in automated sequence processing. Specialised software [9] is used to clean sequence files of vector sequences and ambiguously called bases prior to assembly and analysis. Once clean data have been obtained, several utilities are available to the scientist for desktop processing. These include creating DNA and RNA sequence files and conceptually translating DNA/RNA codons into amino acid sequence files, performing pairwise and multiple DNA and amino acid sequence alignments, comparing matches or mismatches between sequences, creating ambiguity, unanimity or consensus sequences, searching for patterns and restriction sites, and editing and exporting facilities [10].
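
Two of these utilities, conceptual translation and restriction site searching, are simple enough to sketch in a few lines. The example below is a minimal sketch, not the method of any cited package: it uses the standard genetic code, translates reading frame 1 of the given strand only, and searches for an exact recognition sequence on one strand. The function names translate and find_sites are hypothetical.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Conceptually translate a DNA coding strand in reading frame 1.

    Stop codons are written as '*'; codons containing ambiguous
    symbols are written as 'X'. Trailing incomplete codons are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def find_sites(dna, site):
    """Return the 1-based start positions of every exact occurrence of site."""
    dna, site = dna.upper(), site.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start + 1)
        start = dna.find(site, start + 1)
    return positions
```

Calling find_sites with the EcoRI recognition sequence GAATTC, for instance, lists every position at which that site begins in the sequence.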

 

When many overlapping fragments of DNA have been sequenced the laborious task of sequence assembly can also be automated making sequence assembly fast and accurate.
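
The principle behind automated assembly can be illustrated with a greedy overlap-merge sketch: repeatedly find the two fragments with the longest suffix-to-prefix overlap and join them. This is a deliberately simplified model under stated assumptions. It requires exact overlaps, ignores reverse-complement reads and sequencing errors, and the names overlap and greedy_assemble as well as the min_len parameter are introduced here for illustration only.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Merge fragments greedily by longest exact overlap.

    Returns a list of contigs; more than one contig remains when
    some fragments share no sufficient overlap.
    """
    frags = list(fragments)
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags
```

The all-against-all comparison makes this sketch quadratic in the number of fragments per merge, which is acceptable for a handful of reads but not for a real sequencing project.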

 

 

1.2. Genome Mapping.

 

Extending from DNA sequencing to genome mapping [11], fluorescent dye technology has helped automate techniques such as sizing and quantitation using an internal lane standard. The emission wavelength of the fluorophore-labelled internal lane standard is recognised automatically, and the known fragment lengths and quantities of the standard are used by the software to create a calibration curve. PCR product sizes are determined automatically (figure 4) from the calibration curve, and since the products run in the same lane as the standard there are no electrophoretic mobility variations. In addition, the simultaneous use of four different fluorescent dyes allows the multiplexing of routine PCR analyses [12].

 

Using this technology, DNA fragment analysis and quantitation can be performed quickly and reliably. Flexible software and a broad linear dynamic range for fluorescence detection make data editing and interpretation easy [13], even when there is a large variation in DNA concentration between fragment bands. Fragment size can be determined precisely even if alleles differ by as little as one base pair.
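
The sizing step can be sketched as an interpolation on the calibration curve defined by the internal lane standard. The example below assumes simple linear interpolation between the two neighbouring standard peaks; commercial software may fit a different curve, and the function name size_from_standard and the numbers used in the usage note are hypothetical.

```python
from bisect import bisect_left


def size_from_standard(migration, standard):
    """Estimate a fragment size (bp) from its migration time.

    migration -- migration time (or scan number) of the unknown peak
    standard  -- list of (migration_time, size_bp) pairs for the
                 internal lane standard, sorted by migration_time
    """
    times = [t for t, _ in standard]
    if not times[0] <= migration <= times[-1]:
        raise ValueError("peak lies outside the range of the size standard")
    i = bisect_left(times, migration)
    if times[i] == migration:
        return float(standard[i][1])
    # Interpolate linearly between the flanking standard peaks.
    (t0, s0), (t1, s1) = standard[i - 1], standard[i]
    return s0 + (s1 - s0) * (migration - t0) / (t1 - t0)
```

With an invented standard of [(10.0, 100), (20.0, 200), (30.0, 300)], a peak detected at 15.0 would be sized at 150 bp. Because the standard and the sample share a lane, the same curve applies to both.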

 

Microsatellites, also known as simple sequence tandem repeat polymorphisms, are fast becoming the markers of choice for linkage mapping of genes [14]. Because they are numerous, polymorphic and far easier to find than classical restriction fragment length polymorphisms (RFLPs), they are steadily replacing lengthy RFLP analyses. Their variants (alleles) appear as PCR products of different sizes. These products are fluorescently labelled and can be analysed precisely and automatically. Microsatellite alleles can be determined accurately even if they differ by as little as one base pair, allowing gene mapping to be completed three to four times faster. The processing software not only interprets the raw detector data but can also automatically determine the amount of fluorescence signal by calculating peak height and area [15] (figure 5).

 

Changes as slight as single base substitutions can alter the conformation of single-strand DNA molecules and result in different electrophoretic mobilities in a non-denaturing gel. This allows wild type and mutant DNA, or different alleles, to be readily distinguished from each other. This PCR-SSCP analysis, in combination with fluorescent dye technology, is fully automated through the use of computer controlled electrophoretic devices and microcomputer analysis.
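
The peak height and area calculation can be sketched as follows, assuming that the trace is sampled at uniform intervals, that the baseline has already been subtracted and that the peak boundaries are known. The function name peak_height_and_area and its parameters are hypothetical; the area is obtained with the trapezoidal rule.

```python
def peak_height_and_area(signal, start, end, dt=1.0):
    """Return (height, area) of one peak in a fluorescence trace.

    signal -- list of baseline-corrected intensities
    start  -- index of the first sample belonging to the peak
    end    -- index of the last sample belonging to the peak (inclusive)
    dt     -- sampling interval, in whatever unit the area should use
    """
    window = signal[start:end + 1]
    height = max(window)
    # Trapezoidal rule over consecutive samples.
    area = sum((window[i] + window[i + 1]) / 2.0 * dt
               for i in range(len(window) - 1))
    return height, area
```

Comparing the areas of two allele peaks in the same lane gives a relative measure of the amount of each PCR product.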

 

Genotyping applications based on table formation of analysed fragment data extend further.

 

For linkage mapping a table of alleles can be constructed from the interpretation of peak data.

 

For AFLP a comparative analysis table of polymorphic peaks shows the presence or absence of marker peaks.

 

In association with a pedigree database a Mendelian inheritance check can be performed for disease gene mapping, forensic research or paternity testing.
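
A Mendelian inheritance check on a table of alleles reduces to a simple rule for each marker: one of the child's alleles must be present in the mother and the other in the father. The sketch below applies this rule to allele sizes in base pairs. It assumes an autosomal marker and ignores mutation, null alleles and sizing error; the function name and the genotypes in the usage note are invented for illustration.

```python
def mendelian_consistent(child, mother, father):
    """Check one marker for Mendelian consistency.

    Each argument is a pair of allele sizes (bp), e.g. (152, 156).
    Returns True if the child could have inherited one allele from
    each parent.
    """
    c1, c2 = child
    return (c1 in mother and c2 in father) or (c2 in mother and c1 in father)


def check_pedigree(trios):
    """Return the marker names that fail the check.

    trios -- dict mapping marker name to (child, mother, father) genotypes
    """
    return [marker for marker, (child, mother, father) in trios.items()
            if not mendelian_consistent(child, mother, father)]
```

For instance, a child typed as (152, 156) is consistent with a mother (150, 152) and a father (156, 158), whereas a child typed as (152, 160) is not.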

 

 

1.3. Primer Design Software.

 

Many computer applications [16,17,18] are available to find suitable primers or to form optimal primer pairs for PCR applications. They differ in their graphical interface, ease of use and available options. The primers selected by the software should satisfy the various requirements as fully as possible, providing highly specific and efficient amplification, and may be used in a broad range of PCR applications, as sequencing primers and as hybridization probes.

 

For example, a special check for repeated sequences at the beginning of the design [18] is useful in order to avoid nonspecific annealing to the DNA. This is especially necessary for small, known sequences like plasmids, viruses or mitochondrial DNA. Tests of the annealing of the primers with themselves, with each other (for primer pairs) or with the DNA should also be considered.

 

1.3.1 Features.

 

Several features of primer design are common in the various computer programs :

 

 

·      Creation of new primer pairs.

 

An automatic search for and selection of the best primers, and the formation of optimal primer pairs for PCR, can be performed. Some programs process one or several nucleotide sequences simultaneously and select primers that are either universal for all the sequences considered or specific for only one of them. The “universal” option in the designer programs is useful for the development of PCR-based diagnostic systems. The “specific” mode yields pairs of primers that provide specific amplification of fragments of only one nucleotide sequence out of all those proposed. All primers are tested for their capability of binding at non-specific sites (false priming) on the target sequence and on all other sequences included in the test.

 

 

·      Creation of a suitable partner for a given primer.

 

With this option the search for specific primers is performed in one or in many related nucleotide sequences in order to form primer pairs (figure 6). The user may type in the sequence of the oligonucleotide to be tested; besides the common symbols A, G, C and T, ambiguous bases may be included.

 

 

·      Finding of unique sequences within a sequence.

 

This feature is useful for selecting primers that will amplify a certain fragment within a specific sequence (if working in specific mode) or amplify the respective fragments of all proposed sequences (if working in universal mode). The user locates the left (5') and right (3') boundaries of the target fragment on the first sequence, and the best primers are sought within close zones (100 nucleotides) upstream and downstream of the fragment.

The programs allow the search for and selection of primers within sites of interest in the sequences, or the selection of primers and formation of primer pairs that amplify a fragment of interest. This may be necessary when selecting primers for site-directed mutagenesis or in preparation for cloning or sequencing.

 

 

·      Setting temperature limits.

 

Setting “temperature limits” allows the user to set permissible lower and upper limits for the annealing temperature of primers, or a permissible difference between the annealing temperatures of the two primers of a pair. Temperatures are determined by the nearest-neighbor method, taking into account the nucleation constant. The annealing temperature is estimated with formulas such as

TA = 62.3 + 0.41 * %GC - 500 / ntprimer

where %GC is the percentage of G and C bases in the primer and ntprimer is the primer length in nucleotides.
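
The empirical formula above translates directly into code. The sketch below is only an illustration of that formula, taking %CG as the percentage of G and C bases and ntprimer as the primer length; it does not implement the nearest-neighbor method, and the function name annealing_temperature is hypothetical.

```python
def annealing_temperature(primer):
    """Estimate the annealing temperature (degrees C) of a primer.

    Implements TA = 62.3 + 0.41 * %GC - 500 / n, where %GC is the
    percentage of G and C bases and n is the primer length.
    """
    primer = primer.upper()
    n = len(primer)
    if n == 0:
        raise ValueError("primer must not be empty")
    gc_percent = 100.0 * sum(primer.count(base) for base in "GC") / n
    return 62.3 + 0.41 * gc_percent - 500.0 / n
```

For a 20-nucleotide primer with 50% G+C content the estimate is 62.3 + 20.5 - 25 = 57.8 degrees. Applying the function to both primers of a candidate pair gives the temperature difference that the “temperature limits” option constrains.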

 

 

·       Assessment of existing primers or primer pairs.

 

Assessing existing primers or primer pairs allows candidate oligonucleotides to be tested against any sequence and/or for intrinsic properties of the oligonucleotide itself. In the first case the program reports the annealing temperature of the primer, taking into account possible partial non-complementarity between the primer and the target sequence. This is the temperature to be used during the first 4-5 cycles of the PCR.

 

In the second case, the length, melting and annealing temperatures and molecular weight of the oligonucleotide are displayed, together with the probable hairpins (stem-loop structures) it can form.

 

 

Some other features available but not common in all programs include:

 

·      Finding of repeats within a sequence.

 

·      Handling of long sequences up to 40.000 bp.

 

·      Looking for primers including restriction sites.

 

·      Design of mispairing primers to create restriction sites.

 

 

All searches for suitable primers can be carried out according to different criteria.

 

a. Energetic criteria allow the user to set the number of sites on the sequences that are energetically most favourable for binding of the primer under test. The energy values considered are MAX.ENERGY (Gm), the binding energy (strictly, the Gibbs standard free energy) of the primer on the template at a specific site, and DELTA (Gm - Gs), the difference between the binding energy of the primer at the specific site and at the most favourable non-specific site, i.e. the difference between the energetic maximum and submaximum (figure 7).

 

b. Complementarity or homology criteria allow the user to find sites in the sequences where the examined oligonucleotide has a level of complementarity or homology, respectively, equal to or exceeding a threshold set by the user.
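
The complementarity criterion can be sketched as a sliding-window scan: the primer is compared with every site on the template strand, and sites where the fraction of complementary positions reaches the user's threshold are reported. This sketch counts matching positions only and does not compute binding energies, so it illustrates criterion b rather than criterion a. The names binding_sites and reverse_complement and the default threshold are assumptions for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def binding_sites(primer, template, threshold=0.8):
    """Find template sites to which the primer is sufficiently complementary.

    Returns a list of (position, fraction) pairs, where position is the
    1-based start of the site on the given template strand and fraction
    is the share of primer bases that can pair with it.
    """
    # A primer anneals antiparallel, so on the template strand (read
    # 5' to 3') it pairs with its own reverse complement.
    target = reverse_complement(primer)
    template = template.upper()
    hits = []
    for i in range(len(template) - len(target) + 1):
        window = template[i:i + len(target)]
        matches = sum(a == b for a, b in zip(window, target))
        fraction = matches / len(target)
        if fraction >= threshold:
            hits.append((i + 1, fraction))
    return hits
```

A primer that yields exactly one site at a threshold of 1.0 but several sites at a lower threshold is a candidate for false priming and would need the energetic check as well.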

 

 

1.3.2 Criteria for primer and primer pair formation.

 

The program should pay particular attention to evaluating the specificity of the selected primers. A selected primer must be able to bind and extend reliably at a specific site on the right sequence (if the working mode is "specific") or at the respective sites of all the sequences considered (if the mode is "universal"), and must have the lowest possible probability of primer-template binding and extension at any other site. The energetic criterion is used to estimate the reliability of specific binding and the probability of non-specific interactions. It is based on the assumption that if a primer interacts with the template DNA in a random way, the probability and reliability of primer-template binding are primarily defined by the thermodynamic stability of the duplex, which is accurately described by the Gibbs standard free energy (dG).

 

The program should be able to adapt to the sequences examined. Selection criteria should have stringent default values that can be relaxed automatically or manually in cases where no primer pairs are predicted within the set upper and lower limits.

 

Besides being highly specific, the selected primers must also have some of the following properties:

 

a. they should contain no sequence repeat structure;

b. they should contain no large GC-rich sites and no stretched repeats of the same nucleotides;

c. they should be free of hairpins (the annealing temperature of the primer must be at least 25 °C higher than the melting temperature of the most stable hairpin that the primer could form);

d. they should not form any dimer structures;

e. their annealing temperature should be within set limits (but not lower than 35 °C and not higher than 73 °C).

f. Particular attention should be paid to the 3'-ends of primers, since their properties determine whether or not the primer can be extended. For example, universal primers must have at least three complementary nucleotides at their 3'-termini in all target sites on all sequences. On the other hand, during the search for an energetic submaximum (the site of most favourable nonspecific binding), all sites where the primer is complementary in even two 3' nucleotides are considered as sites of possible primer extension.
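
Some of these properties are easy to screen for automatically. The following sketch checks two of them: the absence of long runs of a single nucleotide (property b) and an annealing temperature within the 35 to 73 degree limits (property e), using the empirical formula quoted earlier in this section. Hairpin, dimer and 3'-end checks are not covered. The function names and the max_run default of 4 are assumptions made for the example.

```python
from itertools import groupby


def longest_run(primer):
    """Length of the longest stretch of one repeated nucleotide."""
    return max(len(list(group)) for _, group in groupby(primer.upper()))


def annealing_temperature(primer):
    """Empirical estimate: 62.3 + 0.41 * %GC - 500 / length."""
    gc = sum(primer.upper().count(base) for base in "GC")
    return 62.3 + 0.41 * (100.0 * gc / len(primer)) - 500.0 / len(primer)


def passes_basic_filters(primer, max_run=4, t_min=35.0, t_max=73.0):
    """Apply the homopolymer-run and temperature-limit filters.

    max_run is an assumed cut-off for 'stretched repeats'; t_min and
    t_max default to the limits given in the text.
    """
    if longest_run(primer) > max_run:
        return False
    return t_min <= annealing_temperature(primer) <= t_max
```

In a real design program such filters would be applied first, because they are cheap, before the more expensive specificity and energy calculations.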

 

To form primer pairs, the following are additionally required:

 

a. The difference between the annealing temperatures of the two primers must be no more than the value set by the user;

b. The size of the fragment flanked by the primers must be within the limits set by the user;

c. The primers should not bind to each other (the maximal possible energy of their binding must not exceed the submaximum of each primer);

d. The 3'-end of one primer must not bind to any site of the other;

e. Neither primer may form a hairpin at the annealing temperature of the other primer of the pair;

f. The lengths of the fragments of different sequences amplified by universal primers must not differ from each other by more than a value set by the user.
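
Three of the pair criteria (a, b and d) can be sketched without any thermodynamic calculation. In the example below the 3'-end test is reduced to exact complementarity of the last k bases, which is an assumption for illustration; the energy-based criteria c and e are not covered. All names and default values (max_ta_diff, size_limits, k) are hypothetical and would normally be set by the user.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def three_prime_binds(primer_a, primer_b, k=3):
    """True if the last k bases of primer_a can pair with a site in primer_b."""
    return reverse_complement(primer_a[-k:]) in primer_b.upper()


def pair_is_acceptable(fwd, rev, ta_fwd, ta_rev, product_size,
                       max_ta_diff=5.0, size_limits=(100, 1000), k=3):
    """Apply pair criteria a, b and d to a candidate primer pair.

    fwd, rev       -- primer sequences, written 5' to 3'
    ta_fwd, ta_rev -- their annealing temperatures (degrees C)
    product_size   -- length of the fragment flanked by the primers (bp)
    """
    if abs(ta_fwd - ta_rev) > max_ta_diff:                    # criterion a
        return False
    if not size_limits[0] <= product_size <= size_limits[1]:  # criterion b
        return False
    if three_prime_binds(fwd, rev, k) or three_prime_binds(rev, fwd, k):
        return False                                          # criterion d
    return True
```

A pair rejected by criterion d is prone to forming primer dimers, since an annealed 3'-end can be extended by the polymerase.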


 

2. Information Retrieval and Analysis

 

2.1 Making Sense of Sequences.

 

Over the past few years the growth of information concerning genomic data has been exponential [19]. From the few hundred protein sequences of the 80’s, databases now number 70.000 sequences or 650 million nucleotide bases. Complete genomes of bacteria and yeast have been sequenced and by the year 2005 the 3 billion bases of the human genome are expected to be deciphered.

 

For this information to be useful, it is important first to be able to locate the true genomic information in the determined nucleotide sequences: the actual genes expressing the proteins, the workhorse molecules of life. In the case of the human genome these amount to only 3% of the genome.

 

The task is then to determine the function of the expressed protein molecules by finding related genes and proteins whose functions are already known. Since the functionality of a protein is closely linked to its three-dimensional structure (the way the protein folds in its proper environment) and to how its subunits interact with each other and with external ligands, it is necessary to be able to define the conformation of the protein molecule, either by comparison with other proteins of known structure or by predicting its fold from its primary (amino acid) sequence.

 

In order to reach the ultimate goal of fully understanding the role of a protein in a biological process one should be able to know the fine details of interactions of the molecules concerned and should be able to redesign it, to attempt to redirect or adjust its function, or to make useful ligands, inhibitors or antagonists to its natural ligand in order to benefit  from their functionality as drugs or catalysts.

 

Computational biology has attacked these issues in many different ways over the last twenty years, and what follows is just a fraction of the tools available to biotechnologists to enhance their understanding of biological processes through genomic information.

 

 

2.2. The Sequence Databases.

 

 

The sequence databases fall into two categories: primary source databases, built from DNA and RNA sequences collected from the scientific literature and patent applications or submitted directly by researchers and sequencing groups (EMBL [20], GENBANK [21], DDBJ, SWISSPROT [22], PDB [23,24]), and metabases, whose entries are consolidated from the primary source databases after extensive processing and cross correlation of entries (TREMBL [22], EST [30], SST [32] etc.). The three major primary databases, EMBL, GENBANK and DDBJ, are produced as a collaborative effort between the European Molecular Biology Laboratory, the NCBI GenBank group (USA) and the DNA Database of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis. Since detailed analyses of the databases are given elsewhere [26], a short description of the primary source databases and some metabases follows, together with some aspects of their search engines. Table 1 contains the internet addresses of the databases described.

 

2.2.1 The EMBL Nucleotide Sequence Database

 

The EMBL Nucleotide Sequence Database is composed of DNA and RNA sequence entries. Each entry corresponds to a single contiguous sequence as contributed or reported in the literature. In many cases, entries have been assembled from several papers reporting overlapping sequence regions. Conversely, a single paper often provides data for several entries, as when homologous sequences from different organisms are compared. The database doubles in size approximately every 12 months and, as of June 1997, contains over 931 million bases from 1432941 sequence entries.

 

The nucleotide sequence data are generally present in the database as they have been published, subject to some conventions which have been adopted for the database as a whole. The sequences are always listed in the direction 5' to 3', regardless of the published order. Bases are numbered sequentially beginning with 1 at the 5' end of the sequence.

 

The sequences are presented in the database in a form corresponding to the biological state of the information in vivo. Thus, cDNA sequences are stored in the database as RNA sequences, even though they usually appear in the literature as DNA. For genomic data, the coding strand is stored. Data containing coding sequences on both strands are stored according to the prevailing conventions in the literature. The stored data generally correspond to wildtype sequences before mutation or genetic manipulation.

 

Sequences of tRNA molecules are stored as unmodified RNA sequences (equivalent to the mature transcript before any base modification occurs). This form (colinear with the genomic sequence) has been adopted to simplify both storage and analysis of the sequences. Thus, a modified base appears in the sequence as the corresponding unmodified base. However, each base modification is noted in the feature table, so that the mature tRNA sequence can be restored automatically by a simple computer program if this is desirable. The two-letter code used by Sprinzl and Gauss (published in Nucleic Acids Research in the first issue of each year) has been adopted for abbreviation of modified bases in the feature table.
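
The "simple computer program" mentioned above could look like the following sketch, which substitutes each annotated position with its modified-base code. The representation of the feature table as a list of (position, code) pairs and the placeholder codes used in the usage note are assumptions for illustration; they are not the actual Sprinzl and Gauss abbreviations, and the function name is hypothetical.

```python
def restore_modified_bases(sequence, modifications):
    """Rebuild a mature tRNA from its unmodified stored sequence.

    sequence      -- unmodified RNA sequence as stored in the database
    modifications -- iterable of (position, code) pairs taken from the
                     feature table; positions are 1-based, counted from
                     the 5' end as in the database
    Returns a list of residue symbols, because modified-base codes may
    be longer than one character.
    """
    residues = list(sequence)
    for position, code in modifications:
        if not 1 <= position <= len(residues):
            raise ValueError("modification outside the sequence: %d" % position)
        residues[position - 1] = code
    return residues
```

For example, restore_modified_bases("GCGGAU", [(2, "xC"), (6, "xU")]) replaces the second and sixth residues with the placeholder codes xC and xU and leaves the others unchanged.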

 

2.2.2 A Brief Outline of DDBJ

 

DDBJ began DNA data bank activities in Japan in 1984 with the endorsements of the Ministry of Education, Science, Sport, and Culture and representative Japanese molecular biologists. DDBJ is the sole DNA data bank in Japan, which is officially certified to collect DNA sequences from mainly Japanese researchers and to issue accession numbers to data submitters.

 

DDBJ is designed to operate as one of the International DNA Databases, including EMBL in Europe and GENBANK in the USA.

 

 2.2.3 The GenBank DNA Sequence Database

 

GenBank [21] is the U.S. National Institutes of Health (NIH) database of all known nucleotide and protein sequences, including supporting bibliographic and biological information. Since 1992 it has been based at the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine, located on the NIH campus.

 

The data is made available at no cost through the Internet, either by downloading database files or by text and sequence similarity search services.

 

As of Release 90.0 in August 1995, GenBank contained over 353,713,490 nucleotide bases from 492,483 different sequences. Although human entries predominate, constituting 54% of the total, more than 15,500 species are represented. After Homo sapiens, the top five species in terms of bases include C. elegans, S. cerevisiae, Mus musculus and Arabidopsis thaliana. Historically, the database has been doubling in size every 22 months, but that rate has rapidly accelerated due to the enormous growth in data from expressed sequence tags (ESTs). Over 56% of the sequences in the current release are ESTs.

 

The data in GenBank come from two primary sources: 1) authors who submit data directly to the collaborating databases, and 2) annotators at NCBI who extract the information from relevant journals. GenBank has a journal scanning operation to scan the current literature from over 3500 journals and identify sequences which have been published but were not submitted directly by authors. This operation has also proven successful in updating publication information and in identifying sequences that had been submitted confidentially and should be released.

 

Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, and a table of features that identifies coding regions and other sites of biological significance, such as transcription units, sites of mutations or modifications, and repeats. Protein translations for coding regions are included in the feature table. Bibliographic references are included, along with a link to the Medline unique identifier for all published sequences [27].

 

 

 

·      Retrieving GenBank Data with the ENTREZ System

 

Entrez is an integrated database retrieval system [28] which accesses DNA and protein sequence data, related MEDLINE references, a collection of genome data and 3-D structures from MMDB [29]. The DNA and protein sequence data are integrated from a variety of sources, including GenBank, EMBL, DDBJ, dbEST, dbSTS, GSDB, PIR, SWISS-PROT, Protein Research Foundation (PRF), PDB and patents. The MEDLINE references are papers indexed under the NLM's Medical Subject Heading (MeSH), 'genetics'. The genome dataset includes information on the large-scale organization of completely sequenced chromosomes or genomes, such as physical and genetic maps and aligned sequences.

 

The DNA sequence, protein sequence, MEDLINE, genome and 3-D structure data are linked to provide easy traversal among the datasets. Entrez provides an entry point into sequence or bibliographic records by simple Boolean queries. From the record, hypertext links may be used to navigate through the information space using a point-and-click interface. Some of the links are simple cross-references, for example, between a sequence and the abstract of the paper in which the sequence was reported, or between a protein sequence and its corresponding DNA sequence. Other links, however, are based on computed similarities among the sequences or among the textual documents. The precomputed 'neighbors' allow very rapid access for browsing groups of related records.

 

The client/server network version of Entrez operates with a client program on the user's machine communicating over the Internet with a server located at the NCBI. Client programs for Macintosh, PC, and Unix computers can be obtained by downloading from 'ncbi.nlm.nih.gov' in the 'entrez/network' directory. This version combines the full functionality of the Entrez interface with access to an extended set of 1.2 million Medline citations and to the MMDB structure database. The sequences are updated daily and the MEDLINE references weekly. The Web version of Entrez also provides access to the same sequence and MEDLINE datasets through standard Web browsers such as Netscape and Mosaic. Additional links are provided to external data sources, such as map information from FlyBase for Drosophila sequence records, and to the full text of publicly available online journal articles.

 

 2.2.4 The Brookhaven Protein Data Bank (PDB) of macromolecular three dimensional structures.

 

The Protein Data Bank (PDB) [23] is an archive of experimentally determined three-dimensional structures of biological macromolecules, serving a global community of researchers, educators, and students. The archives contain atomic coordinates, bibliographic citations, primary and secondary structure information, as well as crystallographic structure factors and NMR experimental data. The PDB Newsletter and CD ROM are published quarterly.

 

 

2.2.5 The SWISS-PROT Protein Sequence Data Bank

 

The SWISS-PROT Protein Sequence Data Bank [22] is a database of protein sequences produced collaboratively by Amos Bairoch (University of Geneva) and the EBI. It contains high-quality annotation and is non-redundant. Release 34.0 of SWISS-PROT contains 59.021 sequence entries, comprising 21.210.389 amino acids abstracted from 50.052 references.

 

Each  entry corresponds  to a  single contiguous  sequence  as  contributed to  the bank  or reported in the literature. In some cases, entries have been assembled from several papers that report overlapping sequence regions.  Conversely, a  single paper  can  provide  data  for several entries,  e.g. when  related sequences from different organisms  are reported.

 

In SWISS-PROT, as in most other sequence databases, two classes of data can be  distinguished: the  core data  and  the  annotation.  For  each sequence entry  the core  data  consists  of the citation information  (bibliographical references), the  taxonomic data (description  of the  biological source  of the protein) and the  sequence  data. Except for  initiator N-terminal  methionine residues,  which  are  not  included in  a sequence when their absence from the mature sequence has  been proven,  the sequence  data correspond  to the precursor form of a protein before post-translational modifications and processing.

 

The annotation consists of the description of the following items:

    o  Function(s) of the protein

    o  Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.

    o  Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, homeobox, kringle, etc.

    o  Secondary structure

    o  Quaternary structure

    o  Similarities to other proteins

    o  Disease(s) associated with deficiencie(s) in the protein

    o  Sequence conflicts, variants, etc.

 

A major advantage of SWISS-PROT is its comprehensive cross-referencing to other databases. Currently 25 different databases are cross-referenced. Cross-references are provided in the form of pointers to information that is related to SWISS-PROT entries but held in other data collections. Cross-references are made in SWISS-PROT to:

 

    o  The X-ray crystallography Protein Data Bank (PDB)

    o  The Mendelian Inheritance in Man data bank (MIM)

    o  The PROSITE dictionary of sites and patterns in proteins [25]

    o  The restriction enzymes database (REBASE)

    o  The G-protein-coupled receptor database (GCRDb)

    o  The EcoGene section of the EcoSeq/EcoMap integrated Escherichia coli K12 database

    o  The StyGene section of the StySeq/StyMap integrated Salmonella typhimurium LT2 database

    o  The gene-protein database of Escherichia coli K12 (2D-gel spots) (ECO2DBASE)

    o  The SubtiList relational database for the Bacillus subtilis 168 genome

    o  The LISTA database of yeast (Saccharomyces cerevisiae) genes coding for proteins

    o  The human keratinocyte 2D gel protein database

    o  The 2D gel protein database (SWISS-2DPAGE)

    o  The Yeast Electrophoresis Protein Database (YEPD)

    o  The Harefield Hospital 2D gel protein databases

    o  The Drosophila genome database (FlyBase)

    o  The Maize genome database (MaizeDB)

    o  The WormPep database

    o  The DictyDb database

    o  The Human Retroviruses and AIDS compilation of nucleic and amino acid sequences (HIV Sequence Database)

    o  The database of Homology-derived Secondary Structure of Proteins (HSSP)

    o  The transcription factor database (Transfac)

 

2.2.6 Some Meta databases

It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections.

 

ESTs or 'expressed sequence tags' are the most rapidly expanding source of new genes. Last year there were 50,214 sequences in the EST Division of GenBank (dbEST) [30], with 45% of these derived from human tissues [31]. Currently dbEST contains 328,905 sequences, with 80% coming from about 50 different human tissues or cell types. The remaining 20% are derived from 43 other organisms, with Arabidopsis thaliana (21,056), Caenorhabditis elegans (12,102) and Oryza sativa (11,015) each represented by more than 10,000 ESTs.

 

The STS or ‘sequence tagged sites’ division of GenBank (dbSTS) [32] continues to expand rapidly. The most notable development is the availability of approximately 2,000 gene-based STSs derived from ESTs and mapped on human genomic DNA cloned in yeast artificial chromosome (YAC) libraries by the Whitehead/MIT genome center by T. Hudson and colleagues. dbSTS contains PCR primers, reaction conditions and map locations for gene-based markers. Such data are used to support the visualization of transcript maps in the new genomes division of GenBank.

 

The UniGene collection, now accessible through NCBI's Home Page, contains more than 48,000 clusters of sequences, each representing the transcription product of a distinct human gene. With current estimates of 80,000 to 100,000 genes in the human genome, this is close to the 50% mark. The clusters are largely based on EST sequences, so most of the sequences are not complete and most of the genes have still not been characterized. But one important use of the UniGene clusters is to identify novel, nonredundant mapping candidates for generating a transcript map that identifies all coding sequences in the genome.

 

TREMBL, a computer-annotated supplement to SWISS-PROT [22], contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database that are not yet integrated into SWISS-PROT. It is organized in five subsections:

 a. Immunoglobulins and T-cell receptors

 b. Synthetic sequences

 c. Patent application sequences (coding sequences captured from patent applications).

 d. Small fragments (fragments with fewer than seven amino acids).

 e. CDS not coding for real proteins (CDS translations where strong evidence exists that these CDS are not coding for real proteins).

 

Molecular Modelling Database (MMDB)

 

NCBI has begun producing a database of macromolecular 3-D structure information [29], specifically aimed at molecular modelling research. MMDB is based upon the data in the Brookhaven Protein Data Bank (PDB). By reorganizing and validating the PDB data, MMDB provides explicit descriptions of a biopolymer's spatial structure, its chemical organization, and the linkage between the two. With MMDB, there is a clear cross-referencing between the three-dimensional structure and the chemistry of a macromolecule. Explicit linkages have been made to the sequence entries within the ID database so that, with the appropriate graphics software, users are able to view the 3-D structure of proteins identified via text or similarity searching of the sequence database.

 

2.3 Sequence analysis and retrieval programs.

 

The value of sequence databases depends not only on their role as repositories of information but also on their properties as computable objects for information retrieval and analysis. The volume of current sequence data is beyond any human processing capacity, even for simple tasks such as looking for closely related sequences (protein or nucleic acid). Automated tools have been developed to perform repetitive tasks, such as homology searching, in a fraction of the time of manual processing and with greater accuracy, and to provide user-friendly graphical interfaces for analysing the long chains of proteins and nucleic acids. The software aimed at speed and functionality falls into three categories: programs that perform similarity searches of protein and nucleic acid sequences against the sequence databases; DNA design, manipulation and drawing tools with graphical interfaces, including sequence alignment; and software that combines sequence analysis with property definition (immunogenicity, hydrophobicity, restriction mapping, etc.) and structure prediction of proteins and nucleic acids. Software developers frequently combine tools from all three categories to produce packages that automate many aspects of biological information handling.

 

2.3.1 Sequence similarity searching software.

 

Several issues have to be addressed in performing similarity searches on databases. Programs have to be sensitive, in order to find all relevant matches; specific, in order to avoid irrelevant matches occurring by chance; and efficient, in order to perform the task in a reasonable timeframe (minutes rather than days). At the same time they must take into consideration the biological context (rules of assembly and observations) and the properties of the objects of analysis (nucleic acids or proteins).

 

Identifying sequence similarities requires a search of the query sequence against all possible positions in the target sequence, and for all sequences in the database queried. Allowing for insertions or deletions in the sequence makes the combinatorial problem very complex. The application of dynamic programming techniques, combined with hash table or neighbourhood searches, has produced algorithms that accelerate the search while significantly improving its discriminatory ability.

 

The statistical significance of a similarity score relies heavily on the scoring scheme or mutation data matrix, a numerical evaluation of what is considered similar or dissimilar between amino acids or bases. The protein alphabet, with twenty characters, is of course more sensitive than the four-character alphabet of the nucleic acids, and nucleotide sequences are further affected by codon degeneracy. This implies that matches found when DNA sequences are translated and compared with protein sequences are statistically more significant than matches from searches on non-translated DNA sequences.

 

The ultimate purpose of sequence database searching is the identification of macromolecules with related structures and functions. Protein structure and folding are closely related to the amino acid sequence. However, the similarity searching algorithms described here offer only symbolic pattern matching on the sequences, and protein molecules can have similar three-dimensional structures and yet non-homologous sequences, so limits must be set on what is considered significantly homologous. The region between 20 and 30% amino acid sequence identity is the grey area where pattern matches may or may not correlate with structural and functional similarity.

 

Functional similarity may be the result of conserved patterns of structure that are not always close together in the linear representation of the protein sequence. Modern searching programs include profile matching methodology, where entire motifs can be searched for in the databases. Conversely, the databases now offer, through relational indexing, sufficient annotation and classification in the biological and motif context to facilitate more accurate scoring beyond symbolic pattern matching.

 

The two most popular programs for database searching for sequence similarities are FASTA and BLAST.

 

FASTA performs a Pearson and Lipman [33] search for similarity between a query amino acid sequence and any group of amino acid sequences, whether these reside on the user’s computer or in a database such as GENBANK or EMBL (figure 8). An extension to this program, TFASTA, searches the amino acid sequence against any group of nucleotide sequences. TFASTA translates the nucleotide sequences in all six frames before performing the comparison, thereby also searching all “implied amino acid sequences”.
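As an illustration of the six-frame translation step, the short Python sketch below translates a nucleotide sequence in the three forward frames and in the three frames of the reverse complement. It is not TFASTA's own code: the function names are invented for this example, the standard genetic code is assumed, and any codon containing a character other than A, C, G or T is shown as X.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six "implied" protein sequences returned by six_frame_translations could then be compared with the amino acid query in the usual way.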

 

BLAST [34] (Basic Local Alignment Search Tool) is the heuristic search algorithm employed by the following programs:

    o  blastp, for comparison of an amino acid query sequence against a protein sequence database

    o  blastn, for comparison of a nucleotide query sequence against a nucleotide sequence database

    o  blastx, for comparison of the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database

    o  tblastn, for comparison of a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands)

    o  tblastx, for comparison of the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database

These programs ascribe significance to their findings using the statistical methods of Karlin and Altschul with a few enhancements. The BLAST programs were tailored for sequence similarity searching, for example to identify homologs to a query sequence [35,36]. The programs are not generally useful for motif-style searching.

 

The fundamental unit of BLAST algorithm output is the High-scoring Segment Pair (HSP). An HSP consists of two sequence fragments of arbitrary but equal length whose alignment is locally maximal and for which the alignment score meets or exceeds a threshold or cutoff score.
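The idea of a segment pair can be made concrete with a small Python sketch. It searches every diagonal of the query/subject comparison by brute force for the ungapped segment pair with the highest score, and reports it only if the score reaches a cutoff. This is an illustration of the definition, not the BLAST heuristic itself (which avoids the exhaustive search); the function names and the match/mismatch scores of +5/-4 are assumptions chosen for the example.

```python
def best_segment_pair(query, subject, match=5, mismatch=-4):
    """Return (score, query_start, subject_start, length) of the
    highest-scoring ungapped segment pair, searching all diagonals."""
    best = (0, 0, 0, 0)
    for diag in range(-(len(query) - 1), len(subject)):
        q = max(0, -diag)   # starting position on the query
        s = max(0, diag)    # starting position on the subject
        run_score, run_len = 0, 0
        run_q, run_s = q, s
        while q < len(query) and s < len(subject):
            run_score += match if query[q] == subject[s] else mismatch
            run_len += 1
            if run_score <= 0:
                # A segment never benefits from a non-positive prefix: restart.
                run_score, run_len = 0, 0
                run_q, run_s = q + 1, s + 1
            elif run_score > best[0]:
                best = (run_score, run_q, run_s, run_len)
            q += 1
            s += 1
    return best


def report_hsp(query, subject, cutoff):
    """Return the best segment pair only if its score meets the cutoff."""
    hsp = best_segment_pair(query, subject)
    return hsp if hsp[0] >= cutoff else None
```

With the query GGACGTCC and the subject TTACGTAA, the shared fragment ACGT (four matches, score 20) is reported when the cutoff is 15 and suppressed when the cutoff is 25.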

 

The approach to similarity searching taken by the BLAST programs is first to look for similar segments (HSPs) between the query sequence and a database sequence, then to evaluate the statistical significance of any matches that were found, and finally to report only those matches that satisfy a user-selectable threshold of significance.

 

The default scoring matrix used by blastp, blastx, tblastn, and tblastx is the BLOSUM62 matrix [37]. Several PAM (point accepted mutations per 100 residues) amino acid scoring matrices are provided in the BLAST software distribution, including PAM40, PAM120, and PAM250. While BLOSUM62 is a good general-purpose scoring matrix and is the default used by the BLAST programs, if one is restricted to using only PAM scoring matrices, then PAM120 is recommended for general protein similarity searches [38].

 

Regardless of the scoring scheme employed, two stringent criteria must be met in order to be able to calculate the Karlin-Altschul parameters. First, given the residue composition for the query sequence and the residue composition assumed for the database, the alignment score expected for any randomly selected pair of residues (one from the query sequence and one from the database) must be negative. Second, given the sequence residue compositions and the scoring scheme, a positive score must be possible to achieve.
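The two criteria can be checked directly from the residue frequencies and the scoring scheme. The Python sketch below does this for an arbitrary scoring function; the function names, the uniform nucleotide composition and the +5/-4 scores in the usage note are assumptions made for illustration and are not taken from the BLAST distribution.

```python
def expected_pair_score(query_freqs, db_freqs, score):
    """Expected score of one randomly chosen residue pair
    (one residue from the query, one from the database)."""
    return sum(
        pq * pd * score(a, b)
        for a, pq in query_freqs.items()
        for b, pd in db_freqs.items()
    )


def meets_karlin_altschul_criteria(query_freqs, db_freqs, score):
    """True when the expected pair score is negative and at least one
    pair of residues that can actually occur has a positive score."""
    negative_expectation = expected_pair_score(query_freqs, db_freqs, score) < 0
    positive_possible = any(
        score(a, b) > 0
        for a, pq in query_freqs.items() if pq > 0
        for b, pd in db_freqs.items() if pd > 0
    )
    return negative_expectation and positive_possible


def simple_score(a, b, match=5, mismatch=-4):
    """Illustrative match/mismatch scoring scheme (an assumption)."""
    return match if a == b else mismatch
```

For a uniform nucleotide composition (each base at frequency 0.25), the +5/-4 scheme gives an expected pair score of -1.75 and satisfies both criteria. A scheme of +5/-1 gives an expected score of +0.5 and fails the first criterion, since random alignments would then tend to accumulate score as they grow longer.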

 

BLITZ is an automatic electronic mail server for the MPsrch program. MPsrch performs sensitive and extremely fast comparisons of protein sequences against the Swiss-Prot protein sequence database using the Smith and Waterman best local similarity algorithm [39]. It runs on the MasPar family of massively parallel computers. A typical search of the entire Swiss-Prot 23 release with a query sequence of 400 amino acids takes approximately 40 seconds. Additional time is required to reconstruct the alignments, depending on the number of alignments requested. The Smith and Waterman algorithm looks for the best "local" match (short matching regions, such as binding sites, in the middle of long sequences). The match is scored using the amino acid similarity weights assigned to the different possible pairs of aligned residues, known as PAM matrices, and a penalty that is subtracted from the alignment score for every residue that has been inserted or deleted in the best local alignment.
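The scoring recurrence of the Smith and Waterman algorithm can be sketched in a few lines of Python. The version below returns only the best local alignment score, not the alignment itself, and it is a plain serial implementation rather than the massively parallel MPsrch code. A simple match/mismatch score stands in for a PAM matrix, and the function name and the parameter values are assumptions made for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=2):
    """Best local alignment score of sequences a and b.

    `gap` is the penalty subtracted for every inserted or deleted residue.
    """
    prev = [0] * (len(b) + 1)   # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment here
                prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                prev[j] - gap,        # residue of a against a gap
                curr[j - 1] - gap,    # residue of b against a gap
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

The zero in the maximum is what makes the alignment local: a poorly scoring region never drags down a good match that follows it. For example, ACGTT against ACTT scores 6 (four matches at +2 each, less one gap penalty of 2).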

 

2.3.2 Multiple sequence alignments (MSA).

 

Once a sequence related to the query is identified, several other significant matches are likely to be found. The multiple alignment of a group of sequences or a family of proteins provides more information than the alignment of any pair of sequences. MSA is a generalisation of pairwise sequence alignment, but the complexity of the calculation grows exponentially with the number of sequences considered. An approximate algorithm uses a trial multiple sequence alignment as the basis for realigning each of the component sequences iteratively, until each is optimally aligned. Other approaches incorporate chemical and biological knowledge into the multiple alignment, thus greatly improving the reliability of the results.

 

The most popular software for multiple alignment is CLUSTAL [40], an automated program that also provides generation of phylogenetic trees and profile alignment (figure 9). Other programs allow the user to define, lock and shift different portions of the sequence data interactively, with the use of computer graphics, and thus to guide the alignment process.

 

Profile similarity searching is closely linked to multiple alignments. A profile, representing a group of aligned sequences or a sequence motif such as those defined in the PROSITE Dictionary of Protein Sites [25], is used as a probe to search a database for new sequences with similarity to the profile or the motif. Several profile generation and searching programs exist and are usually linked to the major databases or the multiple alignment software.
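A much simplified picture of profile searching is given by the Python sketch below. It builds a profile as a table of residue counts per alignment column and slides it along a target sequence, scoring each window by how often its residues were seen in the corresponding columns. Real profile programs use weighted, gap-aware scores; the raw counts and the function names used here are assumptions chosen to keep the example short.

```python
from collections import Counter


def build_profile(aligned):
    """Return one Counter of residue counts per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    return [Counter(seq[col] for seq in aligned) for col in range(length)]


def best_profile_match(profile, sequence):
    """Return (score, start) of the best-scoring window in `sequence`,
    or None if the sequence is shorter than the profile."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # A residue never seen in a column contributes 0 to the score.
        score = sum(column[res] for column, res in zip(profile, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

A profile built from the aligned fragments ACGT, ACGA and ACCT, used as a probe against TTACGTTT, finds its best match at position 2 with a score of 10.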

 

2.3.3. Sequence analysis utilities.

 

A database search or a multiple sequence alignment is only the first step in the sequence analysis process. Several graphics-based programs offer numerous tools for efficient organisation, manipulation and indexed storage of sequence information. Powerful display windows allow easy editing and analysis through algorithms simulating molecular biological processes and techniques.

 

In analysing nucleic acid sequences, features include drawing and manipulation of plasmid, circular and linear restriction maps (figure 10), restriction enzyme selection and tabulation, RNA folding and ORF analysis [16]. Some specialised programs also offer a gel electrophoresis pattern display [41].
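The core of a linear restriction map is a search for recognition sites followed by a calculation of the fragment lengths, which is also the information a simulated gel pattern needs. The Python sketch below shows this for a single enzyme on a linear molecule. The function names are invented for the example; the enzyme in the usage note is EcoRI, which recognises GAATTC and cuts after the first G.

```python
def find_sites(dna, site):
    """Return the 0-based start positions of every occurrence of `site`."""
    dna, site = dna.upper(), site.upper()
    return [
        i for i in range(len(dna) - len(site) + 1) if dna.startswith(site, i)
    ]


def linear_fragment_lengths(dna, site, cut_offset):
    """Lengths of the fragments produced by cutting a linear molecule.

    `cut_offset` is the number of bases of the recognition site that lie
    before the cut (1 for an enzyme that cuts after the first base).
    """
    cuts = [pos + cut_offset for pos in find_sites(dna, site)]
    edges = [0] + cuts + [len(dna)]
    return [end - start for start, end in zip(edges, edges[1:])]
```

For the 20-base sequence AAGAATTCTTTTGAATTCGG, find_sites locates GAATTC at positions 2 and 12, and linear_fragment_lengths with a cut offset of 1 gives fragments of 3, 10 and 7 bases. Only the given strand is searched, which is sufficient for a palindromic site such as GAATTC.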

 

For protein sequences, profile alignment is followed by secondary structure prediction, hydrophobicity and immunogenicity profiles, helical wheels, amino acid content and molecular weight determination, phylogenetic trees and numerous other analyses [42].

 

Editing facilities incorporate fragment assembly, contig and translation and reverse translation editors, consensus sequence generation, publication tools and input/output format handling [43].

 

At the time of writing, more than twenty commercial and hundreds of public domain computer programs are available for genomic research. There is an enormous choice in the graphics tools offered for manipulating, storing and organising genetic data, in the efficiency and accuracy of the implemented algorithms, and in convenience and ease of use. The most popular packages are those that take care of the details efficiently and allow users to follow their own thought process. Interconnectivity with standard experimental and database formats, and modularity with wide functionality, help in this respect. Whole databases describing the available software exist and are perhaps the best starting point for the novice user (Table 2).

 


 

3. Bioinformatics and Network Services.

3.1.  The Internet.

 

The Internet is a global communication system that, through the use of computer networks, connects millions of users and provides access to enormous amounts of information and services. Its infrastructure is based on fast telecommunication lines transferring digital data in the form of electronic packets, each labelled with its destination. These packets are forwarded individually by adjacent computers on the network, acting as routers, and are reassembled in their original form at their destination. Packet switching allows multiple users to send information across the network both efficiently and simultaneously. Half of the networks on the Internet are commercial and one third are associated with educational and research institutions.
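The principle of packet switching can be sketched in a few lines of Python: a message is cut into labelled packets, the packets may arrive in any order, and the receiver puts them back together. This is a deliberately simplified model; the field names (dest, seq, payload) and the packet size are assumptions for the example, and real network packets carry considerably more header information.

```python
def to_packets(message, destination, size=8):
    """Split a message into packets labelled with destination and order."""
    return [
        {"dest": destination, "seq": number, "payload": message[start:start + size]}
        for number, start in enumerate(range(0, len(message), size))
    ]


def reassemble(packets):
    """Rebuild the original message from packets received in any order."""
    ordered = sorted(packets, key=lambda packet: packet["seq"])
    return "".join(packet["payload"] for packet in ordered)
```

A 30-character sequence sent with a packet size of 8 travels as four packets, and reassemble recovers the original text whatever the order of arrival.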

 

For most scientists the Internet started as a tool for scientific collaboration through facilities such as electronic mail, file transfer (ftp), direct links to remote computers (telnet) and, more recently, the World Wide Web (WWW). The WWW is a hypertext-based Internet service providing information and resources through specialised software, the browsers, which use globally agreed communication protocols. The browsers offer the graphical user interface (based on the HTML language) and the communication tools (protocols) for a computer (the client) to link to another computer (the server) and exchange information across the network. The information may be in the form of text, numerical data, image, sound, video or anything else that can be represented by a string of numbers.

 

Since the power of the local computer (the client) can also be used to process the raw information locally, a new approach has more recently been developed around a Web programming language (Java), in which some of the software (the applets) is downloaded to the client and is used to process the downloaded information. This approach resolves the data incompatibility problem and also allows the development of sophisticated display and searching tools.

 

For the biotechnologist the parallel growth of bioinformatics and network services has created a revolution in the way of thinking, extracting and processing the vast amounts of biological information available. The World Wide Web enables scientists working on virtually any computer to access remote databases and run queries and still receive their results in minutes.

The researcher can remain up to date with the literature, by browsing recent journals or using the commercial or public services offered, without leaving the laboratory. Before investing time in designing experiments, the researcher can query the nightly updated databases on line to identify clues about the DNA sequence, the protein structure, the evolutionary pattern (and endless other features) of interest.

 

The list of services for the molecular biologist on the network cannot be described in a few pages. The preferable way to explore them is through network browsing, starting from one of the several providers of biological information such as those given in Table 2. Instead, the network use of the previously described databases will be highlighted here, as well as the use of some biocomputing tools that are found and used on the network.

 

 

3.2.  Information retrieval through the World Wide Web.

 

 

3.2.1. Databases.

 

The explosive growth of the DNA databases is causing serious problems for those wishing to use them on microcomputers. The current GENBANK release barely fits on five CDs (600 Mbytes each), and that is only for the raw data. The distribution is quarterly, while at its home site (NIH) the database is, as with all major databases, updated nightly. Searching a database across the network on the big service providers can take just seconds, and queries can often be processed and the results received within minutes.

 

Most of the DNA and protein databases described in section 2.2 can be searched on the World Wide Web (WWW) through the home page of the resident site, or of mirror sites all over the world, by using a general browser such as Netscape [44], Mosaic [45] or Microsoft Internet Explorer [46]. The web server provides a general hypertext form in which the user can choose the databases or subsets to be searched and the search parameters, and can import the query sequence or text. The WWW addresses of such service providers are given in Table 3.

 

Furthermore, through special server/client implementations, databases can be searched by a program resident on the client that combines the full functionality of specific interfaces with access to an extended and frequently updated group of databases.

 

The Entrez interface to GENBANK has recently been upgraded to a client/server application for searching the sequence database  and extended to include Medline bibliographic citations and the MMDB structure database. The query is done through simple boolean operations and hypertext links are formed by using a point-and-click graphical interface. Links could be simple cross references between sequence and abstract to the paper reported, between a protein sequence and its corresponding DNA sequence or with precomputed similarities among the sequences or textual documents. The sequences can be displayed graphically in order to visualise complex annotations such as segmented sequences or alternative splicing in coding regions.

 

With the development of the graphical CN3D interface [47], finding a DNA/protein structure or identifying similarities in Entrez is just like any other search. A query can contain specific fields such as author names or text terms, extend to sequence neighbours (homologues) and then ask whether structural data are available for any of the members of the identified family. The CN3D interface to the MMDB database also provides a versatile visualisation of protein and DNA architecture, as well as of functionally important features (binding sites, ligands, cofactors), without the user having to know how to extract this information from the cartesian coordinates in which the structure is expressed.

 

The major service providers (NCBI, EBI, DDBJ) also offer client programs for executing similarity queries (BLAST, FASTA) directly over the Internet. Users on the Internet can use the file transfer protocol (FTP) facilities to download the client applications, the entire database releases or the daily updates.

 

Users with access to electronic mail only can search the databases by sending a mail message containing specific keywords or boolean combinations of text. Electronic mail services are also provided for sequence similarity searching where the query sequence is submitted by e-mail and the results are electronically posted back to the sender.

 

 

3.2.2 Biocomputing through the Web.

 

 

Beyond searching the databases through the Internet, one can make use of the dedicated servers in the big biotechnology centers (EBI, NCBI, DDBJ etc.) to perform calculations other than querying databases or comparing sequences on line (Table 3). For example, “The Protein Machine” [48] will take a DNA sequence and try to translate it into a protein sequence, while at the “UGCG” centres [46] a full analysis of antigenicity, hydrophobicity or folding can be performed on line, with the results sent by electronic mail to be printed on the user's printer.

 

Another biocomputing tool, Darwin [49], offers a single, integrated tool for practical and experimental computational biochemistry. This includes loading and retrieval of sequences from sequence databases, fast searching for sequence fragments, repetitions and frequent patterns, sequence alignment, and PAM distance and variance estimation based on Dayhoff-like matrices. It can also generate random sequences and mutations, calculate phylogenetic trees and multiple alignments, and plot curves, trees and histograms. It includes numeric functions, matrix and vector arithmetic, general purpose functions for searching and sorting, and input and output functions to store and load results and to communicate with other programs. It is also expandable, since it contains its own programming language that allows automation of repetitive tasks and addition of new functionality.

 

 

In addition, most sites offer a very comprehensive catalog and store of biocomputing tools, either commercial software or from the public domain, that can be downloaded and executed on the client. The software is cataloged according to machine or operating system and according to category of biocomputing tool. For example, EBI (EMBL's biocomputing site) divides the software in its BioCatalog into domains that concern DNA, Proteins, Alignments, Genetics, Mapping, Molecular Evolution, Molecular Graphics, Databases etc., while each domain has subdomains for further classification (e.g. Genetics is further divided into Contig assembly, Genetic and Physical Mapping, Genome Mapping, Oligomer design and synthesis, and Restriction maps). A list of sites that support this service, with their Internet addresses, is given in Table 3.

 

 

 


REFERENCES

 

1. Sanger F., Nicklen S., Coulson A.R. (1977) PNAS USA 74,5463.

 

2. Cherry J.M. (1993) Current Protocols in Molecular Biology (ed. F.M.Ausubel et al.) pp 7.7.1-7.7.31.

 

3. Slatko B.E., Albright L.M. and Tabor S., (1993) Current Protocols in Molecular Biology pp 7.4.1-7.4.27.

 

4. Prober J. et al. (1987) Science 238, 336.

 

5. Science (1996), 272, 1509-1552.

6. Bioimage by B.I.Systems Corp. (info@Bioimage.com).

 

7. Lasergene , DNASTAR Inc. (www.dnastar.com).

 

8. Maynard P.E. et al. (1992) Applied and Theor. Electr. 3, 1-11

 

9. Sequence Assembler, A computer program by Applied Biosystems Inc. (1994) (www.perkin-elmer/ab/)

 

10. Sequence Navigator, A computer program by  Applied Biosystems Inc. (1994). (www.perkin-elmer/ab/)

 

11. Lee L.G. (1993) Nucleic Acids Research 21, 3761-3766.

 

12. Mcbride L.J. et al. (1989) Clin.Chem 35: 2196-2201.

 

13. Genescan, A computer program by Applied Biosystems Inc. (1994).

 

14. Fries A., Eggen A. and Stranzinger G. (1990) Genomics 8, 403-406.

 

15. Genotyper, A computer program by Applied Biosystems Inc. (1994).

 

16. DNASIS, Hitachi Software (www.hitsoft.com/gs/dnasis/index.html/)

 

17. Gene Jockey II, Biosoft (www.biosoft.com/biosoft/).

 

18. Primer-Master (1993) A computer program by Proutski V and Sokur O.

 

19. Williams N. (1997) Science 275, 301-302.

 

20. EMBL

 

21. Benson, D., Boguski, M., Lipman, D.J., and Ostell, J. (1994) Nucleic Acids Research, 22, 3441-3444.

 

22. Bairoch A. and Apweiler R. (1997) The SWISS-PROT protein sequence data bank and its supplement TREMBL. Nucleic Acids Res. 25, 31-36.

 

23. Bernstein F.C. et al. (1977) J.Mol.Biol. 112, 535-542.

 

24.  The X-ray  crystallography Protein  Data Bank  (PDB)  Abola   E.E.,  Bernstein   F.C.  and  Koetzle  T.F.; (1988) In "Computational molecular  biology. Sources  and methods for sequence analysis", Lesk A.M., Ed., pp69-81, Oxford University Press, Oxford.

 

25. The PROSITE dictionary of sites and patterns in proteins  Bairoch  A., Bucher P. and Hofmann K.; Nucleic Acids Res. 24:189-196(1996).

 

26. Bishop et al (1987)  Nucleic acid and protein sequence analysis: A practical approach pp83-146.

 

27. Boguski, M.S. (1995) Trends in Biochemical Sciences, 20, 295-296.

 

28. Schuler, G.D., Epstein, J.A., Ohkawa, H and Kans, J.A. (1996) Methods Enzymol. 266, 141-162.

 

29. Hogue,C.W.V., Ohkawa, H., and Bryant, S.H. (1996) Trends Biochem.Sci. 21, 226-229.

 

30. dbEST: sequence and mapping data on "single-pass" cDNA sequences or Expressed Sequence Tags (1993) Nature Genetics 4, 332-333.

31. Boguski, M.S., Tolstoshev, C.M. and Bassett, D.E. (1994) Science, 265, 1993-1994.

 

32. dbSTS: sequence and mapping data on short genomic landmark sequences or Sequence Tagged Sites (1989) Science 245, 1434-1435.

 

33. Pearson W.R., Lipman D.J. (1988) PNAS USA 85, 2444-2448.

 

34. Altschul, S.F., Boguski, M.S., Gish, W., and Wootton, J.C. (1994) Nature Genetics, 6, 119-129.

35. Bassett, D.E., Boguski, M.S., Spencer, F., Reeves, R., Goebl, M., and Hieter, P. (1995) Trends in Genetics, 11, 372-373.

 

36. Madden, T.L., Tatusov, R.L. and Zhang, (1996) Methods in Enzymology, 266, 131-141.

37. Henikoff S. and Henikoff J.G. (1991) Nucleic Acids Res. 19, 6565-6572.

 

38. Altschul, S.F. et al. (1992)  J.Mol.Biol. 215, 403-410.

 

39. Smith T.F. and Waterman M.S. (1981), J.Mol.Biol, 147, 195-197.

 

40. Higgins, D.G. and Sharp, P.M. (1988) CLUSTAL: a package for performing multiple sequence alignments on a microcomputer. Gene 73, 237-244.

 

41. SCAN DNASIS Hitachi Software (www.hitsoft.com/gs/scandna/index.html/)

 

42. Genetics Computer Group, Wisconsin USA (www.gcg.com).

 

43. Vector NTI InforMax Inc., www.informaxinc.com/vectornti.html/

 

44. Netscape, a commercial Web viewer (http://home.netscape.com/).

45. Mosaic, a Web viewer, product of the US National Center for Supercomputing Applications (NCSA) (www.ncsa.uiuc.edu/SGD/Software/Mosaic/).

 

46. Internet Explorer a product of Microsoft (www.microsoft.com).

 

47. Hogue C.W.V. Cn3D: a new generation of three-dimensional molecular structure viewer. Trends in Biochem. Sci. 22, 313-315 (1997).

 

48. The Protein Machine (http://www.ebi.ac.uk/contrib/tommaso/translate.html).

 

49. Darwin, a Computational Biochemistry Research Tool (http://cbrg.inf.ethz.ch/section3_3.html)

 

 

 


 

Table 1

 

Internet addresses of biological Databases

 

 NCBI access to GenBank, Entrez and BankIt                http://www.ncbi.nlm.nih.gov/

Histocompatibility Complex (MHC),                   http://www.ebi.ac.uk/imgt/

 in humans ( referred to as the Human

 Leucocyte Antigens (HLA) system                                                                                                  

The EMBL database                                                                http://www.embl-heidelberg.de

EBI European Bioinformatics institute                 http://www.ebi.ac.uk/

Mitbase -                                                                               http://www.uq.oz.au/nanoworld/

The Macromolecular Structure DataBase at EBI                 http://www2.ebi.ac.uk/msd/

EBI Mirror of the Protein Data Bank.                      http://www2.ebi.ac.uk/pdb/

SWISS-PROT (Amos Bairoch)                                    http://expasy.hcuge.ch/
ReBase Restriction Enzyme Database                                 http://www.neb.com/rebase

EMBL Nucleotide Sequence Database               http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/topembl.html

dbEST: a division of GenBank for cDNA sequence and mapping data                 http://www.ncbi.nlm.nih.gov/dbEST/index.html

Genome Data Base (GDB)                                                           http://gdbwww.gdb.org/gdbhome.html/

TreeBASE: a database of phylogenetic knowledge                 http://herbaria.harvard.edu/treebase/

 

 

 


 

Table  2

 

Internet addresses of Bioinformatics Sites

 

EBI European Bioinformatics Institute          http://www.ebi.ac.uk/

European Mol. Biology Lab. (EMBL)      http://www.embl.de/

Johns Hopkins Bio-Informatics Home Page                            http://www.gdb.org:80/hopkins.html/

Australian National University Bioinformatics                 http://life.anu.edu.au:80/

Bioinformatics Resource Index (Univ. of Manchester)  http://mbisg2.sbc.man.ac.uk:80/gradschool/bioinf/brass95.html/

National Institutes of Health, Molecular Biology                 gopher://saavik.niehs.nih.gov/

Cold Spring Harbor Laboratory Information Online                 http://www.cshl.org:80/

Institute for Molecular Virology, Univ. of Wisconsin    http://www.bocklabs.wisc.edu:80/Welcome.html/

Internet for the molecular biologist                                                http://www.apollo.co.uk/a/horizon/#vol3

Biotechnology Software & Internet Journal                 http://www.orst.edu/~ahernk/bsj.html

Books for Molecular Biology                                  http://www.apollo.co.uk/a/horizon/

Molecular Biology Jump Station                                   http://www.ifrn.bbsrc.ac.uk/gm/lab/docs/molbiol.html/

NCBI access to GenBank, Entrez and BankIt                  http://www.ncbi.nlm.nih.gov/

Pedro's Biomolecular Research Tools                    http://www.fmi.ch/biology/research_tools.html

Cell and Molecular Biology Online: educational and teaching resources                 http://www.tiac.net/users/pmgannon/teaching.html

BioSupplyNet: a database of materials for biomedical research                 http://www.biosupplynet.com/

BIOSCI/Bionet newsgroups                                          http://schmidel.com/bionet.htm

 

ExPASy: a molecular biology server dedicated to the analysis of protein and nucleic acid sequences                 http://expasy.hcuge.ch/

A Biologist's Guide to the Internet                                  http://www.csc.fi/molbio/una/unopost.index.html

 

 

 

 


 

Table 3

 

A list of biocomputing services with their internet addresses.

 

Bio-wURLd is a searchable, user-maintained collection of URLs related to bioinformatics, biochemistry, and molecular biology. http://www.ebi.ac.uk/htbin/bwurld.pl

 

The Protein Machine A tool for translating from DNA to protein which offers a selection of translation tables and reading frames and allows you to select DNA from an EMBL Nucleotide database entry. http://www.ebi.ac.uk/contrib/tommaso/translate.html

 

The Protein Colourer A tool for colouring your amino acid sequences. Allows you to input a sequence or select one from SWISS-PROT. http://www.ebi.ac.uk/htbin/visprot.pl

 

The Alignment Colour Viewer A tool for colouring two aligned sequences. http://www.ebi.ac.uk/htbin/visalign.pl

 

Darwin, a Computational Biochemistry Research Tool http://cbrg.inf.ethz.ch/section3_3.html

 

AACompIdent allows the identification of a protein from its amino acid composition. It searches SWISS-PROT for proteins whose amino acid compositions are closest to the amino acid composition given. http://expasy.hcuge.ch/ch2d/aacompi.html

 

AACompSim is a tool which allows the comparison of the amino acid composition of a SWISS-PROT entry with all other SWISS-PROT entries so as to find the proteins whose amino acid compositions are closest to that of the selected entry. http://expasy.hcuge.ch/ch2d/aacsim.html

        

GuessProt is a tool which allows the retrieval of the SWISS-PROT entries closest to a given pI and Mw. http://expasy.hcuge.ch/www/guess-prot.html

 

Compute pI/Mw is a tool which allows the computation of the theoretical pI (isoelectric point) and Mw (molecular weight) for a given protein stored in SWISS-PROT, for a list of SWISS-PROT entries, or for a user-entered sequence. http://expasy.hcuge.ch/ch2d/pi_tool.html

        

ScanProsite is a tool which either scans a protein sequence (from SWISS-PROT or provided by the user) for the occurrence of patterns stored in the PROSITE database, or scans the SWISS-PROT database (including weekly releases) for the occurrence of a pattern that originates from PROSITE or is provided by the user. http://expasy.hcuge.ch/sprot/scnpsite.html

 

Translate is a tool which allows the translation of a nucleotide (DNA/RNA) sequence to a protein sequence. http://expasy.hcuge.ch/www/dna.html

 

MIPS ALERT Every day, new protein sequences are added to the protein and nucleic acid databases. The Alert utility is designed to keep you abreast of these changes by sending you once per week, via email, the new database entries related to your field of interest. http://www.mips.biochem.mpg.de

 

CPROP Maps of Human Chromosomes CPROP is an experimental program for map construction and integration. It is based on AI methods of reasoning with constraints. Information about distances and orders of loci derived from experiments and/or other maps is represented by constraints, and an inference process propagates these constraints around the map in an attempt to reduce uncertainty. A version of CPROP was described in Letovsky, S. and Berlyn, M. (1992) Genomics, 12 (3), 435-446. http://gdbdoc.gdb.org/letovsky/cprop/human/maps.html

 

Bionet news filter Here you can search all the available articles in 3 selected bionet newsgroups and retrieve those containing the regular expression you specified. http://base.icgeb.trieste.it/cgi-bin/nfilter.pl


 

FU Berlin - List of Amino Acids 3D molecular model, CAS Registry Number, isoelectric point, symbols, formula, structures etc. for each amino acid. http://www.chemie.fu-berlin.de/chemistry/bio/amino-acids.html

 

Atlas of Protein Side-Chain Interactions This atlas depicts how amino acid side-chains pack against one another within the known protein structures. This packing, which is governed by the interactions between the 20 different types of side-chains, determines the structure, function, and stability of proteins. http://www.biochem.ucl.ac.uk/bsm/sidechains/index.html

 

The Dictionary of Cell Biology is intended to provide quick access to easily-understood and cross-referenced definitions of terms frequently encountered in reading the modern biology literature. This server contains the text of the second edition, published in April 1995, together with enhancements, hypertext links and new entries which are destined for the third edition. http://www.mblab.gla.ac.uk/~julian/Dict.html

 

Promega Vector Sequences and Technical Literature Technical Bulletins and Manuals supporting Promega's products are being added to our site based on the frequency with which they are requested by customers. Sequences of all our vectors are available for on-line viewing. http://www.promega.com/techdoc.html

 

BioSupplyNet This new site is an online directory of biomedical/life science research products and services and allows searching in a database of more than 15000 products from 1400 suppliers. The site is updated weekly and allows immediate access to suppliers via email for technical support and ordering information. http://www.biosupplynet.com/bsn/

 

"Universal" Genetic Code as HTML table http://golgi.harvard.edu/gencode.html

 

EBI Bio-Filter allows you to select five bionet newsgroups and search through them  for a keyword of your choice. http://www.ebi.ac.uk/cgi-bin/nfilter.pl

 

Webcutter An excellent on-line tool for restriction mapping nucleotide sequences http://www.medkem.gu.se/cutter

 

ABIM (Université Aix-Marseille I, France) Biological servers on the Web: databases (sequences, organisms, collections), on-line analysis tools, guides and tutorials in biology. http://www-biol.univ-mrs.fr/english/biology.html

 

Biological Data Transport is an integrated resource for the life sciences community. Search GenBank, Medline, Entrez, OMIM, The Biotech Registry, vendor products and services, and other great resources simultaneously! We add content and functionality daily! http://www.data-transport.com

 

The Biotech BiblioNet is a monthly bibliography of recently published  biotechnology articles. http://www.azc.com/client/sage/biotech/biotech.htm

 

PCR Jump Station The ultimate site for information and links on the Polymerase Chain Reaction (PCR). http://apollo.co.uk/a/pcr

 

Science Guide This Web site consists of a number of different sections designed to help the scientist and physician find information on the Internet and to sponsor communication between those interested in science. http://www.scienceguide.com


Legends to Figures

 

 

Figure 1. Scanned gel autoradiography image displayed using the program SCAN DNASIS.

 

Figure 2. Semiautomatic DNA sequencing of a scanned sequencing gel using the program SCAN DNASIS. Each column contains the autoradiography image of the DNA fragments chain-terminated by a specific dideoxynucleotide.

 

Figure 3. Electropherogram of a DNA fragment sequenced automatically using fluorescence detection on an ABI Prism Sequence Analysis Instrument.

 

Figure 4. Genome mapping using fluorescence dye technology. Display of different size PCR fragments.

 

Figure 5. Microsatellite analysis using fluorescence dye technology.

 

Figure 6. Primer pairs located using the program Lasergene.

 

Figure 7. Display of binding energies of located primers (from Lasergene).

 

Figure 8. Program output of a FASTA search of a protein sequence on the SWISS-PROT database.

 

Figure 9. Program output of a CLUSTAL protein sequence multiple alignment.

 

Figure 10. Display of a circular restriction map of plasmid M13mp11. Different windows show a general view of the plasmid, the restriction enzyme sites found and the position of the specific restriction enzyme sites on the plasmid sequence.