The Human Genome
Project might get the majority of the attention from the
popular press, but scientists have also been busy generating
sequence data from 32,000 other species. These sequences come from
well-organized, multicenter sequencing projects, and from the
efforts of individual researchers. Some projects are still in the
preliminary, but important, mapping stages. Others, including
those for E. coli and Saccharomyces cerevisiae, have
completed their sequencing efforts.
The availability of
this information on the Web has changed how research is done. Without lifting
a Pipetman, researchers can virtually clone the C. elegans
homologue of a Drosophila transcription factor, look for
disease-resistant genes in maize, or compare virulence factors in
Pseudomonas species. The Web provides free access to much
of this sequencing information, but it can take a little work to
find the sites and learn the various programs used for displaying
sequence data.
The best place to start when looking for nucleotide sequences
is the National Center for
Biotechnology Information (NCBI) at the National Library of Medicine.
This site provides several sequence search and comparison options.
Visitors can navigate easily among the different services, which
often use similar formats for data input.
Late in 1992, NCBI
became responsible for maintaining the GenBank
database of nucleotide sequences. Visitors get several options for
searching GenBank, including the Entrez
browser. Entrez searches are text-based; users can limit them to
specific fields, such as organism or accession number, and can
refine the results to narrow their focus.
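
For readers who prefer scripts to a browser, the field-limited search described above can be sketched in Python. This is only an illustration under stated assumptions: it uses NCBI's E-utilities "esearch" address and the `nuccore` database name, neither of which is mentioned in this article, and the helper names `build_entrez_term` and `build_esearch_url` are hypothetical. The code builds a query and a URL but does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities "esearch" endpoint (not named in the article).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_term(keywords, organism=None, accession=None):
    """Combine free-text keywords with optional field-limited clauses."""
    clauses = []
    if keywords:
        clauses.append(keywords)
    if organism:
        # Restrict the match to the organism field.
        clauses.append(f'"{organism}"[Organism]')
    if accession:
        # Restrict the match to the accession field.
        clauses.append(f"{accession}[Accession]")
    return " AND ".join(clauses)


def build_esearch_url(term, db="nuccore", retmax=20):
    """Return a search URL; this function does not send any request."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


term = build_entrez_term("transcription factor",
                         organism="Caenorhabditis elegans")
url = build_esearch_url(term)
print(term)
print(url)
```

Adding or dropping a field clause is the scripted equivalent of refining a search in the browser: each extra clause joined with AND narrows the result set.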
Another method of locating nucleotide sequences from GenBank
and other databases is a search tool called BLAST (Basic Local
Alignment Search Tool). This search takes a sequence of interest -
either protein or nucleotide - and pulls out similar sequences
from a specified database. The Advanced
BLAST page allows organism-specific searches. NCBI's Microbial
Genomes BLAST Databases page searches sequences from unfinished
microbial genome projects.
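
To make "pulls out similar sequences" concrete, here is a toy sketch in Python. It is not BLAST's algorithm, which relies on faster heuristics; it is a plain Smith-Waterman local alignment score, used only to show what ranking a database by local similarity means. The scoring values, the function names, and the three-entry `toy_database` are all invented for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below zero, so an alignment can start anywhere.
            curr[j] = max(0,
                          prev[j - 1] + step,  # align a[i-1] with b[j-1]
                          prev[j] + gap,       # gap in b
                          curr[j - 1] + gap)   # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


def rank_database(query, database):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, local_alignment_score(query, seq))
              for name, seq in database.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical sequences standing in for database entries.
toy_database = {
    "seq_a": "TTTTACGTACGTTTTT",
    "seq_b": "GGGGGGGGGGGGGGGG",
    "seq_c": "CCACGTACCC",
}
ranking = rank_database("ACGTACGT", toy_database)
print(ranking)
```

The entry that contains the whole query ranks first, the partial match second, and the unrelated sequence last, which is the kind of ordered hit list a similarity search returns.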
If a search in Entrez
pulls up an overwhelming number of entries, you might try the UniGene
database. UniGene currently covers only human,
mouse,
and rat
sequences; the database for each species contains a "nonredundant set of
gene-oriented clusters." An entry for a specific gene might
include links to expressed sequence tags (ESTs), mRNA sequences,
expression information, and the relevant Online Mendelian
Inheritance in Man (OMIM) entry.
A few other useful sites from NCBI are the Taxonomy
Browser, the Database of
Expressed Sequence Tags (dbEST), which contains more than two
million ESTs, and Entrez
Genomes. This last resource displays sequence data from
completed genome projects, as well as mapping information from
others, using a graphical interface. Users can reach the data through
taxonomic groupings or a search feature. The Genomes
help page gives a clearly written overview of the site and
details about using the graphical interface and sequence
reports.
Because GenBank might
not contain the latest sequencing information from a specific
project, it can be useful to look at that project's Web site. You
can find such sites at the Agricultural
Genome Information System, which was produced by the Department
of Plant Biology at the University of Maryland at College Park
and the Genome Informatics group at the National Agricultural Library.
This site functions as a well-organized gateway to genome
projects. Although the focus is on livestock and commercially
important plants, other organisms are included. The page Databases at the
National Agricultural Library provides access to genome
information for various groups of organisms, as well as several
links to the Crop
Genome Databases at Cornell University.
The Agricultural Genome Information System databases, as well
as those of several other genome groups, use the so-called ACEDB
format for presenting genome information. Originally developed for
the C. elegans sequencing project, ACEDB (A C.
elegans DataBase) is a powerful graphical interface that
displays mapping and sequencing information. Spending a little
time with a tutorial or instructional site can make genome surfing
a much more efficient process. A few sites to look at are The
Plant Genome Database tutorial, the ACEDB
Documentation Library, and the ACEDB
FAQ.
The subtitle for FlyBase - A Database of
the Drosophila Genome - is an understated description of
this comprehensive resource. The FlyBase Consortium, along with
the Berkeley and European Drosophila
Genome Projects, is integrating sequence information with stock
descriptions, transposon insertion maps, gene annotation, a
bibliography, and chromosomal aberrations. Gene entries are
cross-linked with The
Interactive Fly, which includes detailed descriptions of the
biological functions of Drosophila gene products.
To most nonscientists,
the phrase "genome project" is synonymous with the Human Genome
Project. In the United States, this effort is being funded by the
Department of Energy (DOE) and
the National Human Genome
Research Institute of the National Institutes of Health. The
DOE
Human Genome Program: Research in Progress site maintains
several resources for researchers. The Human Chromosome
Launchpad, for instance, organizes links to genetic
information and databases by chromosome number. The well-organized
Links
to the Genetic World page focuses on international human
genome sites.
The Genome Database, a
repository for human sequence-related information, functions as a
"community-curated database." Its organizers at the Johns Hopkins School of
Medicine and the Bioinformatics
Supercomputing Centre at the Hospital for Sick Children in
Toronto invite researchers to add to this extensive encyclopedia
of genome material. If the many possible search
options appear overwhelming, look at the example
searches for useful suggestions on getting started.
GenomeNet, established
under Japan's Human Genome Project, is a gateway to genome
databases. Along with sequence information from humans and other
species, the site includes access to the powerful KEGG resource (Kyoto
Encyclopedia of Genes and Genomes), a database of metabolic and
regulatory pathways.
One of the few other
sequencing efforts to make it into the popular press is the C.
elegans Genome Project. In December 1998, the two
collaborators on this project, The Sanger Centre and the Genome Sequencing
Center at Washington University at St. Louis, declared
victory; they had sequenced almost 90 percent of the C.
elegans genome.
Caenorhabditis
elegans Genetics and Genomics, a part of the Caenorhabditis elegans WWW
Server, is the gateway to C. elegans genome
information. This site incorporates information from both
sequencing centers. One can run BLAST searches on released and
unreleased - that is, not yet posted to GenBank - sequences, and
multiple mirror sites provide access to ACEDB.
The Saccharomyces
Genome Database maintains a Sequence
Analysis and Tools page, yeast
maps, and the SGD
Worm-Yeast Protein Comparison. The last resource capitalizes
on the completed worm and yeast sequencing projects. Researchers can use
either C. elegans or S. cerevisiae gene names or
keywords to pull out related members from the other organism.
Results show sequence alignments and protein similarity trees.
The Pseudomonas
Genome Project - a collaboration of the Cystic Fibrosis Foundation, the University of
Washington Genome Center, and PathoGenesis Corporation -
had completed 99 percent of its sequencing task by September 1998.
Faced with the labor-intensive job of annotation, the collaborators turned to
researchers in the field and developed the Pseudomonas
aeruginosa Community Annotation Project. If successful,
this project may serve as a model for other sequencing groups.
Sequencing projects around the world are advancing methodically
toward their respective goals. The progress they make translates
into more information for researchers, as the projects accumulate new
sequences and undertake the daunting task of annotation.
Scientists who have spent months or years trying to clone a
particular gene the old-fashioned way know what a service these
groups are providing. Take advantage of this readily accessible
resource. Bookmark your favorite genome project sites and check
back regularly.
Amy
Fluet is a postdoctoral fellow at the University of Colorado
at Boulder.