In Situ

Sequencing Sites

by Amy Fluet

(Posted April 16, 1999 · Issue 52)

The Human Genome Project may get most of the attention from the popular press, but scientists have also been busy generating sequence data from 32,000 other species. These sequences come from well-organized, multicenter sequencing projects and from the efforts of individual researchers. Some projects are still in the preliminary, but important, mapping stages. Others, including those for E. coli and Saccharomyces cerevisiae, have completed their sequencing efforts.

The availability of this information on the Web has changed how researchers work. Without lifting a pipetman, they can virtually clone the C. elegans homologue of a Drosophila transcription factor, look for disease-resistance genes in maize, or compare virulence factors in Pseudomonas species. The Web provides free access to much of this sequencing information, but it can take a little work to find the sites and learn the various programs used for displaying sequence data.

The best place to start when looking for nucleotide sequences is the National Center for Biotechnology Information (NCBI) at the National Library of Medicine. This site provides several sequence search and comparison options. Visitors can navigate easily among the different services, which often use similar formats for data input.

Late in 1992, NCBI became responsible for maintaining the GenBank database of nucleotide sequences. Visitors get several options for searching GenBank, including the Entrez browser. With Entrez, users can limit this text-based search to specific text fields, such as organism or accession, and can refine search results to narrow their focus.
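For readers who prefer to script such field-limited searches, here is a minimal sketch in Python. It assumes NCBI's E-utilities `esearch` endpoint and the bracketed field-tag syntax (for example `[Organism]` or `[Accession]`); the helper names `build_entrez_query` and `build_esearch_url` are hypothetical, invented for this illustration. The code only builds the search URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI's E-utilities "esearch" service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_query(terms):
    """Join (value, field) pairs into one Entrez query string.

    A field of None leaves the value as an unrestricted text term.
    """
    parts = []
    for value, field in terms:
        parts.append(f"{value}[{field}]" if field else value)
    return " AND ".join(parts)


def build_esearch_url(term, db="nucleotide", retmax=20):
    """Return a search URL for the given query; no request is sent."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Limit one term to the organism field and leave the other as free text.
term = build_entrez_query([
    ("Caenorhabditis elegans", "Organism"),
    ("transcription factor", None),
])
print(term)
print(build_esearch_url(term))
```

Adding another `(value, field)` pair narrows the search further, much as refining results does in the Entrez browser.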

Another method of locating nucleotide sequences from GenBank and other databases is a search tool called BLAST (Basic Local Alignment Search Tool). This search takes a sequence of interest - either protein or nucleotide - and pulls out similar sequences from a specified database. The Advanced BLAST page allows organism-specific searches. NCBI's Microbial Genomes BLAST Databases is a search tool for unfinished microbial genome projects.
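To give a feel for what "pulling out similar sequences" means, the sketch below scores local alignments between a query and each entry in a tiny made-up database, then ranks the entries. This is not BLAST itself: BLAST uses fast heuristics, whereas this is the exhaustive dynamic-programming approach (Smith-Waterman) that such tools approximate. The scoring values, the function names, and the example sequences are all assumptions chosen for illustration.

```python
def local_alignment_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    Scores never drop below zero, so an alignment can start and end
    anywhere in either sequence.
    """
    cols = len(subject) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            step = match if query[i - 1] == subject[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_database(query, database):
    """Return (score, name) pairs for every entry, best score first."""
    scored = [(local_alignment_score(query, seq), name)
              for name, seq in database.items()]
    return sorted(scored, reverse=True)


# Hypothetical toy database; real searches run against GenBank-scale data.
toy_db = {"entry_a": "GGACGTGG", "entry_b": "CCCCCCCC"}
for score, name in rank_database("ACGT", toy_db):
    print(name, score)
```

The exhaustive method compares every position of the query against every position of each database sequence, which is why heuristic tools are needed for databases of any real size.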

If a search in Entrez pulls up an overwhelming number of entries, you might try the UniGene database. UniGene currently covers only human, mouse, and rat sequences, and the database for each organism contains a "nonredundant set of gene-oriented clusters." An entry for a specific gene might include links to expressed sequence tags (ESTs), mRNA sequences, expression information, and the relevant Online Mendelian Inheritance in Man (OMIM) entry.

A few other useful sites from NCBI are the Taxonomy Browser, the Database of Expressed Sequence Tags (dbEST), which contains more than two million ESTs, and Entrez Genomes. This last resource displays sequence data from completed genome projects, as well as mapping information from others, using a graphical interface. Taxonomic groupings and a search feature are both used to access the data. The Genomes help page gives a clearly written overview of the site and details about using the graphical interface and sequence reports.

Because GenBank might not contain the latest sequencing information from a specific project, it can be useful to look at that project's Web site. You can find such sites at the Agricultural Genome Information System, which was produced by the Department of Plant Biology at the University of Maryland at College Park and the Genome Informatics group at the National Agricultural Library. This site functions as a well-organized gateway to genome projects. Although the focus is on livestock and commercially important plants, other organisms are included. The page Databases at the National Agricultural Library provides access to genome information for various groups of organisms, as well as several links to the Crop Genome Databases at Cornell University.

The Agricultural Genome Information System databases, as well as those of several other genome groups, use the so-called ACEDB format for presenting genome information. Originally developed for the C. elegans sequencing project, ACEDB (A C. elegans DataBase) is a powerful graphical interface that displays mapping and sequencing information. Spending a little time with a tutorial or instructional site can make genome surfing a much more efficient process. A few sites to look at are The Plant Genome Database tutorial, the ACEDB Documentation Library, and the ACEDB FAQ.

The subtitle for FlyBase - A Database of the Drosophila Genome - is an understated description of this comprehensive resource. The FlyBase Consortium, along with the Berkeley and European Drosophila Genome Projects, is integrating sequence information with stock descriptions, transposon insertion maps, gene annotation, a bibliography, and chromosomal aberrations. Gene entries are cross-linked with The Interactive Fly, which includes detailed descriptions of the biological functions of Drosophila gene products.

To most nonscientists, the phrase "genome project" is synonymous with the Human Genome Project. In the United States, this effort is being funded by the Department of Energy (DOE) and the National Human Genome Research Institute of the National Institutes of Health. The DOE Human Genome Program: Research in Progress site maintains several resources for researchers. The Human Chromosome Launchpad, for instance, organizes links to genetic information and databases by chromosome number. The well-organized Links to the Genetic World page focuses on international human genome sites.

The Genome Database, a repository for human sequence-related information, functions as a "community-curated database." Its organizers at the Johns Hopkins School of Medicine and the Bioinformatics Supercomputing Centre at the Hospital for Sick Children in Toronto invite researchers to add to this extensive encyclopedia of genome material. If the many possible search options appear overwhelming, look at the example searches for useful suggestions on getting started.

GenomeNet, established under Japan's Human Genome Project, is a gateway to genome databases. Along with sequence information from humans and other species, the site includes access to the powerful KEGG resource (Kyoto Encyclopedia of Genes and Genomes), a database of metabolic and regulatory pathways.

One of the few other sequencing efforts to make it into the popular press is the C. elegans Genome Project. In December 1998, the two collaborators on this project, The Sanger Centre and the Genome Sequencing Center at Washington University in St. Louis, declared victory; they had sequenced almost 90 percent of the C. elegans genome.

Caenorhabditis elegans Genetics and Genomics, a part of the Caenorhabditis elegans WWW Server, is the gateway to C. elegans genome information. This site incorporates information from both sequencing centers. One can run BLAST searches on released and unreleased - that is, not yet posted to GenBank - sequences, and multiple mirror sites provide access to ACEDB.

The Saccharomyces Genome Database maintains a Sequence Analysis and Tools page, yeast maps, and the SGD Worm-Yeast Protein Comparison. This last resource capitalizes on the completed worm and yeast sequencing projects. Researchers can use either C. elegans or S. cerevisiae gene names or keywords to pull out related members from the other organism. Results show sequence alignments and protein similarity trees.

The Pseudomonas Genome Project - a collaboration of the Cystic Fibrosis Foundation, the University of Washington Genome Center, and PathoGenesis Corporation - had completed 99 percent of its sequencing task by September 1998. Faced with the labor-intensive job of annotation, the collaborators turned to researchers in the field and developed the Pseudomonas aeruginosa Community Annotation Project. If successful, this project may serve as a model for other sequencing groups.

Sequencing projects around the world are advancing methodically toward their respective goals. The progress they make translates into more information for researchers, as they accumulate new sequences and undertake the daunting task of annotation. Scientists who have spent months or years trying to clone a particular gene the old-fashioned way know what a service these groups are providing. Take advantage of this readily accessible resource. Bookmark your favorite genome project sites and check back regularly.

Amy Fluet is a postdoctoral fellow at the University of Colorado at Boulder.


Endlinks

Plant & Animal Genome VIII Conference - will be held January 9-13, 2000, in San Diego, California. Past abstracts are available online.

The Institute for Genomic Research - another large sequencing center; has instituted several sequencing projects. See the frequently asked questions about the TIGR Gene Indices for an overview of the site's contents.

ExPASy Molecular Biology Server - a large collection of online programs for analyzing protein sequences.

DNA and Protein Analysis Toolkit - well-annotated links to online tools for sequence analysis, organized by function.

CLUE - a wide collection of useful links for sequencing information.
