BIODIVERSITY AND MOLECULAR SYSTEMATIC DATABASES

Marc W. Allard, Louis Weintraub Assistant Professor of Biology

343 Lisner Hall, Department of Biological Sciences, The George Washington University, Washington, DC 20052

Tel (202)994-7065 (office), Fax (202)994-6100

e-mail mwallard@gwu.edu

web http://www.gwu.edu/~clade/faculty/allard

Key words: databases, molecular systematics, and biodiversity.

This paper is in press in a symposium volume of the First International Conference on Biodiversity and Renewable Natural Resources Preservation, Al Akhawayn University, Ifrane, Morocco.

© Copyright 1999 by Marc W. Allard. All Rights Reserved.

Introduction

To better understand biodiversity, many scientists are collecting information on the genetic variation found among organisms. These genetic data are often used to supplement the morphological variation observed among taxa, providing additional characters with which to distinguish them. Sequence data have become a powerful tool for understanding the systematic relationships of taxa and for assisting managers in the protection and conservation of threatened and endangered organisms. While most of the genetic data found in international databases are from laboratory organisms, crop plants, domestic animals and humans, this information is nonetheless extremely valuable for comparative molecular approaches. This contribution is a brief review of the databases that have been used for studies of biodiversity and systematics.

Searching information by subject

Probably one of the simplest methods of searching the world wide web is to choose key words such as molecular biology or biodiversity. A broad search of this sort on the internet will turn up a large number of databases. An alternative approach is to go to a central web site at a university, such as Pedro's Biomolecular Research Tools (http://www.iastate.edu/~pedro/research_tools.html), Berkeley Phylogenetic Resources (U.C. Berkeley), or the Harvard Biological Links site (http://www.mcb.harvard.edu/BioLinks.html). Most large museums and research institutes also maintain web sites. A number of these web sites are listed in Table 1 for convenience. As you explore each site, remember to read the "read me" and "help" files associated with each database, as many of these data files are subsets of larger databases. For example, GenBank (http://www.ncbi.nlm.nih.gov/Web/Genbank/) is actually a consortium of databases. It includes the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory databases (EMBL), and GenBank at the National Center for Biotechnology Information (NCBI). If you examine Entrez, another even larger database at the NCBI (http://www.ncbi.nlm.nih.gov/Entrez/), you can search all of the GenBank database as well as numerous other databases, including Medline, SWISS-PROT, PIR, PDB, PRF, dbEST, and those of the US and European patent offices.

The NCBI maintains the largest collection of public databases in the USA, and its web site is constantly being updated with new information and resources. For example, in the last few years the NCBI has added an interactive taxonomy database that lists the classification of every species represented in the database. This allows investigators to find which proteins and sequences are available for a particular taxonomic group. The NCBI has also created a new database for submitting sequence alignments and provides other resources as well. The best way to keep track of all of the changes at NCBI is to read its online newsletter (http://www.ncbi.nlm.nih.gov/Web/Newsltr/index.html).

One of the main messages I wish to give readers here is that there is a large quantity of information on the internet that is either directly or indirectly related to biodiversity and molecular systematics. This short contribution introduces colleagues to only a small fraction of the resources available on the internet. After visiting the addresses provided here, I hope that investigators will go on to do their own web explorations.

Database structure

Virtually all of the publicly available genetic data is present in the largest of the databases, such as GenBank, Entrez, and PIR. Many of the other databases are subdivisions of these large databases. Subdivisions within GenBank itself are mostly separated by animal group. For example, much of the data comes from either the primate or rodent subdivisions, and within these sections most of the data comes from human, lab rat or mouse. These are the taxa that are most commonly sequenced because of their importance to the biomedical community. Other specialized databases that are taxon specific, and often have their own web addresses, include the following: Drosophila, maize, rice, C. elegans, human, fungus, yeast, zebrafish, various prokaryotes, and rat and mouse (http://www.rodentia.com/wmc/). There are other organismal subdivisions as well, and their number continues to grow as the field of genomics expands.

Databases are also organized by specific genes, such as rRNAs (http://www-rrna.uia.ac.be/rrna/index.html), and by functional listings, such as cloning vectors, protein folding structures, and regulatory elements. There are also sites with chromosomal mapping information and expressed sequence tag databases, such as dbEST (http://www.ncbi.nlm.nih.gov/dbEST/index.html). Each of these databases represents a specialized user group whose members have organized the subset of the data that they find most valuable to their ongoing research.

Software relating to biodiversity and molecular systematics is available on the internet as well, both as downloadable shareware and as software that one may run interactively from a personal computer. Basically, you can either download software applications for use on your own computer, or send your data to someone else’s computer (a server), which does the calculations and sends back the results. A server is a computer designated and set up to do a particular task, usually in an automated way. One shareware application, ClustalX, conducts multiple sequence alignments (http://www.csc.fi/molbio/progs/clustalw/clustalw.html). To analyze the resulting alignments with parsimony-based methods, explore the software available at the Willi Hennig Society web page (http://www.zoo.toronto.edu/~mes/hennig/hennig.html). Not all of the software is free, but much of it may be obtained for a limited fee.
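To give a sense of what a parsimony-based analysis computes, the following is a minimal sketch in Python. It assumes Fitch's method for counting state changes on a fixed, rooted binary tree; this paper does not say which algorithm any particular package uses, and the function names, tree, and sequences are hypothetical illustrations rather than real data.

```python
from typing import Dict, Set, Tuple, Union

# A tree is either a taxon name (leaf) or a pair of subtrees (internal node).
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_length(tree: Tree, states: Dict[str, str]) -> int:
    """Minimum number of changes for one character on a fixed binary tree."""

    def visit(node: Tree) -> Tuple[Set[str], int]:
        if isinstance(node, str):
            # Leaf: its state set is the observed state, at no cost.
            return {states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        shared = left_set & right_set
        if shared:
            # The subtrees agree on at least one state: no extra change.
            return shared, left_cost + right_cost
        # The subtrees disagree: take the union and count one change.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def alignment_length(tree: Tree, alignment: Dict[str, str]) -> int:
    """Sum the per-site lengths over every column of an alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_length(tree, {taxon: seq[i] for taxon, seq in alignment.items()})
        for i in range(n_sites)
    )


if __name__ == "__main__":
    # Made-up four-taxon example, for illustration only.
    example_tree = (("A", "B"), ("C", "D"))
    example_alignment = {"A": "AC", "B": "AC", "C": "GC", "D": "GT"}
    print(alignment_length(example_tree, example_alignment))
```

Dedicated packages search over many candidate trees and keep the shortest; this sketch only scores a single tree supplied by the user.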

Whenever you log onto a database, always read the short introductory "read me" file. These files may be the only readily available information on how to use the database, and they may be the only explicit statement describing what is actually being done to the data you submit. These files will also describe the format that your data should be in for proper processing.

Note that several different web sites will often allow one to use the same database, although each site may have its own specialized options. Detailed information on the specific use of the database is listed in the help and read me files, and these instructions should be followed. In fact, as many of these web sites change frequently, one should regularly check the read me files, as they are the most up-to-date information on the best use of the database of interest. Most sites are constantly being updated and modified, and thus one should expect significant improvements every year. I expect this trend to continue for the next 10 years or so as technological improvements continue to advance at a rapid rate.

Depositing genetic information

Two of the most useful things you can do with the available databases are to download or to deposit genetic information. GenBank provides several automatic services. Most of these require simple typed instructions to be included in an email to a particular server. In many cases one may send or receive data directly on the web page as well. Most of these web sites have information available to anyone who wishes to use their programs. For example, to get information on BLAST searches, just send the email message "help" to blast-help@ncbi.nlm.nih.gov, or go to the web address http://www.ncbi.nlm.nih.gov/BLAST/.
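The help request can also be composed programmatically. The following is a minimal sketch in Python that uses only the address and message text given in this paper. The function name build_help_request and the sender address are hypothetical, and the message is built but not sent, since the server may no longer accept such mail.

```python
from email.message import EmailMessage


def build_help_request(sender: str) -> EmailMessage:
    """Compose the one-word "help" message for the BLAST e-mail server."""
    msg = EmailMessage()
    msg["From"] = sender  # hypothetical sender address supplied by the caller
    msg["To"] = "blast-help@ncbi.nlm.nih.gov"  # address given in the text
    msg.set_content("help")  # the entire body is the single word "help"
    return msg


if __name__ == "__main__":
    # Print the message so it can be inspected before sending it by hand.
    print(build_help_request("investigator@example.org"))
```

Keeping message construction separate from delivery lets an investigator check the text against the server's current read me file before anything is mailed.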

Searching databases by comparing one sequence to all of the publicly available sequences is also extremely useful, especially if you are looking for new data with which to compare your own sequences. Basic searches can be accomplished with the BLAST program. By looking at the best 50 to 100 matches to a newly collected sequence, you can get a good idea of what is available in the database. If the goal is to identify an unknown sequence, then there is a good chance that comparing it to the databases will help to determine what kind of gene it is and whether or not it is new.
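Once a search has returned its matches, keeping only the best 50 to 100 is a simple sorting step. The sketch below assumes each match has already been reduced to an identifier and a similarity score, with higher scores being better. The Hit record and its field names are hypothetical and do not correspond to any particular BLAST output format.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    accession: str  # hypothetical identifier of the matched database entry
    score: float    # similarity score from the search; higher is better


def best_matches(hits: List[Hit], limit: int = 50) -> List[Hit]:
    """Return the `limit` highest-scoring hits, best first."""
    return sorted(hits, key=lambda h: h.score, reverse=True)[:limit]


if __name__ == "__main__":
    # Made-up hits for illustration only; these are not real search results.
    example = [Hit("entry_a", 120.0), Hit("entry_b", 310.5), Hit("entry_c", 87.2)]
    for hit in best_matches(example, limit=2):
        print(hit.accession, hit.score)
```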

The basic BLAST search needs a few simple commands and a sequence to compare to the database. One should select the type of search to be done (nucleotide or protein data), the database to search (Data library), and the number of matches to examine or the percent sequence similarity allowed. The search is currently available directly on the web as well as by email, adding flexibility depending on the computer connection that is available to an investigator.
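To make the idea of a percent similarity threshold concrete, here is a minimal sketch in Python of one simple measure, percent identity between two sequences that have already been aligned. This is an illustrative definition only. A search program may define and compute similarity differently, and the function name is hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions at which two sequences agree.

    Both sequences must already be aligned to the same length. Positions
    where either sequence has a gap ("-") are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # ignore gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Three of the four compared positions agree.
    print(percent_identity("ACGT", "ACGA"))
```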

Alignments

Recently, a few databases, such as those at EBI and NCBI, have also begun to accept sequence alignment submissions. One should look out for DS numbers in the literature, as these designate the accession number of an alignment deposited in the EBI database. Try sending an email to netserv@ebi.ac.uk with the message Send file ALIGN:DS35643.DAT. These databases are also automated so that one can download published alignments. Many scientific societies and journals now require authors to put their data and alignments into public databases so that the results can be duplicated and all of the evidence is available to the scientific community.
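The request line for a deposited alignment can be generated from its DS number. The sketch below follows the single example given in this paper (DS35643) and assumes that every alignment accession is "DS" followed by digits. That pattern and the function name are assumptions, and the server's own help file remains the authority on the accepted format.

```python
import re

# Assumption: alignment accessions are "DS" followed by digits, as in DS35643.
_DS_PATTERN = re.compile(r"DS\d+")


def alignment_request(accession: str) -> str:
    """Build the one-line e-mail request for a deposited alignment."""
    accession = accession.strip().upper()
    if not _DS_PATTERN.fullmatch(accession):
        raise ValueError(f"not a DS accession number: {accession!r}")
    return f"Send file ALIGN:{accession}.DAT"


if __name__ == "__main__":
    # Reproduces the message text quoted in this paper.
    print(alignment_request("DS35643"))
```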

Systematists, conservationists and those interested in biodiversity are starting to catalogue the databases that they find most useful (Blake et al. 1994, Ashburner and Goodman 1997). They are also starting to build their own databases to make valuable systematic information available to a large user group. Some of the systematic databases that one might wish to explore include tree and matrix databases from the published systematic literature (http://phylogeny.arizona.edu/tree/phylogeny.html, http://herbaria.harvard.edu/treebase/). These include both sequence alignments and morphological matrices. Museum and university collections are also being cross-listed so that one will know what specimens are available from the various collections (http://research.amnh.org/entomology/, http://biodiversity.uno.edu/). Systematic literature databases are also available (http://www.zoo.toronto.edu/~mes/hennig/education.html).

Acknowledgments

This research was funded by NSF grant DEB-9629319 to MWA. Jim Clark and Diana Lipscomb provided helpful editorial suggestions. The George Washington University and Al Akhawayn University generously provided support for me to attend this conference on biodiversity.

References

Ashburner, M., and N. Goodman. 1997. Informatics-genome and genetic databases. Current Opinion in Genetics & Development 7:750-756.

Blake, J. A., C. J. Bult, M. J. Donoghue, J. Humphries, and C. Fields. 1994. Interoperability of Biological Data Bases: A meeting report. Syst. Biol. 43:585-589.

Table 1. Biodiversity and molecular systematic databases available on the world wide web.

Phylogenetic databases and informative sites

Berkeley Phylogenetic Resources (U.C. Berkeley): http://www.ucmp.berkeley.edu/subway/phylo/phylodat.html
http://loco.biology.unr.edu/archives/rasa/list_of_sites.html
http://www.bio.net/hypermail/MOLECULAR-EVOLUTION/
http://www.public.iastate.edu/~pedro/research_tools.html
http://www.mcb.harvard.edu/BioLinks.html
http://phylogeny.arizona.edu/tree/phylogeny.html
http://herbaria.harvard.edu/treebase/

Phylogenetic and alignment software

http://www.csc.fi/molbio/progs/clustalw/clustalw.html
http://dot.imgen.bcm.tmc.edu:9331/seq-util/Options/readseq.html
http://dot.imgen.bcm.tmc.edu:9331/seq-search/protein-search.html
http://www.zoo.toronto.edu/~mes/hennig/hennig.html

NCBI/Entrez databases/PubMed

http://www.ncbi.nlm.nih.gov/
http://weber.u.washington.edu/~hhmiseq/analysis.html
http://www.ebi.ac.uk/

Organismal biodiversity sites

http://ibs.uel.ac.uk/ibs/
http://nmnhwww.si.edu/msw/
http://www.gwu.edu/~clade/peet.html
http://www.nhm.ukans.edu/~peet/
http://biodiversity.uno.edu/
http://darwin.eeb.uconn.edu/systematics.html
http://www.amnh.org/science/biodiversity/index.html
http://www.conservation.org/SCIENCE/NASA/