Sequence Database Usage

 

(See complete list of databases maintained)
(See current download status of frequently updated databases)

Introduction
Update Procedures
Directories for Frequently Updated Sequence Databases
About the BLAST Environment Variables
Performing a Search on a Database

 

Introduction

We maintain current copies of many major FASTA-format sequence databases. Databases that change frequently are updated on a weekly schedule. Formatted copies of each database are maintained for all three major platforms available in the Bioinformatics Core Facility: the GeneMatcher2, the BlastMachine, and the UNIX hosts.

The BlastMachine and the GeneMatcher2 are the preferred platforms for sequence searches. Such searches should not normally be run directly on the UNIX hosts; the UNIX versions of the databases are provided for other tasks, e.g. retrieving sequences of interest with the command 'fastacmd'.
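
As an illustration of sequence retrieval on a UNIX host, the sketch below builds a fastacmd command line in Python. The -d (database) and -s (search string) options belong to the legacy NCBI fastacmd tool; the helper name and the identifier are hypothetical placeholders, and the command is only run where fastacmd is actually installed.

```python
import shutil
import subprocess

def fastacmd_args(database, identifier):
    """Build the argument list for a fastacmd lookup (hypothetical helper)."""
    # -d names the formatted database, -s the sequence identifier to fetch.
    return ["fastacmd", "-d", database, "-s", identifier]

args = fastacmd_args("ncbi/nt", "EXAMPLE_ID")  # placeholder identifier
print(" ".join(args))

# Only attempt the lookup on a host where fastacmd is on the PATH.
if shutil.which("fastacmd"):
    result = subprocess.run(args, capture_output=True, text=True)
    print(result.stdout)
```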

For nucleotide databases, we maintain six-frame translations, and on the GeneMatcher2 we will create codon databases for GeneWise searches on request.
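
For readers unfamiliar with the term, a six-frame translation reads a nucleotide sequence in three offsets on each strand. The following is a minimal sketch, assuming the standard genetic code; it is not the facility's own translation procedure, which is not described here, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames

print(six_frame("ATGGCC"))
```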

 

Update Procedures

For databases that change frequently, we perform automatic weekly updates. Scripts check whether a new download is needed and perform error checking on the databases that are downloaded. For example, if a database has become smaller since the last download, the administrator is notified (we have seen such faulty files appear on the NCBI ftp server). Databases that are successfully downloaded are then formatted for the various platforms. The previous version of each database is retained until the next download.

Other databases, such as the various genome database releases, are added to our collection as they become available. In general, we will keep the current release and one previous release available.
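
The size check mentioned above could look roughly like the following. This is a sketch only: the actual update scripts are not shown on this page, and the function names and the notification mechanism (a plain callback here) are assumptions.

```python
import os

def size_regressed(new_size, previous_size):
    """True if the new download is smaller than the previous one."""
    return previous_size is not None and new_size < previous_size

def check_download(new_path, previous_path, notify=print):
    """Compare file sizes and notify the administrator on a suspicious download."""
    previous_size = (os.path.getsize(previous_path)
                     if os.path.exists(previous_path) else None)
    new_size = os.path.getsize(new_path)
    if size_regressed(new_size, previous_size):
        notify(f"{new_path}: {new_size} bytes, previous copy was {previous_size} bytes")
        return False
    return True
```

A first download has nothing to compare against, so it passes the check.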

 

Directories for Frequently Updated Sequence Databases

On the Linux fileserver and on the BlastMachine, there are two top-level directories specifically for frequently updated data. The two directories are updated in alternating fashion, typically at weekly intervals. On the Linux and UNIX hosts, environment variables are automatically set to point to the current version, both for these hosts directly and for the BlastMachine. Note that once you start a terminal login session, the value of the environment variable will not change. Thus, where consistency matters, serial searches against a given database name within one session will always use the same database files, even if an update occurs during the session. However, if one remains logged in for a long time, one may needlessly be using an outdated database.

On both the local Linux/UNIX hosts and the BlastMachine, these two directories are called db1 and db2.

db1 and db2 currently contain the ncbi and embl directories. In addition, both db1 and db2 contain links back up to the higher-level directories containing static data.
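
The alternating scheme and its effect on a login session can be sketched as follows. The class and function names are hypothetical; only the directory names db1 and db2 come from the description above.

```python
def next_update_target(current):
    """Return the directory that becomes current after the next update."""
    return "db2" if current == "db1" else "db1"

class Session:
    """Captures the current directory once, as a login session does."""
    def __init__(self, current):
        self.data_dir = current  # fixed for the life of the session

current = "db1"
session = Session(current)              # user logs in while db1 is current
current = next_update_target(current)   # weekly update makes db2 current

# The existing session still points at db1; a new login would see db2.
print(session.data_dir, current)  # db1 db2
```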

Due to limited disk space on the GeneMatcher2 system, only a single copy of each database is kept. New runs will always be against the most current version of the database.

 

About the BLAST Environment Variables

UNIX hosts

The variable BLASTDB is set to the current data directory at login. This variable is used by NCBI BLAST. It contains the full path to the current data directory. The root of this directory is currently /fsnode1/databases/blastdb.

BlastMachine

The pb blastall command (run on a UNIX host) detects two environment variables, BLASTDB and PB_BLASTDB. The value of PB_BLASTDB will override that of BLASTDB. The value of PB_BLASTDB is set to the current BlastMachine data directory at login. The variable BLASTDB is used for NCBI BLAST on the UNIX hosts (see above).

On the BlastMachine, BLASTDB is referenced to the machine root, but PB_BLASTDB is referenced to the data root /paracel/paracel/pbroot. The value of PB_BLASTDB will be either db1 or db2.

The "pb ls" command ignores the environment variables. It is referenced to the location where the top-level database directories reside. Thus, "pb ls" will show the higher-level directory containing db1/ and db2/, but "pb blastall ..." will be referenced inside db1/ or db2/, depending on which is current. A typical blast command would start

pb blastall -d ncbi/nt ........

as the db1 or db2 part of the path is given by the environment variable.
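
To make the precedence concrete, the sketch below resolves a database name the way the text describes: PB_BLASTDB, when set, overrides BLASTDB and is taken relative to the data root, while BLASTDB is a full path. The function name and the example BLASTDB value are assumptions, and how pb blastall resolves paths internally is not documented here.

```python
import posixpath

PB_DATA_ROOT = "/paracel/paracel/pbroot"  # data root named in the text

def resolved_database(env, database):
    """Return the path a database name would resolve to for pb blastall."""
    pb_dir = env.get("PB_BLASTDB")
    if pb_dir:
        # PB_BLASTDB (db1 or db2) is relative to the BlastMachine data root.
        return posixpath.join(PB_DATA_ROOT, pb_dir, database)
    # Otherwise fall back to BLASTDB, which holds a full path.
    return posixpath.join(env["BLASTDB"], database)

env = {
    "PB_BLASTDB": "db2",
    "BLASTDB": "/fsnode1/databases/blastdb/db1",  # example value, assumed
}
print(resolved_database(env, "ncbi/nt"))
```

In a real session, `os.environ` would be passed in place of the example dictionary.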

GeneMatcher2

On the GeneMatcher2, dbsets are defined which point to the most current data. Larger databases are broken into several datafiles, and these datafiles are grouped together by the dbsets.
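
One way to picture a dbset is as a name that maps to a list of datafiles. The sketch below is purely illustrative: the datafile names and the DBSETS mapping are hypothetical, and the GeneMatcher2 software defines dbsets through its own configuration, which is not shown here.

```python
# Hypothetical dbset definitions: each name groups one or more datafiles.
DBSETS = {
    "nt": ["nt_part1", "nt_part2", "nt_part3"],  # a large database split into pieces
    "ARG.pep": ["ARG.pep"],                      # a small database in a single file
}

def datafiles_for(dbset_name):
    """Return the datafiles that a search against this dbset would cover."""
    return list(DBSETS[dbset_name])

print(datafiles_for("nt"))
```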

 

Performing a Search on a Database

BlastMachine and UNIX hosts

Refer to a database using the data directory and database name. The directory structure and list of filenames is available here.

Examples:

'ncbi/nt'

or

'tigr/ARG.pep'

or

genomes/human/goldenPath_Jun2002/100/chr7

or

genomes/human/goldenPath_Jun2002/100/* (all chromosome files at once).
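
The wildcard form can be thought of as selecting every database name under a directory. The following is a minimal sketch of that matching; the list of available names is hypothetical apart from chr7, and whether the wildcard is expanded by the shell or by the search tool is not stated here.

```python
import fnmatch

def expand_database_pattern(pattern, available):
    """Return the database names that match a wildcard pattern, sorted."""
    return sorted(name for name in available
                  if fnmatch.fnmatchcase(name, pattern))

# Hypothetical listing of database names relative to the data directory.
available = [
    "ncbi/nt",
    "genomes/human/goldenPath_Jun2002/100/chr7",
    "genomes/human/goldenPath_Jun2002/100/chr1",
    "genomes/human/goldenPath_Jun2002/100/chrX",
]

for name in expand_database_pattern("genomes/human/goldenPath_Jun2002/100/*", available):
    print(name)
```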

GeneMatcher2

Most databases can be referred to by dbset name alone, without a directory path; e.g. just 'nt' or 'ARG.pep' is sufficient. However, referring to individual chromosomes in a genome assembly, for example, requires use of the path. For further details see here.

 

 

 

 

 

 

 

 

 


 

 