Sequence Database Usage
(See complete list of databases maintained)
(See current download status of frequently updated databases)
Introduction
Update Procedures
Directories for Frequently Updated Sequence Databases
About the BLAST Environment Variables
Performing a Search on a Database
Introduction
We maintain current copies of
many major FASTA-format sequence databases. Databases
that change frequently are updated on a weekly schedule.
Formatted copies of each database are maintained
for all three major platforms available in the Bioinformatics
Core Facility: the GeneMatcher2, the BlastMachine,
and the UNIX hosts.
The BlastMachine and the GeneMatcher2
are the preferred platforms for sequence searches.
Such searches should not normally be run directly
on the UNIX hosts, but these versions of the databases
are available there, e.g., to retrieve sequences of interest
using the command 'fastacmd'.
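As an illustration of such a retrieval, here is a minimal sketch. It assumes the NCBI fastacmd options -d (database), -p F (nucleotide database), and -s (accession or GI), and that the database name is resolved through BLASTDB. The helper fetch_seq, the DRY_RUN switch, and the accession are hypothetical placeholders, not part of the facility's setup.

```shell
#!/bin/sh
# Retrieve one nucleotide sequence by accession from a formatted database.
# fetch_seq is a hypothetical helper; set DRY_RUN=1 to print the command
# instead of running it.
fetch_seq() {
    db=$1     # database name relative to $BLASTDB, e.g. ncbi/nt
    acc=$2    # accession or GI to look up (placeholder in the example below)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "fastacmd -d $db -p F -s $acc"
    else
        fastacmd -d "$db" -p F -s "$acc"
    fi
}

# Example call, writing the FASTA record to a file:
# fetch_seq ncbi/nt X00000 > X00000.fa
```

Protein databases would presumably use -p T in place of -p F.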
For nucleotide databases, we maintain
six-frame translations, and on the GeneMatcher2
we will create codon databases for GeneWise searches
on request.
Update Procedures
For databases that change frequently,
we perform automatic weekly updates. Scripts check
whether a new download is needed, and perform error
checking on databases that are downloaded. For example,
if a database has become smaller since the last
download, the administrator is notified (we have
seen such faulty files appear on the NCBI ftp server).
Databases that are successfully downloaded are then
formatted for the various platforms. The previous
version of each database is retained until the next
download. Other databases, such as the various genome
database releases, are added to our collection as
they become available. In general, we keep the
current release and one previous release available.
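The size check described above could look roughly like the following sketch. It is not the facility's actual script: the function name check_not_smaller and the file names in the usage comment are hypothetical, and the echo only stands in for notifying the administrator.

```shell
#!/bin/sh
# Succeed (exit status 0) if the new download is at least as large as the
# previous copy; fail otherwise. check_not_smaller is a hypothetical name.
check_not_smaller() {
    new=$1
    old=$2
    # No previous copy: nothing to compare against, so accept the download.
    [ -f "$old" ] || return 0
    # Arithmetic expansion strips any padding that wc may emit.
    new_size=$(($(wc -c < "$new")))
    old_size=$(($(wc -c < "$old")))
    [ "$new_size" -ge "$old_size" ]
}

# Example use inside an update script (file names are placeholders):
# if ! check_not_smaller nt.new nt.prev; then
#     echo "nt is smaller than the previous download" >&2
# fi
```

A byte count is only a coarse sanity check; it catches truncated files but not corrupted ones of the same size.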
Directories for Frequently Updated Sequence Databases
On the Linux fileserver and on
the BlastMachine, there are two top-level directories
specifically for frequently updated data. The two
directories are updated in alternating fashion,
typically at weekly intervals. On the Linux and
UNIX hosts, environment variables are automatically
set to point to the current version, both for these
hosts directly and for the BlastMachine. Note
that once you start a terminal login session, the
value of the environment variable will not change.
Thus, when consistency matters, one can be sure that
serial searches against a given database name will
always use the same database files, even if an update
occurs during the session. However, if one were
to remain logged in for a long time, one might needlessly
be using an outdated database; a new login session
picks up the current directory.
On both the local Linux/UNIX hosts
and the BlastMachine, these two directories are
called db1 and db2.
db1 and db2 currently contain
the ncbi and embl directories. As well, both db1
and db2 contain links back up to the higher level
directories containing static data.
Due to limited disk space on the
GeneMatcher2 system, only a single copy of each
database is kept. New runs will always be against
the most current version of the database.
About the BLAST Environment Variables
UNIX hosts
The variable BLASTDB is set
to the current data directory at login. This variable
is used by NCBI BLAST. It contains the full path
to the current data directory. The root of this
directory is currently /fsnode1/databases/blastdb.
BlastMachine
The pb blastall command (run on a UNIX host) detects
two environment variables, BLASTDB and PB_BLASTDB.
The value of PB_BLASTDB will override that of
BLASTDB. The value of PB_BLASTDB is set to the
current BlastMachine data directory at login.
The variable BLASTDB is used for NCBI BLAST on
the UNIX hosts (see above).
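The precedence rule can be pictured with the small sketch below. The helper effective_blastdb is hypothetical and not part of the pb tools; it only mirrors the rule stated above, and it assumes an empty PB_BLASTDB is treated the same as an unset one.

```shell
#!/bin/sh
# Print the database location that applies under the rule above:
# PB_BLASTDB wins when it is set, otherwise BLASTDB is used.
# effective_blastdb is a hypothetical helper for illustration only.
effective_blastdb() {
    echo "${PB_BLASTDB:-${BLASTDB:-}}"
}

echo "pb blastall would search under: $(effective_blastdb)"
```

Running this in a login session is a quick way to see whether db1 or db2 is in effect for that session.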
On the BlastMachine, BLASTDB
is referenced to the machine root, but PB_BLASTDB
is referenced to the data root /paracel/paracel/pbroot.
The value of PB_BLASTDB will be either db1 or
db2.
The "pb ls" command
ignores the environment variables. It is referenced
to where the top-level database directories reside.
Thus, "pb ls" will show the higher-level
directory containing db1/ and db2/, but "pb blastall ..."
will be referenced inside db1/ or db2/,
depending on which is current. A
typical BLAST command would start
pb blastall -d ncbi/nt ........
as the db1 or db2 part of the
path is given by the environment variable.
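A fuller command line might be composed as in the sketch below. It assumes that pb blastall accepts the standard NCBI blastall options -p (program), -i (query file), and -o (output file) alongside -d, which this page does not confirm. The helper build_pb_search and the file names query.fa and hits.out are hypothetical.

```shell
#!/bin/sh
# Compose a pb blastall command line for a nucleotide search.
# The database is named relative to db1/ or db2/; PB_BLASTDB supplies that part.
# build_pb_search is a hypothetical helper that only prints the command.
build_pb_search() {
    db=$1       # database name, e.g. ncbi/nt
    query=$2    # FASTA query file (placeholder name)
    out=$3      # output file (placeholder name)
    echo "pb blastall -p blastn -d $db -i $query -o $out"
}

build_pb_search ncbi/nt query.fa hits.out
```

The database argument contains no db1 or db2 component, so the same command keeps working across weekly updates.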
GeneMatcher2
On the GeneMatcher2, dbsets
are defined which point to the most current data.
Larger databases are broken into several datafiles,
and these datafiles are grouped together by the
dbsets.
Performing a Search on a Database
BlastMachine and UNIX hosts
Refer to a database using the data directory and database
name. The directory structure and list of filenames
are available here.
Examples:
'ncbi/nt'
or
'tigr/ARG.pep'
or
genomes/human/goldenPath_Jun2002/100/chr7
or
genomes/human/goldenPath_Jun2002/100/*
(all chromosome files at once).
GeneMatcher2
Most databases can be referred
to using a dbset name, without need for a directory
path; e.g., just 'nt' or 'ARG.pep' is sufficient.
However, referring to individual chromosomes
in a genome assembly, for example, requires
use of the path. For further details see here.