update July 18, 2020
seqfetch.py - Given a set of GI
(UID) or ACCESSION numbers, create a file containing the
corresponding sequence entries
seqfetch.py --infile infile --outfile outfile
[--query entrez_query_statement] [--format sequence_format]
[--db NCBI_database] [--sep separator]
seqfetch.py reads infile, containing one or more DNA, RNA
or protein IDs from NCBI databases. IDs can be either GI numbers
or ACCESSION numbers. Sequences are retrieved from NCBI using
Entrez and written to outfile.
If the environment variables $BL_EMAIL and $NCBI_ENTREZ_KEY are
set, all requests to the Entrez Eutils will be processed using the
user's Entrez API key.
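A minimal sketch of how these two variables might be picked up and handed to BioPython, which exposes Bio.Entrez.email and Bio.Entrez.api_key as module-level settings. The helper names entrez_credentials and configure_entrez are hypothetical and are not taken from seqfetch.py; the script itself may handle this differently.

```python
import os


def entrez_credentials(env=None):
    """Return (email, api_key) read from BL_EMAIL and NCBI_ENTREZ_KEY.

    Hypothetical helper: unset or empty variables come back as None.
    """
    env = os.environ if env is None else env
    email = env.get("BL_EMAIL") or None
    api_key = env.get("NCBI_ENTREZ_KEY") or None
    return email, api_key


def configure_entrez(env=None):
    """Apply the credentials to Bio.Entrez (requires BioPython)."""
    # Imported here so entrez_credentials() can be used without BioPython.
    from Bio import Entrez

    email, api_key = entrez_credentials(env)
    if email:
        Entrez.email = email
    if api_key:
        Entrez.api_key = api_key
    return Entrez
```

Calling configure_entrez() once, before any request is made, would be enough for later Bio.Entrez calls to carry the email address and key.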
Note: seqfetch.py is usually run by a helper script,
seqfetch. seqfetch sets PYTHONPATH, as described below.
--query entrez_query_statement - an Entrez query statement as described in the NCBI Entrez Help.
Note: This option will only work if infile
contains GI numbers. The current implementation of NCBI's epost,
which is needed for a query, does not support ACCESSION numbers.
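Because of this restriction, it can help to check the ID list before combining it with a query. The sketch below assumes that a GI number is any ID made up only of ASCII digits; the function name all_gi_numbers is hypothetical and is not part of seqfetch.py.

```python
def all_gi_numbers(ids):
    """Return True if every ID looks like a GI number (ASCII digits only).

    Assumption: anything containing letters, dots or underscores is
    treated as an ACCESSION number. An empty list returns False.
    """
    ids = [i.strip() for i in ids]
    return bool(ids) and all(i.isascii() and i.isdigit() for i in ids)
```

A caller could refuse to run with --query, or warn the user, when this check returns False.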
would retrieve sequences less than or equal to 250,000 bp in
length, which might be useful if you were only interested in
sequences up to the size of BAC inserts, but not complete
--format sequence_format -
sequence format for output, as described in the EDirect
Formats may include
-format         -mode   Report Type
_______         _____   ___________
acc                     Accession Number
est                     EST Report
fasta           xml     TinySeq XML
fasta_cds_aa            FASTA of CDS Products
fasta_cds_na            FASTA of Coding Regions
ft                      Feature Table
gb                      GenBank Flatfile
gb              xml     GBSet XML
gbc             xml     INSDSet XML
gbwithparts             GenBank with Contig Sequences
gp                      GenPept Flatfile
gp              xml     GBSet XML
gpc             xml     INSDSet XML
gss                     GSS Report
native          text    Seq-entry ASN.1
native          xml     Bioseq-set XML
seqid                   Seq-id ASN.1
--db NCBI_database - NCBI database from which to
retrieve sequences. As described in the EDirect documentation,
databases may include
--sep separator - Character used for delimiting GI
or ACCESSION numbers in infile. Default is comma (,). This
is usually only needed if more than one ID is on a line.
infile contains a list of IDs, one per line.
Comments are lines beginning with hash symbols (#) and can be
placed anywhere in the file. Example:
# BLASTN 2.2.26+
# RID: TY6949DZ014
# Database: nr
# Fields: subject gi
# 6 hits found
# BLAST processed 1 queries
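The infile rules above (one ID per line, '#' comment lines, an optional separator when several IDs share a line) could be handled along these lines. This is only a sketch: the function name parse_ids is hypothetical, and skipping blank lines and tolerating leading whitespace before '#' are assumptions rather than documented behaviour.

```python
def parse_ids(lines, sep=","):
    """Collect IDs from an iterable of lines.

    Lines starting with '#' are comments; blank lines are skipped
    (an assumption). Several IDs on one line are split on `sep`.
    """
    ids = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Drop empty tokens left by trailing or doubled separators.
        ids.extend(tok.strip() for tok in line.split(sep) if tok.strip())
    return ids
```

With a file on disk, the function could be called as parse_ids(open("infile")), passing the value given to --sep as the second argument.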
PYTHONPATH (required) - Path to BioPython. ncbiquery
sets PYTHONPATH to a platform-specific directory containing
BioPython, and then runs seqfetch.py. If you run seqfetch.py
directly, you need to set PYTHONPATH manually.
BL_EMAIL (required) - Email address to accompany requests to NCBI
Entrez. Required by NCBI
NCBI_ENTREZ_KEY (optional) - Unique identifier for requests to
NCBI Entrez. If no key is supplied, you may get slower retrieval
times. If you make a large number of requests (e.g., more than 3 per
second) you must supply a key, or NCBI will ramp down your future
requests. See NCBI
Eutil API keys.
New API Keys for the E-utilities
NCBI Entrez Help Manual at http://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Entrez_Searching_Options.
NCBI E-utilities Quick Start http://www.ncbi.nlm.nih.gov/books/NBK25500
BioPython Bio.Entrez at http://biopython.org/DIST/docs/api/Bio.Entrez-module.html
1. Retrieval of large numbers of sequences from NCBI in
a single job may not always work. We should revise seqfetch.py to
break up retrievals into chunks of perhaps 500 or 1000 sequences
at a time.
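The chunking proposed here could look roughly like the following. It is a sketch of the idea, not code from seqfetch.py; the name chunked and the default batch size of 500 are illustrative choices.

```python
def chunked(ids, size=500):
    """Yield successive lists of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

Each batch could then be fetched separately, for example by joining it with commas and passing it as the id argument of Bio.Entrez.efetch, and the results appended to outfile in order.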
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2