Updated July 18, 2020

seqfetch.py - Given a set of GI (UID) or ACCESSION numbers, create a file containing the corresponding sequence entries

seqfetch.py --infile infile --outfile outfile [--query entrez_query_statement] [--format sequence_format] [--db NCBI_database] [--sep separator]

seqfetch.py reads infile, which contains one or more DNA, RNA or protein IDs from NCBI databases. IDs can be either GI numbers or ACCESSION numbers. Sequences are retrieved from NCBI using NCBI Entrez and written to outfile.
If the environment variables $BL_EMAIL and $NCBI_ENTREZ_KEY are set, all requests to the Entrez Eutils will be processed using the user's Entrez API key.
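
As a rough illustration of the rule above, the following sketch collects the two environment variables into the parameters that accompany an E-utilities request. The function name entrez_credentials is hypothetical and is not taken from seqfetch.py; email and api_key are the parameter names accepted by the E-utilities.

```python
import os


def entrez_credentials(environ=os.environ):
    """Return the E-utilities parameters implied by the environment.

    Hypothetical helper for illustration; not part of seqfetch.py.
    """
    params = {}
    email = environ.get("BL_EMAIL", "").strip()
    key = environ.get("NCBI_ENTREZ_KEY", "").strip()
    if email:
        params["email"] = email
    # Following the text above, the key is used only when both
    # variables are set.
    if email and key:
        params["api_key"] = key
    return params
```

With BioPython, the same two values would normally be assigned to Entrez.email and Entrez.api_key before any request is made.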


Note: seqfetch.py is usually run by a helper script, seqfetch. seqfetch sets PYTHONPATH, as described below.

--query entrez_query_statement  - an Entrez query statement as described in the NCBI Entrez Help.

Note: This option will only work if infile contains GI numbers. The current implementation of NCBI's epost, which is needed for a query, does not support ACCESSION numbers.

--query '1:250000[SLEN]'

would retrieve sequences less than or equal to 250,000 bp in length, which might be useful if you were only interested in sequences up to the size of BAC inserts, but not complete chromosomes.
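
One way a query can be applied to a set of posted IDs is sketched below. This is an assumption about the general E-utilities mechanism, not a description of what seqfetch.py does internally: epost stores the IDs on the history server and returns a WebEnv string and a query key, and a later esearch can refer to that stored set as #N inside its term. The function name history_filter_params is hypothetical.

```python
def history_filter_params(db, web_env, query_key, query):
    """Parameters for an esearch call that filters an epost result set.

    Hypothetical helper; web_env and query_key are the values returned
    by a previous epost call.
    """
    return {
        "db": db,
        "WebEnv": web_env,
        "usehistory": "y",
        # '#N' refers to result set N stored on the history server.
        "term": f"#{query_key} AND {query}",
    }
```

For the example above, the term sent to esearch would be '#1 AND 1:250000[SLEN]' if epost returned query key 1.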

--format sequence_format - sequence format for output, as described in the EDirect Appendices.
Formats may include:
-format         -mode    Report Type
_______         _____    ___________
acc                      Accession Number
est                      EST Report
fasta                    FASTA
fasta           xml      TinySeq XML
fasta_cds_aa             FASTA of CDS Products
fasta_cds_na             FASTA of Coding Regions
ft                       Feature Table
gb                       GenBank Flatfile
gb              xml      GBSet XML
gbc             xml      INSDSet XML
gbwithparts              GenBank with Contig Sequences
gp                       GenPept Flatfile
gp              xml      GBSet XML
gpc             xml      INSDSet XML
gss                      GSS Report
native          text     Seq-entry ASN.1
native          xml      Bioseq-set XML
seqid                    Seq-id ASN.1
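
To show how a -format/-mode pair from the list above might reach NCBI, the sketch below builds an efetch URL without sending it. It assumes that -format corresponds to the efetch rettype parameter and -mode to retmode; the function name efetch_url and the placeholder IDs are hypothetical.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db, ids, fmt, mode=None):
    """Build an efetch URL for a list of IDs (no request is sent)."""
    params = {"db": db, "id": ",".join(ids), "rettype": fmt}
    if mode:  # e.g. "xml" for the 'fasta xml' (TinySeq XML) row
        params["retmode"] = mode
    return EFETCH_URL + "?" + urlencode(params)


# Placeholder IDs, for illustration only.
print(efetch_url("nucleotide", ["ID1", "ID2"], "fasta", "xml"))
```

Leaving mode unset corresponds to the rows of the list that have no -mode entry.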

--db NCBI_database - NCBI database from which to retrieve sequences. The available databases are listed in the EDirect documentation.

--sep separator - Character used for delimiting GI or ACCESSION numbers in infile. Default is comma (,). This is usually only needed if more than one ID is on a line.

infile contains a list of IDs, one per line. Comments are lines beginning with hash symbols (#) and can be placed anywhere in the file. Example:

# BLASTN 2.2.26+
# Query:
# RID: TY6949DZ014
# Database: nr
# Fields: subject gi
# 6 hits found
# BLAST processed 1 queries
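
The parsing rules above (comment lines, one or more IDs per line, a configurable separator) can be sketched as follows. This is a minimal illustration under those stated rules, not the code used by seqfetch.py, and the function name read_ids is hypothetical.

```python
def read_ids(lines, sep=","):
    """Yield IDs from an iterable of lines, skipping comments and blanks."""
    for line in lines:
        line = line.strip()
        # Lines beginning with '#' are comments; empty lines are ignored.
        if not line or line.startswith("#"):
            continue
        # A line may hold several IDs delimited by sep.
        for field in line.split(sep):
            field = field.strip()
            if field:
                yield field
```

A file object can be passed directly, for example list(read_ids(open("infile"))), since iterating over a file yields its lines.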

PYTHONPATH (required) - Path to BioPython. seqfetch sets PYTHONPATH to a platform-specific directory containing BioPython, and then runs seqfetch.py. If you run seqfetch.py directly, you need to set PYTHONPATH manually.

BL_EMAIL (required) - Email address to accompany requests to NCBI Entrez. Required by NCBI.

NCBI_ENTREZ_KEY (optional) - Unique identifier for requests to NCBI Entrez. If no key is supplied, you may get slower retrieval times. If you make a large number of requests (e.g. more than 3 per second) you must supply a key, or NCBI will throttle your future requests. See NCBI Eutil API keys.
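
A client can stay under the request limit by pausing between calls. The sketch below assumes NCBI's published limits of 3 requests per second without a key and 10 per second with one; the names min_interval and Throttle are hypothetical and are not part of seqfetch.py or BioPython.

```python
import time


def min_interval(api_key=None):
    """Seconds to leave between requests: 3/s without a key, 10/s with one."""
    return 1.0 / (10 if api_key else 3)


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling wait() immediately before each request, with Throttle(min_interval(key)), spaces the requests by at least the chosen interval.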

New API Keys for the E-utilities

NCBI Entrez Help Manual at http://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Entrez_Searching_Options.

NCBI E-utilities Quick Start at http://www.ncbi.nlm.nih.gov/books/NBK25500

BioPython Bio.Entrez at http://biopython.org/DIST/docs/api/Bio.Entrez-module.html

1. Retrieval of large numbers of sequences from NCBI in a single job may not always work. We should revise seqfetch.py to break up retrievals into chunks of perhaps 500 or 1000 sequences at a time.
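
The chunking proposed above could look something like the sketch below. It only splits the ID list; each batch would then be retrieved in its own request. The function name chunks and the default size of 500 are illustrative choices, not existing seqfetch.py behaviour.

```python
def chunks(ids, size=500):
    """Yield successive lists of at most size IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

For a list of 1200 IDs and the default size, this yields batches of 500, 500 and 200 IDs.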

Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB  Canada R3T 2N2