seqfetch.py - retrieve sequences from NCBI

Updated July 18, 2020

NAME

seqfetch.py - Given a set of GI (UID) or ACCESSION numbers, create a file containing the corresponding sequence entries

SYNOPSIS

seqfetch.py --infile infile --outfile outfile [--query entrez_query_statement] [--format sequence_format] [--db NCBI_database] [--sep separator]
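
As an illustration of the synopsis above, the options could be declared with argparse roughly as follows. This is a minimal sketch, not the actual seqfetch.py source: the function name build_parser and the file names in the usage line are hypothetical, and only the comma default for --sep is documented here, so the other defaults are left as None.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the SYNOPSIS line of seqfetch.py."""
    parser = argparse.ArgumentParser(prog="seqfetch.py")
    parser.add_argument("--infile", required=True,
                        help="file of GI or ACCESSION numbers")
    parser.add_argument("--outfile", required=True,
                        help="file to receive the sequence entries")
    # Defaults for the next three options are not documented; None is an assumption.
    parser.add_argument("--query", default=None, help="Entrez query statement")
    parser.add_argument("--format", default=None, help="sequence format, e.g. fasta")
    parser.add_argument("--db", default=None, help="NCBI database, e.g. nuccore")
    # The comma default is the one stated under OPTIONS.
    parser.add_argument("--sep", default=",", help="ID delimiter used in infile")
    return parser


# Example invocation with hypothetical file names.
args = build_parser().parse_args(
    ["--infile", "hits.gi", "--outfile", "hits.fasta", "--format", "fasta"]
)
print(args.infile, args.outfile, args.format, repr(args.sep))
```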

DESCRIPTION

seqfetch reads infile, which contains one or more DNA, RNA or protein IDs from NCBI databases. IDs can be either GI numbers or ACCESSION numbers. Sequences are retrieved from NCBI through Entrez and written to outfile.
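
To make the retrieval step concrete, the sketch below builds the kind of E-utilities efetch request that a retrieval like this comes down to. It is an illustration only: seqfetch.py itself goes through BioPython, the function name efetch_url is hypothetical, and the default database and format shown are assumptions. No network request is made.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(ids, db="nuccore", rettype="fasta", retmode="text"):
    """Return an efetch URL for a list of GI or ACCESSION numbers.

    The defaults are assumptions chosen for illustration only.
    """
    params = {
        "db": db,
        "id": ",".join(ids),   # efetch accepts a comma-separated ID list
        "rettype": rettype,    # corresponds to the -format column under OPTIONS
        "retmode": retmode,    # corresponds to the -mode column under OPTIONS
    }
    return EFETCH_URL + "?" + urlencode(params)


print(efetch_url(["508843", "4585272"]))
```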

If the environment variables $BL_EMAIL and $NCBI_ENTREZ_KEY are set, all requests to the Entrez E-utilities are made using the user's Entrez API key.

OPTIONS

Note: seqfetch.py is usually run by a helper script, seqfetch, which sets PYTHONPATH as described under ENVIRONMENT.

--query entrez_query_statement - an Entrez query statement as described in the NCBI Entrez Help.

Note: This option will only work if infile contains GI numbers. The current implementation of NCBI's epost, which is needed for a query, does not support ACCESSION numbers.

Example:

--query '1:250000[SLEN]'

would retrieve sequences less than or equal to 250,000 bp in length. This might be useful if you were only interested in sequences up to the size of BAC inserts, and not in complete chromosomes.

--format sequence_format - sequence format for output, as described in the EDirect Appendices.

Formats may include

    -format         -mode    Report Type
    ____________    _____    _____________________________
    acc                      Accession Number
    est                      EST Report
    fasta                    FASTA
    fasta           xml      TinySeq XML
    fasta_cds_aa             FASTA of CDS Products
    fasta_cds_na             FASTA of Coding Regions
    ft                       Feature Table
    gb                       GenBank Flatfile
    gb              xml      GBSet XML
    gbc             xml      INSDSet XML
    gbwithparts              GenBank with Contig Sequences
    gp                       GenPept Flatfile
    gp              xml      GBSet XML
    gpc             xml      INSDSet XML
    gss                      GSS Report
    native          text     Seq-entry ASN.1
    native          xml      Bioseq-set XML
    seqid                    Seq-id ASN.1

--db NCBI_database - NCBI database from which to retrieve sequences. As described in the EDirect documentation, databases may include

    protein
    nuccore
    nucleotide
    nucgss
    nucest

--sep separator - Character used to delimit GI or ACCESSION numbers in infile. Default is comma (,). This is usually needed only if more than one ID is on a line.

INPUT

infile contains a list of IDs, usually one per line (see --sep for lines that hold several IDs). Comment lines begin with a hash symbol (#) and can be placed anywhere in the file. Example:

# BLASTN 2.2.26+
# Query:
# RID: TY6949DZ014
# Database: nr
# Fields: subject gi
# 6 hits found
508843
4585272
169079
388521786
502139117
356527659
# BLAST processed 1 queries
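
The input rules above (skip comment lines, split each remaining line on the separator) can be sketched as follows. This is a minimal illustration under the assumptions stated in this section, not the parser used by seqfetch.py; the function name parse_ids is hypothetical.

```python
def parse_ids(lines, sep=","):
    """Collect IDs from an iterable of lines, ignoring blanks and # comments."""
    ids = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        # A line may hold several IDs delimited by sep (comma by default).
        ids.extend(field.strip() for field in line.split(sep) if field.strip())
    return ids


sample = """# Fields: subject gi
508843
4585272,169079
# BLAST processed 1 queries
"""
print(parse_ids(sample.splitlines()))
```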

ENVIRONMENT

PYTHONPATH (required) - Path to BioPython. The seqfetch helper script sets PYTHONPATH to a platform-specific directory containing BioPython, and then runs seqfetch.py. If you run seqfetch.py directly, you need to set PYTHONPATH manually.

BL_EMAIL (required) - Email address to accompany requests to NCBI Entrez. Required by NCBI.

NCBI_ENTREZ_KEY (optional) - API key that identifies your requests to NCBI Entrez. If no key is supplied, retrieval may be slower. If you make a large number of requests (e.g. more than 3 per second), you must supply a key, or NCBI will throttle your future requests. See "New API Keys for the E-utilities" under REFERENCES.
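
One way to read these two variables is sketched below. It is an illustration rather than the code in seqfetch.py: the function name entrez_credentials is hypothetical, and raising an error when BL_EMAIL is missing is an assumption based on the "required" label above. The keys email and api_key are the parameter names used by the E-utilities.

```python
import os


def entrez_credentials(environ=os.environ):
    """Return the Entrez request parameters derived from the environment."""
    email = environ.get("BL_EMAIL")
    if not email:
        # Assumed behaviour: BL_EMAIL is documented as required.
        raise RuntimeError("BL_EMAIL must be set to an email address")
    creds = {"email": email}
    key = environ.get("NCBI_ENTREZ_KEY")
    if key:
        creds["api_key"] = key  # optional; omitted when no key is set
    return creds


# A plain dict stands in for os.environ so the example has no side effects.
print(entrez_credentials({"BL_EMAIL": "user@example.org"}))
```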

REFERENCES

New API Keys for the E-utilities
https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/

NCBI Entrez Help Manual at http://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Entrez_Searching_Options

NCBI E-utilities Quick Start at http://www.ncbi.nlm.nih.gov/books/NBK25500

BioPython Bio.Entrez at http://biopython.org/DIST/docs/api/Bio.Entrez-module.html

BUGS

1. Retrieval of large numbers of sequences from NCBI in a single job may not always work. We should revise seqfetch.py to break up retrievals into chunks of perhaps 500 or 1000 sequences at a time.
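
The chunking proposed in item 1 could look like the sketch below. This is not implemented in seqfetch.py; the function name chunked is hypothetical, and the batch size of 500 is simply the lower figure suggested above.

```python
def chunked(ids, size=500):
    """Yield successive batches of at most `size` IDs from a list."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# 1200 placeholder IDs split into batches; each batch would be one retrieval.
ids = [str(n) for n in range(1, 1201)]
print([len(batch) for batch in chunked(ids)])
```

Each batch would then be sent as a separate request, so a failure affects at most one batch rather than the whole job.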

AUTHOR

Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
frist@cc.umanitoba.ca
http://home.cc.umanitoba.ca/~frist5