Updated July 18, 2020

taxfetch.py - Given a set of LOCUS or ACCESSION numbers, create a file containing the corresponding NCBI taxonomy data, for use by the forester decorator program.

taxfetch.py --infile infile --db NCBI_database --tablefile tablefile [--sep separator]

taxfetch reads infile, which contains one or more DNA, RNA or protein IDs from NCBI databases. IDs can be either LOCUS names or ACCESSION numbers. The taxonomy data for each ID are retrieved from NCBI using NCBI Entrez, and written to tablefile.

If the environment variables $BL_EMAIL and $NCBI_ENTREZ_KEY are set, all requests to the Entrez Eutils will be processed using the user's Entrez API key.

Note: taxfetch.py is usually run by a helper script, taxfetch. taxfetch sets PYTHONPATH, as described below.

--infile - a file containing a single Accession number or Locus name on each line. This file can be a TAB-separated value (.tsv) file in which the leftmost field holds the Accession number or Locus name. Any other columns in such a file are ignored.

XM_009119191    PREDICTED: Brassica rapa dirigent protein 4 (LOC103842557), mRNA
XM_009119192    PREDICTED: Brassica rapa dirigent protein 23 (LOC103842558), mRNA
XM_009114648    PREDICTED: Brassica rapa dirigent protein 5 (LOC103838226), mRNA
XM_009109941    PREDICTED: Brassica rapa dirigent protein 12-like (LOC103833894), mRNA
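The way the leftmost field is taken from each line can be pictured with a minimal sketch. This is not the code used by taxfetch.py; the function name read_ids is hypothetical, and the sketch assumes that fields are separated by TAB characters and that blank lines are skipped.

```python
def read_ids(lines):
    """Return the leftmost TAB-separated field of each non-blank line.

    Hypothetical helper for illustration; not taken from taxfetch.py.
    """
    ids = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Everything after the first TAB is ignored.
        ids.append(line.split("\t")[0].strip())
    return ids


example = [
    "XM_009119191\tPREDICTED: Brassica rapa dirigent protein 4 (LOC103842557), mRNA\n",
    "\n",
    "XM_009119192\n",
]
ids = read_ids(example)
print(ids)
```

Lines with only an ID and lines with extra description columns are handled the same way.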

--db NCBI_database - NCBI database from which to retrieve sequences. As described in the Edirect documentation, databases may include
--tablefile tablefile - write taxonomy information to a file for use by the forester decorator program. This file is used to add annotation information to a phylogenetic tree in phyloXML format.


XM_009119191    TAXONOMY_CODE:3711    TAXONOMY_ID:3711    TAXONOMY_SN:Brassica rapa    SEQ_ACCESSION:XM_009119191
XM_009119192    TAXONOMY_CODE:3711    TAXONOMY_ID:3711    TAXONOMY_SN:Brassica rapa    SEQ_ACCESSION:XM_009119192
XM_009114648    TAXONOMY_CODE:3711    TAXONOMY_ID:3711    TAXONOMY_SN:Brassica rapa    SEQ_ACCESSION:XM_009114648
XM_009109941    TAXONOMY_CODE:3711    TAXONOMY_ID:3711    TAXONOMY_SN:Brassica rapa    SEQ_ACCESSION:XM_009109941
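As a rough illustration of the table layout shown above, the sketch below builds one such line from a sequence ID, a taxonomy ID and a scientific name. The function name forester_row is hypothetical. The sketch assumes that fields are TAB-separated and, mirroring the example, that TAXONOMY_CODE carries the same value as TAXONOMY_ID; the actual taxfetch.py may differ.

```python
def forester_row(seq_id, tax_id, sci_name):
    """Build one table line in the layout of the example above.

    Hypothetical helper; assumes TAB-separated fields and that
    TAXONOMY_CODE repeats the taxonomy ID, as in the example.
    """
    fields = [
        seq_id,
        f"TAXONOMY_CODE:{tax_id}",
        f"TAXONOMY_ID:{tax_id}",
        f"TAXONOMY_SN:{sci_name}",
        f"SEQ_ACCESSION:{seq_id}",
    ]
    return "\t".join(fields)


row = forester_row("XM_009119191", "3711", "Brassica rapa")
print(row)
```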
--sep separator - Character used for delimiting GID or Accession numbers in infile. Default is comma (,). This is usually only needed if more than one GID is on a line.

PYTHONPATH (required) - Path to BioPython. The taxfetch helper script sets PYTHONPATH to a platform-specific directory containing BioPython, and then runs taxfetch.py. If you run taxfetch.py directly, you need to set PYTHONPATH manually.

BL_EMAIL (required) - Email address to accompany requests to NCBI Entrez. Required by NCBI.

NCBI_ENTREZ_KEY (optional) - Unique identifier for requests to NCBI Entrez. If no key is supplied, you may get slower retrieval times. If you make a large number of requests (e.g. more than 3 per second) you must supply a key, or NCBI will throttle your future requests. See NCBI Eutil API keys.
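The role of the two environment variables can be sketched as follows. This is an illustration only, not the code in taxfetch.py: the function name entrez_params is hypothetical. The sketch assumes the values are passed to the E-utilities as the email and api_key request parameters, with the key included only when NCBI_ENTREZ_KEY is set.

```python
import os


def entrez_params(env=None):
    """Collect Entrez request parameters from environment variables.

    Hypothetical helper for illustration. BL_EMAIL is required;
    NCBI_ENTREZ_KEY is added as api_key only when it is set.
    """
    if env is None:
        env = os.environ
    email = env.get("BL_EMAIL")
    if not email:
        raise RuntimeError("BL_EMAIL must be set for NCBI Entrez requests")
    params = {"email": email}
    key = env.get("NCBI_ENTREZ_KEY")
    if key:
        params["api_key"] = key
    return params


# Example with an explicit mapping instead of the real environment.
print(entrez_params({"BL_EMAIL": "user@example.org"}))
```

With BioPython, the same two values would typically be assigned to Entrez.email and Entrez.api_key before any request is made.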

New API Keys for the E-utilities

NCBI Entrez Help Manual at http://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Entrez_Searching_Options.

NCBI E-utilities Quick Start at http://www.ncbi.nlm.nih.gov/books/NBK25500

BioPython Bio.Entrez at http://biopython.org/DIST/docs/api/Bio.Entrez-module.html

1. Unfortunately, it is necessary to post a separate Entrez request for each ID given. The reason is that batch requests return taxonomy XML objects that do not identify the corresponding sequence ID, so there is no way to tell which ID corresponds to a particular TaxID. This makes the process slow.
2. The current version only writes some of the fields supported by forester for table output, as shown in the example above. However, note that archaeopteryx can add much of this information, given at least an accession number and a taxonomy ID.
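The one-request-per-ID approach in note 1 can be pictured as a simple loop. The sketch below is not the taxfetch.py implementation: fetch_taxonomy, lookup and fake_lookup are hypothetical names, and the real Entrez call is replaced by a stand-in function so that the example runs without network access. Because each lookup is made for exactly one ID, the result can always be paired with the ID that produced it.

```python
import time


def fetch_taxonomy(ids, lookup, delay=0.0):
    """Call lookup(seq_id) once per ID and pair each result with its ID.

    lookup is a caller-supplied function (hypothetical here) that
    returns a (tax_id, scientific_name) tuple for a single ID.
    delay is an optional pause in seconds between requests.
    """
    results = {}
    for seq_id in ids:
        results[seq_id] = lookup(seq_id)
        if delay:
            time.sleep(delay)  # space out requests to the server
    return results


def fake_lookup(seq_id):
    # Stand-in for a real Entrez request; always returns the same taxon.
    return ("3711", "Brassica rapa")


table = fetch_taxonomy(["XM_009119191", "XM_009119192"], fake_lookup)
print(table)
```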

Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB  Canada R3T 2N2