Updated July 18, 2020
NAME

taxfetch.py - Given a set of LOCUS names or ACCESSION numbers, create a file containing the corresponding NCBI taxonomy data, for use by the forester decorator program.

SYNOPSIS
taxfetch.py --infile infile --db NCBI_database --tablefile tablefile [--sep separator]

DESCRIPTION
taxfetch.py reads infile, which contains one or more DNA, RNA or protein IDs from NCBI databases. IDs can be either LOCUS names or ACCESSION numbers. The taxonomy data for each ID are retrieved from NCBI using NCBI Entrez, and written to tablefile.

If the environment variables $BL_EMAIL and $NCBI_ENTREZ_KEY are set, all requests to the Entrez Eutils will be processed using the user's Entrez API key.

Note: taxfetch.py is usually run by a helper script, taxfetch, which sets PYTHONPATH as described below.

OPTIONS
--infile infile - a file containing a single Accession number or Locus name on each line. This file can be a TAB-separated value (.tsv) file, in which the leftmost field holds the Accession number or Locus name. Any other columns in such a file are ignored.

Example:
XM_009119191    PREDICTED: Brassica rapa dirigent protein 4 (LOC103842557), mRNA
XM_009119192    PREDICTED: Brassica rapa dirigent protein 23 (LOC103842558), mRNA
XM_009114648    PREDICTED: Brassica rapa dirigent protein 5 (LOC103838226), mRNA
XM_009109941    PREDICTED: Brassica rapa dirigent protein 12-like (LOC103833894), mRNA
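
The rule above (use the leftmost TAB-separated field, ignore the rest) can be sketched as follows. This is a minimal illustration, not the code in taxfetch.py: the function name read_ids is hypothetical, and the way it applies a separator within the leftmost field is an assumption about how --sep interacts with TSV input.

```python
import io

def read_ids(handle, sep=","):
    """Collect IDs from the leftmost TAB-separated field of each line.

    Hypothetical helper: the name and the handling of `sep` are assumptions.
    """
    ids = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Only the leftmost field matters; other columns are ignored.
        first_field = line.split("\t", 1)[0]
        # A field may hold several IDs delimited by `sep`.
        for token in first_field.split(sep):
            token = token.strip()
            if token:
                ids.append(token)
    return ids

sample = io.StringIO(
    "XM_009119191\tPREDICTED: Brassica rapa dirigent protein 4 (LOC103842557), mRNA\n"
    "XM_009119192\tPREDICTED: Brassica rapa dirigent protein 23 (LOC103842558), mRNA\n"
)
print(read_ids(sample))  # ['XM_009119191', 'XM_009119192']
```

Because only the leftmost field is split, the comma inside the description column does not produce spurious IDs.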

--db NCBI_database - NCBI database in which to look up the IDs from infile. As described in the EDirect documentation, databases may include:
    protein
    nuccore
    nucleotide
    nucgss
    nucest

--tablefile tablefile - write taxonomy information to a file for use by the forester decorator program. This file is used to add annotation information to a phylogenetic tree in phyloXML format.

Example:

XM_009119191    TAXONOMY_CODE:3711    TAXONOMY_ID:3711    TAXONOMY_SN:Brassica rapa    SEQ_ACCESSION:XM_009119191
XM_009119192    TAXONOMY_CODE:3711    TAXONOMY_ID:3711    TAXONOMY_SN:Brassica rapa    SEQ_ACCESSION:XM_009119192
XM_009114648    TAXONOMY_CODE:3711    TAXONOMY_ID:3711    TAXONOMY_SN:Brassica rapa    SEQ_ACCESSION:XM_009114648
XM_009109941    TAXONOMY_CODE:3711    TAXONOMY_ID:3711    TAXONOMY_SN:Brassica rapa    SEQ_ACCESSION:XM_009109941
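
One way to build a row in the layout shown above is sketched here. The function name format_row is hypothetical, the TAB delimiter is an assumption, and reusing the taxonomy ID for TAXONOMY_CODE simply mirrors the example rows rather than any documented rule.

```python
def format_row(seq_id, tax_id, sci_name):
    """Build one table line in the layout of the example above.

    Assumptions: fields are TAB-delimited, and TAXONOMY_CODE repeats the
    taxonomy ID, as it does in the example rows.
    """
    fields = [
        seq_id,
        f"TAXONOMY_CODE:{tax_id}",
        f"TAXONOMY_ID:{tax_id}",
        f"TAXONOMY_SN:{sci_name}",
        f"SEQ_ACCESSION:{seq_id}",
    ]
    return "\t".join(fields)

print(format_row("XM_009119191", 3711, "Brassica rapa"))
```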

--sep separator - Character that delimits GIDs or Accession numbers in infile. Default is comma (,). This is usually needed only when a line holds more than one ID.
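
The options above map naturally onto Python's argparse module. The sketch below is an illustration of the command-line interface described in SYNOPSIS, not the parser used in taxfetch.py; the file names ids.tsv and taxa.tsv are hypothetical.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="taxfetch.py")
    parser.add_argument("--infile", required=True,
                        help="file with one Accession number or Locus name per line")
    parser.add_argument("--db", required=True,
                        help="NCBI database, e.g. protein or nuccore")
    parser.add_argument("--tablefile", required=True,
                        help="output table for the forester decorator program")
    parser.add_argument("--sep", default=",",
                        help="delimiter between IDs on a line (default: comma)")
    return parser

# Hypothetical file names, for illustration only.
args = build_parser().parse_args(
    ["--infile", "ids.tsv", "--db", "nuccore", "--tablefile", "taxa.tsv"]
)
print(args.db, repr(args.sep))
```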

ENVIRONMENT
PYTHONPATH (required) - Path to BioPython. The taxfetch helper script sets PYTHONPATH to a platform-specific directory containing BioPython, and then runs taxfetch.py. If you run taxfetch.py directly, you need to set PYTHONPATH manually.

BL_EMAIL (required) - Email address to accompany requests to NCBI Entrez. Required by NCBI.

NCBI_ENTREZ_KEY (optional) - API key that identifies your requests to NCBI Entrez. If no key is supplied, retrieval may be slower. If you make a large number of requests (e.g. more than 3 per second) you must supply a key, or NCBI will throttle your future requests. See "New API Keys for the E-utilities" under REFERENCES.
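
A minimal sketch of how a script might read these two variables and choose a request spacing is given below. The function name entrez_settings is hypothetical. The rates reflect NCBI's published limits of 3 requests per second without a key and 10 per second with one.

```python
import os

def entrez_settings(environ=os.environ):
    """Read Entrez credentials and derive a minimum delay between requests.

    Hypothetical helper; the returned values could be assigned to
    Bio.Entrez.email and Bio.Entrez.api_key before making requests.
    """
    email = environ.get("BL_EMAIL")
    if not email:
        raise RuntimeError("BL_EMAIL must be set; NCBI requires an email address")
    api_key = environ.get("NCBI_ENTREZ_KEY") or None
    # NCBI limits: 3 requests/second without a key, 10 with a key.
    max_per_second = 10 if api_key else 3
    return {
        "email": email,
        "api_key": api_key,
        "min_interval": 1.0 / max_per_second,
    }

# Example with an explicit mapping instead of the real environment.
print(entrez_settings({"BL_EMAIL": "user@example.org"}))
```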

REFERENCES
New API Keys for the E-utilities at https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/

NCBI Entrez Help Manual at http://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Entrez_Searching_Options

NCBI E-utilities Quick Start at http://www.ncbi.nlm.nih.gov/books/NBK25500

BioPython Bio.Entrez at http://biopython.org/DIST/docs/api/Bio.Entrez-module.html

BUGS
1. Unfortunately, it is necessary to post a separate Entrez request for each ID given. The reason is that batch requests return taxonomy XML objects that do not identify the corresponding sequence ID, so there is no way to tell which ID corresponds to a particular TaxID. This makes the process slow.
2. The current version writes only some of the fields supported by forester for table output, as shown in the example above. However, note that archaeopteryx can add much of this information, given at least an accession number and a taxonomy ID.
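
The one-request-per-ID pattern described in item 1 can be sketched as follows. Issuing one lookup per ID keeps each result paired with the ID that produced it. The names fetch_taxonomy_per_id and fake_lookup are hypothetical, and fake_lookup stands in for a real Entrez request, which this sketch does not perform.

```python
import time

def fetch_taxonomy_per_id(ids, lookup, min_interval=0.0):
    """Call lookup(seq_id) once per ID so each result stays tied to its ID.

    `lookup` is a hypothetical callable standing in for an Entrez request.
    """
    results = {}
    for i, seq_id in enumerate(ids):
        if i and min_interval:
            time.sleep(min_interval)  # pause between consecutive requests
        results[seq_id] = lookup(seq_id)
    return results

def fake_lookup(seq_id):
    # Stand-in that returns a fixed (TaxID, scientific name) pair.
    return ("3711", "Brassica rapa")

table = fetch_taxonomy_per_id(["XM_009119191", "XM_009119192"], fake_lookup)
print(table["XM_009119191"])  # ('3711', 'Brassica rapa')
```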

AUTHOR
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB  Canada R3T 2N2
brian.fristensky@umanitoba.ca
http://home.cc.umanitoba.ca/~frist