update July 18, 2020
NAME
taxfetch.py - Given a set of LOCUS names or ACCESSION numbers,
create a file containing the corresponding NCBI taxonomy data, for
use by the forester decorator program.
SYNOPSIS
taxfetch.py --infile infile --db NCBI_database
--tablefile tablefile [--sep separator]
DESCRIPTION
taxfetch reads infile, containing one or more DNA, RNA or protein
IDs from NCBI databases. IDs can be either LOCUS names or ACCESSION
numbers. The taxonomy data for each ID are retrieved from NCBI
using Entrez, and written to tablefile.
If the environment variables $BL_EMAIL and $NCBI_ENTREZ_KEY are
set, all requests to the Entrez Eutils will be processed using the
user's Entrez API key.
Note: taxfetch.py is usually run by a helper script, taxfetch.
taxfetch sets PYTHONPATH, as described below.
OPTIONS
--infile infile - a file containing a single Accession number or
Locus name on each line. This file can be a TAB-separated value
(.tsv) file in which the leftmost field holds the Accession number
or Locus name. Any other columns in such a file are ignored.
Example:
XM_009119191	PREDICTED: Brassica rapa dirigent protein 4 (LOC103842557), mRNA
XM_009119192	PREDICTED: Brassica rapa dirigent protein 23 (LOC103842558), mRNA
XM_009114648	PREDICTED: Brassica rapa dirigent protein 5 (LOC103838226), mRNA
XM_009109941	PREDICTED: Brassica rapa dirigent protein 12-like (LOC103833894), mRNA
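The following is a minimal sketch of how the leftmost field could be
pulled out of such a file. The function name read_ids is hypothetical
and is not taken from taxfetch.py; skipping blank lines is also an
assumption.

```python
def read_ids(lines):
    """Return the leftmost TAB-separated field of each non-blank line."""
    ids = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # assumption: blank lines are skipped
        # Only the first column is kept; all other columns are ignored.
        ids.append(line.split("\t", 1)[0].strip())
    return ids


sample = [
    "XM_009119191\tPREDICTED: Brassica rapa dirigent protein 4 (LOC103842557), mRNA\n",
    "XM_009119192\tPREDICTED: Brassica rapa dirigent protein 23 (LOC103842558), mRNA\n",
    "\n",
    "XM_009114648\n",
]
print(read_ids(sample))
```

A file with one bare ID per line and a .tsv file with extra columns
are handled the same way, since only the text before the first TAB is
used.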
--db NCBI_database - NCBI database from which to retrieve
sequences. As described in the EDirect documentation, databases may
include:
protein
nuccore
nucleotide
nucgss
nucest
--tablefile tablefile - write taxonomy
information to a file for use by the forester decorator
program. This file is used to add annotation information to a
phylogenetic tree in phyloXML format.
Example:
XM_009119191	TAXONOMY_CODE:3711	TAXONOMY_ID:3711	TAXONOMY_SN:Brassica rapa	SEQ_ACCESSION:XM_009119191
XM_009119192	TAXONOMY_CODE:3711	TAXONOMY_ID:3711	TAXONOMY_SN:Brassica rapa	SEQ_ACCESSION:XM_009119192
XM_009114648	TAXONOMY_CODE:3711	TAXONOMY_ID:3711	TAXONOMY_SN:Brassica rapa	SEQ_ACCESSION:XM_009114648
XM_009109941	TAXONOMY_CODE:3711	TAXONOMY_ID:3711	TAXONOMY_SN:Brassica rapa	SEQ_ACCESSION:XM_009109941
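As an illustration of the record layout in the example, one line of
the table could be assembled as sketched here. The function name
table_line is hypothetical, and the use of TAB as the field delimiter
is an assumption (the scientific name contains a space, so a space
cannot delimit fields). As in the example, the TaxID is written to
both TAXONOMY_CODE and TAXONOMY_ID.

```python
def table_line(accession, taxid, sci_name):
    """Build one table record from an ID, a TaxID and a scientific name."""
    fields = [
        accession,                      # key used to find the tree node
        f"TAXONOMY_CODE:{taxid}",
        f"TAXONOMY_ID:{taxid}",
        f"TAXONOMY_SN:{sci_name}",
        f"SEQ_ACCESSION:{accession}",
    ]
    # Assumption: fields are TAB-delimited.
    return "\t".join(fields)


print(table_line("XM_009119191", 3711, "Brassica rapa"))
```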
--sep separator - Character used for
delimiting GID or Accession numbers in infile. Default is
comma (,). This is usually only needed if more than one GID is on
a line.
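A short sketch of how a line holding several IDs might be split on
the separator. The function name split_ids is hypothetical, and
dropping empty tokens and surrounding whitespace is an assumption
rather than documented behaviour of taxfetch.py.

```python
def split_ids(line, sep=","):
    """Split one input line into IDs using the given separator."""
    # Assumption: whitespace around IDs and empty tokens are discarded.
    return [tok.strip() for tok in line.split(sep) if tok.strip()]


print(split_ids("XM_009119191,XM_009119192"))
print(split_ids("XM_009114648; XM_009109941", sep=";"))
```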
ENVIRONMENT
PYTHONPATH (required) - Path to BioPython. The taxfetch helper
script sets PYTHONPATH to a platform-specific directory containing
BioPython, and then runs taxfetch.py. If you run taxfetch.py
directly, you need to set PYTHONPATH manually.
BL_EMAIL (required) - Email address to accompany requests to NCBI
Entrez. Required by NCBI.
NCBI_ENTREZ_KEY (optional) - API key identifying your requests to
NCBI Entrez. If no key is supplied, retrieval may be slower. If you
send a large number of requests (e.g. more than 3 per second) you
must supply a key, or NCBI will throttle your future requests. See
"New API Keys for the E-utilities" under REFERENCES.
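The sketch below shows one way the two Entrez variables could be
turned into E-utilities request parameters. It uses only the Python
standard library and is not the code in taxfetch.py, which uses
BioPython; with Bio.Entrez the corresponding settings are
Entrez.email and Entrez.api_key. The function names entrez_params and
esummary_url are hypothetical, and esummary.fcgi is used only as an
example endpoint.

```python
import os
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def entrez_params(db, seq_id, env=os.environ):
    """Build E-utilities query parameters from the environment."""
    email = env.get("BL_EMAIL")
    if not email:
        raise RuntimeError("BL_EMAIL must be set")
    params = {"db": db, "id": seq_id, "email": email}
    key = env.get("NCBI_ENTREZ_KEY")
    if key:
        # The key is optional; it is sent only when one is defined.
        params["api_key"] = key
    return params


def esummary_url(db, seq_id, env=os.environ):
    """Return an example esummary URL for a single ID."""
    return EUTILS_BASE + "esummary.fcgi?" + urlencode(entrez_params(db, seq_id, env))


demo_env = {"BL_EMAIL": "user@example.org"}  # placeholder address
print(esummary_url("nuccore", "XM_009119191", env=demo_env))
```

Passing the environment as an argument keeps the functions easy to
try out without changing the real shell environment.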
REFERENCES
New API Keys for the E-utilities at https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
NCBI Entrez Help Manual at http://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Entrez_Searching_Options
NCBI E-utilities Quick Start at http://www.ncbi.nlm.nih.gov/books/NBK25500
BioPython Bio.Entrez at http://biopython.org/DIST/docs/api/Bio.Entrez-module.html
BUGS
1. Unfortunately, it is necessary to post a separate Entrez
request for each ID given. The reason is that batch requests
return taxonomy XML objects that don't tell you the corresponding
sequence ID, so there is no way to tell which ID corresponds to a
particular TaxID. This makes the process slow.
2. The current version only writes some of the fields supported by
forester for table output, as shown in the example above. However,
note that archaeopteryx can add a lot of this information, given
at least an accession number and a taxonomy id.
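The one-request-per-ID approach described in item 1 could be sketched
as a simple paced loop. This is an illustration, not the code in
taxfetch.py: fetch_all, fetch_one and max_per_second are hypothetical
names, and the default of 3 requests per second is an assumption
based on NCBI's published limit for access without an API key.

```python
import time


def fetch_all(ids, fetch_one, max_per_second=3,
              sleep=time.sleep, clock=time.monotonic):
    """Call fetch_one(seq_id) once per ID, pacing the calls."""
    min_interval = 1.0 / max_per_second
    results = {}
    last = None
    for seq_id in ids:
        if last is not None:
            # Wait until at least min_interval has passed since the
            # previous request was started.
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        # Each ID is looked up on its own, so the result can always
        # be matched back to the ID that produced it.
        results[seq_id] = fetch_one(seq_id)
    return results


# Stub lookup standing in for a real Entrez request.
demo = fetch_all(["XM_009119191", "XM_009119192"],
                 fetch_one=lambda seq_id: 3711,
                 sleep=lambda seconds: None)
print(demo)
```

Because every request carries exactly one ID, the mapping from ID to
TaxID is unambiguous, at the cost of one round trip per ID.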
AUTHOR
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
brian.fristensky@umanitoba.ca
http://home.cc.umanitoba.ca/~frist