taxfetch.py - retrieve taxonomy data from NCBI

Updated July 18, 2020

NAME

taxfetch.py - Given a set of LOCUS names or ACCESSION numbers, create a file containing the corresponding NCBI taxonomy data, for use by the forester decorator program.

SYNOPSIS

taxfetch.py --infile infile --db NCBI_database --tablefile tablefile [--sep separator]
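
The synopsis above can be illustrated with a minimal argparse sketch. This is not the parser used by taxfetch.py itself; the function name build_parser, the help strings and the sample file names ids.tsv and taxa.txt are hypothetical. Only the option names and the comma default for --sep come from this page.

```python
import argparse

def build_parser():
    """Build a parser mirroring the SYNOPSIS line (illustrative only)."""
    parser = argparse.ArgumentParser(prog="taxfetch.py")
    parser.add_argument("--infile", required=True,
                        help="file with one Accession number or Locus name per line")
    parser.add_argument("--db", required=True,
                        help="NCBI database, e.g. protein or nuccore")
    parser.add_argument("--tablefile", required=True,
                        help="output file for the forester decorator program")
    # --sep is optional; this page gives comma as the default.
    parser.add_argument("--sep", default=",",
                        help="character delimiting IDs on a line")
    return parser

# Hypothetical file names, used only to show how the options are parsed.
args = build_parser().parse_args(
    ["--infile", "ids.tsv", "--db", "nuccore", "--tablefile", "taxa.txt"]
)
print(args.infile, args.db, args.tablefile, repr(args.sep))
```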

DESCRIPTION

taxfetch reads infile, which contains one or more DNA, RNA or protein IDs from NCBI databases. IDs can be either LOCUS names or ACCESSION numbers. The taxonomy data for each ID are retrieved from NCBI using Entrez and written to tablefile.

If the environment variables $BL_EMAIL and $NCBI_ENTREZ_KEY are set, all requests to the Entrez E-utilities are sent with the user's email address and Entrez API key.

Note: taxfetch.py is usually run by a helper script, taxfetch. taxfetch sets PYTHONPATH, as described below.

OPTIONS

--infile - a file containing a single Accession number or Locus name on each line. This file can be a TAB-separated value (.tsv) file in which the leftmost field holds the Accession number or Locus name. Any other columns in the file are ignored.

Example:

XM_009119191 PREDICTED: Brassica rapa dirigent protein 4 (LOC103842557), mRNA
XM_009119192 PREDICTED: Brassica rapa dirigent protein 23 (LOC103842558), mRNA
XM_009114648 PREDICTED: Brassica rapa dirigent protein 5 (LOC103838226), mRNA
XM_009109941 PREDICTED: Brassica rapa dirigent protein 12-like (LOC103833894), mRNA
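
As a rough sketch of how such a file could be read, the following keeps only the leftmost TAB-separated field of each line. The function name read_ids is hypothetical and this is not necessarily how taxfetch.py parses its input. The sample data assume that a TAB separates the ID from the description.

```python
import io

def read_ids(handle):
    """Return the leftmost TAB-separated field of every non-blank line."""
    ids = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Columns after the first TAB are ignored.
        ids.append(line.split("\t")[0].strip())
    return ids

# Two lines from the example, with a TAB assumed between the columns.
sample = io.StringIO(
    "XM_009119191\tPREDICTED: Brassica rapa dirigent protein 4 (LOC103842557), mRNA\n"
    "XM_009119192\tPREDICTED: Brassica rapa dirigent protein 23 (LOC103842558), mRNA\n"
)
ids = read_ids(sample)
print(ids)
```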

--db NCBI_database - NCBI database from which to retrieve sequences. As described in the EDirect documentation, databases may include:

    protein
    nuccore
    nucleotide
    nucgss
    nucest

--tablefile tablefile - write taxonomy information to a file for use by the forester decorator program. This file is used to add annotation information to a phylogenetic tree in phyloXML format.

Example:

XM_009119191 TAXONOMY_CODE:3711 TAXONOMY_ID:3711 TAXONOMY_SN:Brassica rapa SEQ_ACCESSION:XM_009119191
XM_009119192 TAXONOMY_CODE:3711 TAXONOMY_ID:3711 TAXONOMY_SN:Brassica rapa SEQ_ACCESSION:XM_009119192
XM_009114648 TAXONOMY_CODE:3711 TAXONOMY_ID:3711 TAXONOMY_SN:Brassica rapa SEQ_ACCESSION:XM_009114648
XM_009109941 TAXONOMY_CODE:3711 TAXONOMY_ID:3711 TAXONOMY_SN:Brassica rapa SEQ_ACCESSION:XM_009109941
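
The layout of one tablefile line can be sketched as follows. The function name decorator_line is hypothetical, and the TAB field separator is an assumption, since the example above does not show which whitespace character separates the fields. As in the example, the taxonomy ID is used for both TAXONOMY_CODE and TAXONOMY_ID.

```python
def decorator_line(seq_id, taxid, sci_name, sep="\t"):
    """Format one table line in the layout shown in the example above."""
    fields = [
        seq_id,
        f"TAXONOMY_CODE:{taxid}",   # same value as TAXONOMY_ID in the example
        f"TAXONOMY_ID:{taxid}",
        f"TAXONOMY_SN:{sci_name}",
        f"SEQ_ACCESSION:{seq_id}",
    ]
    # The TAB default is an assumption, not taken from taxfetch.py.
    return sep.join(fields)

line = decorator_line("XM_009119191", 3711, "Brassica rapa")
print(line)
```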

--sep separator - Character used for delimiting GID or Accession numbers in infile. Default is comma (,). This is usually only needed if more than one GID is on a line.

ENVIRONMENT

PYTHONPATH (required) - Path to BioPython. taxfetch sets PYTHONPATH to a platform-specific directory containing BioPython, and then runs taxfetch.py. If you run taxfetch.py directly, you need to set PYTHONPATH manually.

BL_EMAIL (required) - Email address to accompany requests to NCBI Entrez. Required by NCBI.

NCBI_ENTREZ_KEY (optional) - API key that identifies your requests to NCBI Entrez. If no key is supplied, retrieval may be slower. If you make a large number of requests (e.g. more than 3 per second), you must supply a key, or NCBI will throttle your future requests. See NCBI Eutil API keys.
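
A minimal sketch of how these two variables might be turned into request parameters is shown below. The function name entrez_params is hypothetical and taxfetch.py may handle this differently. The keys email and api_key are the parameter names the NCBI E-utilities accept. With BioPython, the same values would typically be assigned to Entrez.email and Entrez.api_key.

```python
import os

def entrez_params(environ=os.environ):
    """Collect Entrez identification parameters from the environment."""
    email = environ.get("BL_EMAIL")
    if not email:
        # BL_EMAIL is documented above as required.
        raise RuntimeError("BL_EMAIL must be set to your email address")
    params = {"email": email}
    key = environ.get("NCBI_ENTREZ_KEY")
    if key:
        # The API key is optional; include it only when present.
        params["api_key"] = key
    return params

# Example with a made-up address and no API key.
print(entrez_params({"BL_EMAIL": "user@example.org"}))
```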

REFERENCES

New API Keys for the E-utilities
https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/

NCBI Entrez Help Manual at http://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Entrez_Searching_Options.

NCBI E-utilities Quick Start http://www.ncbi.nlm.nih.gov/books/NBK25500

BioPython Bio.Entrez at http://biopython.org/DIST/docs/api/Bio.Entrez-module.html

BUGS

1. Unfortunately, it is necessary to post a separate Entrez request for each ID given. The reason is that batch requests return taxonomy XML objects that don't tell you the corresponding sequence ID, so there is no way to tell which ID corresponds to a particular TaxID. This makes the process slow.
2. The current version writes only some of the fields supported by forester for table output, as shown in the example above. However, note that Archaeopteryx can add a lot of this information, given at least an accession number and a taxonomy ID.
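
The one-request-per-ID pattern described in item 1 can be sketched as below. This is not the code in taxfetch.py: the names fetch_taxonomy_per_id and fetch_one and the delay value are hypothetical. The fetch_one argument stands in for whatever function posts a single Entrez request, and the pause between requests should be chosen to stay within NCBI's request limits.

```python
import time

def fetch_taxonomy_per_id(ids, fetch_one, delay=0.5, sleep=time.sleep):
    """Call fetch_one once per ID so each result stays tied to its ID."""
    results = {}
    for i, seq_id in enumerate(ids):
        if i > 0:
            sleep(delay)  # pause between consecutive requests
        results[seq_id] = fetch_one(seq_id)
    return results

# Stand-in for a real Entrez lookup, returning fixed data for illustration.
def fake_fetch(seq_id):
    return {"taxid": 3711, "name": "Brassica rapa"}

pauses = []
results = fetch_taxonomy_per_id(
    ["XM_009119191", "XM_009119192"], fake_fetch, sleep=pauses.append
)
print(results)
```

Because each request carries exactly one ID, the returned taxonomy record can be paired with that ID directly, at the cost of one round trip per sequence.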

AUTHOR

Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
brian.fristensky@umanitoba.ca
http://home.cc.umanitoba.ca/~frist