update February 17, 2021
NAME
uniqid.py - Encode sequence names by replacing them with a random numerical code, or decode names by restoring the original names.
SYNOPSIS
uniqid.py [-f list_of_fields] [-s separator] [-nf string] --encode sourcein sourceencoded tsvout
uniqid.py [-f list_of_fields] [-s separator] [-nf string] --encodesame sourcein sourceencoded tsvin
uniqid.py [-f list_of_fields] [-s separator] [-nf string] --decode sourceencoded sourceout tsvin
DESCRIPTION
The Phylip programs truncate sequence names longer than 10
characters. uniqid.py --encode can be run to replace sequence names
with unique numerical codes 9 characters in length. A separate TSV
file is also created that maps the original names to the encoded
names. The encoded file can be used as input for Phylip programs,
and the TSV file can then be used to restore the original names in
the output files.
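The encoding step described above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the code of uniqid.py itself: the function name encode_fasta, the use of Python's random module, the 7-digit number after a 2-character prefix, and the two-column layout (code, then definition line) are all illustrative choices. The sequence names and residues in the sample input are made up.

```python
import random

def encode_fasta(lines, prefix="!%", digits=7, rng=None):
    """Replace each FASTA definition line with a unique code.

    Returns (encoded_lines, mapping), where mapping is a list of
    (code, original_definition_line) pairs in input order.
    Hypothetical helper; not part of uniqid.py.
    """
    rng = rng or random.Random()
    used = set()
    encoded, mapping = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Draw random numbers until one is not already in use.
            while True:
                number = rng.randrange(10 ** (digits - 1), 10 ** digits)
                code = prefix + str(number)
                if code not in used:
                    break
            used.add(code)
            mapping.append((code, line[1:]))
            encoded.append(">" + code)
        else:
            # Sequence lines pass through unchanged.
            encoded.append(line)
    return encoded, mapping

# Made-up input for illustration only.
fasta = [
    ">seq1 first hypothetical protein",
    "MAYCAFAP",
    ">seq2 second hypothetical protein",
    "MAPPSYSI",
]
encoded, mapping = encode_fasta(fasta, rng=random.Random(0))

# One row per sequence: code, TAB, original definition line.
tsv_rows = ["\t".join(pair) for pair in mapping]
```

With a 2-character prefix and 7 digits, every code is 9 characters long, so it fits within the 10-character limit mentioned above.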
IMPORTANT -
While this script was written for use with Phylip files, you can
decode ANY text file produced by any program or series of
programs, as long as the first step was to encode the sequence
names in the original fasta file.
OPTIONS
--encode (default) - The first three filenames on the command line are read as sourcein, the original source file; sourceencoded, the output file in which each description line is replaced with a unique ID; and tsvout, a separated-value file (tab-separated by default) containing the unique identifier and the corresponding definition line.
--encodesame - Encode another file, substituting in the same random names from a previous run using --encode. This makes it possible to encode two or more files using the same random names, so that all output files generated can be decoded with a single tsv file.
The first three filenames on the command line are read as sourcein, the original source file; sourceencoded, the output file in which each description line is replaced with a unique ID generated previously by --encode; and tsvin, the separated-value file from that previous run, containing the unique identifier and the corresponding definition line.
--decode - The first three filenames on the command line are read as sourceencoded, any text file containing unique IDs generated from a previous run using --encode; sourceout, the output file in which each unique ID is replaced by the original name, or the name plus parts of the definition line; and tsvin, the tsv file generated by a previous run using --encode. If options -f, -s and -nf were set during encoding, the same options must be used when decoding.
-f list_of_fields - Similar to -f in the Unix cut command. A comma-separated list of fields to be written to sourceout when decoding files. e.g. -f 1,2 would write out both fields 1 and 2. Default: -f 1
-s separator - The character to use as the separator when parsing a definition line into fields. Default: "\t", a TAB character. e.g. -s "," will create a comma-separated tsv file.
-nf string - (default: !%) One or more characters that begin the unique identifier with which the definition line is replaced. The string should consist of characters that are not expected to be found in the input file. Some Phylip programs appear to translate the underscore character '_' to a blank, which would prevent decoding, so don't use '_' as part of the string.
Note: Phylip programs don't allow special characters such as : [ ] ( ) in sequence names. Never use these characters as part of string.
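The way -f and -s work together can be sketched as below. This is an illustration only, assuming that fields are numbered from 1 as in cut and that the selected fields are rejoined with the same separator. The function names select_fields and parse_field_list are hypothetical, and how uniqid.py treats field numbers beyond the end of a line is not documented here; the sketch simply ignores them. Apart from the accession NP_001149637, the sample definition line is made up.

```python
def parse_field_list(text):
    """Turn a string such as "1,2" into a tuple of integers (1, 2)."""
    return tuple(int(item) for item in text.split(","))

def select_fields(definition, fields=(1,), separator="\t"):
    """Return the chosen 1-based fields of a definition line, rejoined.

    Hypothetical helper; the rejoining behaviour is an assumption.
    """
    parts = definition.split(separator)
    # Ignore field numbers that fall outside the line.
    chosen = [parts[i - 1] for i in fields if 1 <= i <= len(parts)]
    return separator.join(chosen)

# Made-up definition line with three TAB-separated fields.
definition = "NP_001149637\thypothetical protein\tZea mays"

first_only = select_fields(definition)                          # like -f 1
first_two = select_fields(definition, parse_field_list("1,2"))  # like -f 1,2
by_comma = select_fields("a,b,c", (1, 3), separator=",")        # like -s ","
```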
To use these sequences with Phylip, run
phylcnv.py -inf fasta -outf pint test2.uniq.fsa test2.uniq.pint
test2.uniq.pint is a Phylip interleaved sequence file that will
work with any Phylip sequence program:
4 194
!_7662224 MA-------------YCAFAP---SAPPQYSELKMTLYTNKEVYRSGPDQ
!_7203510 MA-------------YCAFAP---SAPPQYSELKMTLYTNKEVYRSGPDQ
!_2958197 MA------------------PPSYSIAPVQSELNMTLY-NKEVY-GGRGT
!_3585127 MA------------------PPSYSIAPVQSELNMTLY-NKEVY-GGRGT
NGV-TITE--RSNMGTTWVFSWPVADGPT--PDANIVGQLQGTSVQVANT
NGV-TITE--RSNMGTTWVFSWPVADGPT--PDANIVGQLQGTSVQVANT
NGVTTLVN--RGPIGTTWVFSWPVTDGPAGGADANVVGHLQGTGVQVATY
NGVTTLVN--RGPIGTTWVFSWPVTDGPAGGADANVVGHLQGTGVQVATY
PVVVYHYSLGLVF-EDKRFNGSTLQIQGTSQINGEWSIVGGTGQLTMAMG
PVVVYHYSLGLVF-EDKRFNGSTLQIQGTSQINGEWSIVGGTGQLTMAMG
PDYMWHYSLGLVFGQGSRFNGSTLQIQGTSKINGEWSIVGGTGELAMAKG
PDYMWHYSLGLVFGQGSRFNGSTLQIQGTSKINGEWSIVGGTGELAMAKG
TVKRTEVVYKDNTRISELKIHAYFAPVN----------------
TVKRTEVVYKDNTRISELKIHAYFAPVN----------------
TVRRTEISYTGNIRISELNIHVLYTPM-----------------
TVRRTEISYTGNIRISELNIHVLYTPM-----------------
If you ran protpars, the resultant parsimony tree would be
written to a file called outtree:
(((!_3585127,!_2958197),!_7203510),!_7662224);
To restore the original sequence names, run
uniqid.py --decode outtree test2.outtree test2.tsv
test2.outtree:
(((ACG44408,NP_001151815),ACG36197),NP_001149637);
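The decoding of the tree above amounts to a search-and-replace of each code by its original name. The following is a minimal sketch of that idea, not the actual uniqid.py code: the function name decode_text and the use of a regular expression are assumptions. The mapping is inferred from the positions of the names in the two trees shown above; in real use it would be read from test2.tsv, whose exact column layout is not shown here.

```python
import re

def decode_text(text, mapping):
    """Replace every known code in text with its original name.

    Hypothetical helper; mapping is a dict of code -> original name.
    """
    if not mapping:
        return text
    # Try longer codes first so no code is matched as a prefix of another.
    codes = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(code) for code in codes))
    return pattern.sub(lambda match: mapping[match.group(0)], text)

# Code -> name pairs inferred from the encoded and decoded trees above.
mapping = {
    "!_7662224": "NP_001149637",
    "!_7203510": "ACG36197",
    "!_2958197": "NP_001151815",
    "!_3585127": "ACG44408",
}

outtree = "(((!_3585127,!_2958197),!_7203510),!_7662224);"
decoded = decode_text(outtree, mapping)
print(decoded)
# (((ACG44408,NP_001151815),ACG36197),NP_001149637);
```

Because the replacement works on plain text, the same approach applies to any output file that still contains the codes, not only to tree files.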
AUTHOR
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
frist@cc.umanitoba.ca
http://home.cc.umanitoba.ca/~frist