update February 17, 2021
uniqid.py - Encode sequence names by replacing them with a random numerical code, or decode names by restoring the original names.
uniqid.py [-f list_of_fields] [-s separator] [-nf string] --encode sourcein sourceencoded tsvout
uniqid.py [-f list_of_fields] [-s separator] [-nf string] --encodesame sourcein sourceencoded tsvin
uniqid.py [-f list_of_fields] [-s separator] [-nf string] --decode sourceencoded sourceout tsvin
Phylip programs truncate sequence names longer than 10 characters.
uniqid.py --encode can be run to replace sequence names with unique
numerical codes of 9 characters in length. A separate TSV file is
also created that maps the original names to the encoded names.
The encoded sequences can be used as input for Phylip programs, and
the TSV file can then be used to restore the original names in the
output files.
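As a rough illustration of the encoding step described above, the following Python sketch replaces each FASTA definition line with a random code and records the mapping. It is not the uniqid.py source. The function names encode_fasta and mapping_to_tsv, the idea that a 9-character code is the default "!%" string followed by 7 random digits, and the two-column TSV layout are all assumptions made for this example.

```python
import random


def encode_fasta(lines, prefix="!%", digits=7, rng=None):
    """Replace each FASTA definition line with a unique random ID.

    Returns (encoded_lines, mapping), where mapping is a list of
    (unique_id, original_definition_line) pairs in input order.
    The prefix + digits layout is an assumption, not taken from uniqid.py.
    """
    if rng is None:
        rng = random.Random()
    used = set()
    encoded, mapping = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Draw random codes until one turns up that has not been used yet.
            while True:
                uid = prefix + "".join(rng.choice("0123456789") for _ in range(digits))
                if uid not in used:
                    break
            used.add(uid)
            mapping.append((uid, line[1:]))
            encoded.append(">" + uid)
        else:
            # Sequence lines are passed through unchanged.
            encoded.append(line)
    return encoded, mapping


def mapping_to_tsv(mapping):
    """Format the mapping as TAB-separated text, one ID per line (assumed layout)."""
    return "".join(f"{uid}\t{name}\n" for uid, name in mapping)
```

Calling encode_fasta([">seq1 Homo sapiens", "ACGT"]) would return the encoded lines together with one (ID, definition line) pair, which mapping_to_tsv turns into the text of the mapping file.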
While this script was written for use with Phylip files, you can
decode ANY text file produced by any program or series of
programs, as long as the first step was to encode the sequence
names in the original fasta file.
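Decoding can be thought of as a search-and-replace over arbitrary text, which is why any downstream output file can be decoded. The Python sketch below shows that idea under stated assumptions: decode_text is a hypothetical helper, not a function from uniqid.py, and the IDs in the example are made up.

```python
import re


def decode_text(text, id_to_name):
    """Replace every known unique ID found in text with its original name."""
    if not id_to_name:
        return text
    # Try longer IDs first so that one ID can never shadow part of another.
    ids = sorted(id_to_name, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(uid) for uid in ids))
    return pattern.sub(lambda match: id_to_name[match.group(0)], text)


# Hypothetical mapping and a Newick tree that uses the encoded names.
names = {"!%0000001": "human", "!%0000002": "mouse"}
tree = "(!%0000001:0.1,!%0000002:0.2);"
decoded = decode_text(tree, names)
```

Because the substitution does not depend on the file format, the same function would apply to alignments, distance matrices or tree files alike.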
--encode (default) - The first three filenames on the command line are read as sourcein, the original source file; sourceencoded, the output sequences in which the description line is replaced with a unique ID; and tsvout, a tab-separated value (TSV) file containing the unique identifier and the corresponding definition line.
--encodesame - Encode another file, substituting in the same random names from a previous run using --encode. This makes it possible to encode two or more files using the same random names, so that all output files generated can be decoded with a single TSV file.
The first three filenames on the command line are read as sourcein, the original source file; sourceencoded, the output file in which the description line is replaced with a unique ID generated previously by --encode; and tsvin, the TSV file from that previous run, containing the unique identifier and the corresponding definition line.
--decode - The first three filenames on the command line are read as sourceencoded, any text file containing unique IDs generated from a previous run using --encode; sourceout, the output file in which the unique ID is replaced by the original name, or the name plus parts of the definition line; and tsvin, the TSV file generated by a previous run using --encode. If options -f, -s and -nf were set during encoding, the same options must be used when decoding.
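The idea behind --encodesame, reusing IDs from an earlier run, can be sketched in Python as follows. This is an illustration only. The helpers read_mapping and encode_same are hypothetical, the TSV is assumed to hold the ID in the first column and the definition line in the rest, and leaving unknown names untouched is a choice made for this example rather than documented behaviour of uniqid.py.

```python
def read_mapping(tsv_text, sep="\t"):
    """Parse mapping text into a {definition_line: unique_id} dictionary."""
    name_to_id = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        # Assumed layout: the ID, one separator, then the definition line.
        uid, _, definition = line.partition(sep)
        name_to_id[definition] = uid
    return name_to_id


def encode_same(lines, name_to_id):
    """Encode FASTA lines using IDs that were assigned in a previous run."""
    encoded = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">") and line[1:] in name_to_id:
            encoded.append(">" + name_to_id[line[1:]])
        else:
            # Sequence lines, and names with no earlier ID, pass through.
            encoded.append(line)
    return encoded


# Hypothetical mapping text from an earlier encoding run.
tsv = "!%0000001\tseq1 Homo sapiens\n!%0000002\tseq2 Mus musculus\n"
second_file = [">seq2 Mus musculus", "TTGA", ">seq9 unknown", "CC"]
result = encode_same(second_file, read_mapping(tsv))
```

Since both files end up with the same ID for the same sequence, a single mapping file is enough to decode output derived from either of them.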
-f list_of_fields - similar to -f in the Unix cut command. A comma-separated list of fields to be written to sourceout when decoding files. e.g. -f 1,2 would write out both fields 1 and 2. Default: -f 1
-s separator - separator is the character used to split a definition line into fields. Default: "\t", a TAB character. e.g. -s "," will create a comma-separated tsvfile
-nf string - (default: !%) String is one or more characters that begin the unique identifier with which the definition line is replaced. The string should be some set of 1 or more characters that are not expected to be found in the input file. Some Phylip programs appear to translate the underscore character '_' to a blank, which would prevent decoding. Don't use '_' as part of the string.
Note: Phylip programs don't allow special characters such as : ( ) in sequence names. Never use these characters as part of string.
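The way -f and -s work together can be pictured with a small Python sketch. It is only an approximation of the behaviour described above: select_fields is a hypothetical helper, fields are written in the order listed (Unix cut would keep file order), out-of-range field numbers are silently skipped, and the selected fields are rejoined with the same separator. All of these are assumptions.

```python
def select_fields(definition, fields="1", sep="\t"):
    """Return the requested 1-based fields of a definition line.

    fields is a comma-separated list such as "1,2"; sep is the
    character used to split the definition line into fields.
    """
    parts = definition.split(sep)
    wanted = [int(f) for f in fields.split(",") if f.strip()]
    # Keep only field numbers that actually exist on this line.
    return sep.join(parts[i - 1] for i in wanted if 1 <= i <= len(parts))


# Hypothetical definition lines, TAB-separated and comma-separated.
tab_line = "seq1\tHomo sapiens\tchr1"
comma_line = "seq1,Homo sapiens,chr1"
name_only = select_fields(tab_line)
name_and_species = select_fields(comma_line, "1,2", sep=",")
```

With the default settings only the first field, typically the sequence name, would be restored when decoding.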
To use these sequences with Phylip, run
phylcnv.py -inf fasta -outf pint
test2.uniq.pint is a Phylip interleaved sequence file that will
work with any Phylip sequence program.
If you ran protpars, the resultant parsimony tree would be
written to a file called outtree.
To restore the original sequence names,
uniqid.py --decode outtree test2.outtree
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2